Bielefeld University, Center of Biotechnology, Institute of Bioinformatics, BiBiServ
   
The FASTA Data Format
FASTA is one of the most widely used sequence formats. It has a very simple structure: a one-line header followed by lines of sequence data.
  • The header starts with a ">" symbol.
  • The first word on this line is the name of the sequence; the rest of the line is the description of the sequence.
  • The remaining lines contain the sequence itself in IUB/IUPAC single-letter codes.
  • Blank lines in a FASTA file are ignored, and so are spaces or other gap symbols (dashes, underscores, periods) in a sequence.
  • FASTA files containing multiple sequences follow the same rules, with one sequence listed right after another. This format is accepted by many multiple sequence alignment programs.
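
The rules above can be sketched as a small reader in Python. This is a minimal illustration, not a reference implementation: the function name `read_fasta`, the constant `GAP_CHARS`, and the test records are assumptions made for this example, and the set of gap symbols is taken directly from the list above.

```python
from typing import Iterable, Iterator, Tuple

# Characters dropped from sequence lines: spaces, tabs and the gap symbols
# named in the rules above (dash, underscore, period). Assumed set.
GAP_CHARS = " \t-_."
STRIP_TABLE = str.maketrans("", "", GAP_CHARS)


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, description, sequence) for each record in the input."""
    name = None
    description = ""
    chunks = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank lines are ignored
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if name is not None:
                yield name, description, "".join(chunks)
            parts = line[1:].strip().split(None, 1)
            name = parts[0] if parts else ""
            description = parts[1] if len(parts) > 1 else ""
            chunks = []
        elif name is not None:
            # Sequence line: remove spaces and gap symbols, keep the letters.
            chunks.append(line.translate(STRIP_TABLE))
    if name is not None:
        yield name, description, "".join(chunks)


# Made-up input with two records, a blank line and some gap symbols.
example = """\
>seq1 first test sequence
ACGT-ACGT

ac gt..
>seq2
MKV_LA
"""
records = list(read_fasta(example.splitlines()))
for record in records:
    print(record)
```

Because the function accepts any iterable of lines, an open file object can be passed in place of `example.splitlines()`. Note that programs differ in how strictly they treat gap symbols, so a real pipeline may need to keep them.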

The description line often depends on the database you downloaded the sequence from:

  • GenBank: gi|ginumber|gb|accession|locus
  • SwissProt: sp|accession|entry name
  • PIR: pir||entr
This is an example of a FASTA-formatted file downloaded from GenBank:
>gi|726297|gb|AAA64213.1| obesity protein
MCWRPLCRFLWLWSYLSYVQAVPIQKVQDDTKTLIKTIVTRINDISHTQSVSAKQRVTGLDFIPGLHPIL
SLSKMDQTLAVYQQVLTSLPSQNVLQIANDLENLRDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLY
STEVVALSRLQGSLQDILQQLDVSPEC
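
The database-specific identifier in the header can be taken apart by splitting on the "|" character. The sketch below is only an illustration based on the GenBank and SwissProt layouts listed above; the function name `parse_identifier` and the dictionary keys are assumptions chosen for this example, and any other prefix is returned as an unlabelled list of fields.

```python
def parse_identifier(name: str) -> dict:
    """Split a pipe-delimited sequence name into labelled fields."""
    fields = name.split("|")
    tag = fields[0]
    if tag == "gi":
        # GenBank layout: gi|ginumber|gb|accession|locus
        keys = ["gi_number", "database", "accession", "locus"]
    elif tag == "sp":
        # SwissProt layout: sp|accession|entry name
        keys = ["accession", "entry_name"]
    else:
        # Unknown layout: return the raw fields without labels.
        return {"source": tag, "fields": fields[1:]}
    values = fields[1:]
    # Pad with empty strings so that every key gets a value.
    values += [""] * (len(keys) - len(values))
    result = {"source": tag}
    result.update(zip(keys, values))
    return result


# Header line of the GenBank example above.
header = ">gi|726297|gb|AAA64213.1| obesity protein"
name, _, description = header[1:].partition(" ")
info = parse_identifier(name)
print(info)
print(description)
```

In this header the locus field after the last "|" is empty, so the sketch reports it as an empty string rather than omitting it.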