Bielefeld University, Center of Biotechnology, Institute of Bioinformatics, BiBiServ
   
EMBL Data Format
   
EMBL entries (such as the one shown below) are structured to be usable by human readers as well as by computer programs. Each entry in the database is composed of lines. Different types of lines, each with its own format, are used to record the various types of data that make up the entry. Some entries do not contain all of the line types, and some line types occur many times in a single entry.
Each entry begins with an identification line (ID) and ends with a terminator line (//). Consult the EMBL user manual for a more comprehensive guide.
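Since every entry runs from its ID line to the // terminator, a file holding several entries can be cut apart at the terminator lines. The following is a minimal Python sketch of that idea, not an official parser; the function name split_entries is hypothetical, and it assumes the input is an iterable of text lines.

```python
from typing import Iterable, Iterator, List


def split_entries(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield each entry as a list of lines, up to and including its '//' line."""
    entry: List[str] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not entry and not line.strip():
            continue  # skip blank lines between entries
        entry.append(line)
        if line.startswith("//"):
            # The terminator closes the current entry.
            yield entry
            entry = []
    if entry:
        # Trailing lines without a terminator are returned rather than dropped.
        yield entry
```

Reading a file lazily with `split_entries(open(path))` keeps only one entry in memory at a time, which may matter for large database dumps.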
General structure of an EMBL entry:
  • ID (IDentification line): always the first line of an entry. The general form of the ID line is: ID, entryname, dataclass, molecule, division, sequencelength
  • XX: contains no data or comments. It is used instead of blank lines to avoid confusion with the sequence data lines.
  • AC (Accession Number): lists the accession numbers associated with this entry.
  • SV (Sequence Version): contains the new format of the nucleotide sequence identifier.
  • DT (DaTe): shows when an entry first appeared in the database and when it was last updated.
  • DE (DEscription): contains general descriptive information about the sequence stored.
  • KW (KeyWord): provides information which can be used to generate cross-reference indexes of the sequence entries based on functional, structural, or other categories deemed important. The keywords chosen for each entry serve as a subject reference for the sequence, and will be expanded as work with the database continues. Often several KW lines are necessary for a single entry.
  • OS (Organism Species): specifies the preferred scientific name of the organism which was the source of the stored sequence.
  • OC (Organism Classification): contains the taxonomic classification of the source organism.
  • RN (Reference Number): gives a unique number to each reference citation within an entry.
  • RC (Reference Comment): optional line type which appears if the reference has a comment.
  • RP (Reference Position): optional line type which appears if one or more contiguous base spans of the presented sequence can be attributed to the reference in question.
  • RX (Reference Cross-reference): optional line type which contains a cross-reference to an external citation or abstract database.
  • RA (Reference Author): lists the authors of the paper (or other work) cited.
  • RT (Reference Title): gives the title of the paper (or other work).
  • RL (Reference Location): contains the conventional citation information for the reference.
  • DR (Database Cross-Reference): cross-references other databases which contain information related to the entry in which the DR line appears.
  • CC (Comment): free-text comments about the entry, which may be used to convey any sort of information thought to be useful.
  • FH (Feature Header): present only to improve readability of an entry when it is printed or displayed on a terminal screen. The lines contain no data and may be ignored by computer programs.
  • FT (Feature Table): provides a mechanism for the annotation of the sequence data. Regions or sites in the sequence which are of interest are listed in the table. A complete and definitive description of the feature table is given in the feature table definition document.
  • SQ (SeQuence header): marks the beginning of the sequence data and gives a summary of its content.
  • Sequence data lines: these lines have two blanks as their line code. The sequence is written 60 bases per line, in groups of 10 bases separated by a blank character, beginning in position 6 of the line. The direction listed is always 5' to 3'.
  • The // (terminator) line also contains no data or comments. It designates the end of an entry.
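The list above suggests a simple way to read an entry: take the two-character line code at the start of each line and collect the data that follows it. The sketch below does this in Python under a few assumptions: the data begins in position 6 of every line (as stated above for sequence lines and as seen in the example entry), and the function name group_by_line_code and the "sequence" key are hypothetical names chosen for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_line_code(entry_lines: Iterable[str]) -> Dict[str, List[str]]:
    """Collect the data part of each line under its two-letter line code."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in entry_lines:
        if not line.strip():
            continue  # ignore empty lines
        code = line[:2]
        if code in ("XX", "//"):
            continue  # spacer and terminator lines carry no data
        if code == "  ":
            code = "sequence"  # hypothetical key for the sequence data lines
        # Assumption: the data starts in position 6 (index 5) of the line.
        grouped[code].append(line[5:].rstrip())
    return dict(grouped)
```

Line types that occur several times, such as OC or FT, end up as lists with one item per line, in their original order.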
XX
AC   U22421;
XX
SV   U22421.1
XX
DT   13-APR-1995 (Rel. 43, Created)
DT   17-APR-2005 (Rel. 83, Last updated, Version 3)
XX
DE   Mus musculus obesity protein (ob) gene, complete cds.
XX
KW   .
XX
OS   Mus musculus (house mouse)
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia;
OC   Eutheria; Euarchontoglires; Glires; Rodentia; Sciurognathi; Muridae;
OC   Murinae; Mus.
XX
RN   [1]
RP   1-2235
RA   Chehab F.F., Lim M.E.;
RT   "Genomic organization and sequence of the mouse obesity gene";
RL   Unpublished.
XX
RN   [2]
RP   1-2235
RA   Chehab F.F., Lim M.E.;
RT   ;
RL   Submitted (09-MAR-1995) to the EMBL/GenBank/DDBJ databases.
RL   Farid F. Chehab, Laboratory Medicine, University of California, San
RL   Francisco, 505 Parnassus Avenue, San Francisco, CA 94143-0134, USA
XX
DR   MGI; 104663; Lep.
XX
FH   Key             Location/Qualifiers
FH
FT   source          1..2235
FT                   /chromosome="6"
FT                   /db_xref="taxon:10090"
FT                   /mol_type="genomic DNA"
FT                   /organism="Mus musculus"
FT                   /strain="C57BL/6J"
FT   CDS             join(1..144,1876..2235)
FT                   /codon_start=1
FT                   /db_xref="GOA:P41160"
FT                   /db_xref="HSSP:1AX8"
FT                   /db_xref="InterPro:IPR000065"
FT                   /db_xref="InterPro:IPR009079"
FT                   /db_xref="UniProt/Swiss-Prot:P41160"
FT                   /gene="ob"
FT                   /product="obesity protein"
FT                   /protein_id="AAA64213.1"
FT                   /translation="MCWRPLCRFLWLWSYLSYVQAVPIQKVQDDTKTLIKTIVTRINDI
FT                   SHTQSVSAKQRVTGLDFIPGLHPILSLSKMDQTLAVYQQVLTSLPSQNVLQIANDLENL
FT                   RDLLHLLAFSKSCSLPQTSGLQKPESLDGVLEASLYSTEVVALSRLQGSLQDILQQLDV
FT                   SPEC"
FT   intron          145..1875
FT   repeat_region   449..585
FT   misc_feature    1876..1879
FT                   /note="slippage of acceptor site results in inclusion or
FT                   exclusion of glutamine at amino acid position 49"
FT                   /gene="ob"
XX
SQ   Sequence 2235 BP; 568 A; 571 C; 547 G; 549 T; 0 other;
     atgtgctgga gacccctgtg tcggttcctg tggctttggt cctatctgtc ttatgttcaa        60
     gcagtgccta tccagaaagt ccaggatgac accaaaaccc tcatcaagac cattgtcacc       120
     aggatcaatg acatttcaca cacggtagga gtcttatggg gggacaaaga tgtaggacta       180
     gaaccagagt ctgagaaaca tgtcatgcac ctcctagaag ctgagagttt ataagcctcg       240
     agtgtacatt atctctggtc atggctcttg tcactgctgc ctgctgaaat acagggctga       300
     gtggttccat ttctaaaccc agcatctaga ctgctcagct gtactgccag tatcgcatga       360
     ttctaatcct aagccacctt agggaattta acttctctct tatactccca ttaagaaacc       420
     ataaggtgtc gggcgtggtg gcacatgccc tctaatccca gaactcggga ggcagaggca       480
     ggtggatttc tgagttcaag gccagcctgg tctacaaaat gagttccagg acagccaggg       540
     ctatacagag aaaccctgtc tcgaaaaacc aaaaaagaag ccataaggtt ctttgatatc       600
     ataaggccat gctcattttc cctctgccac aggaaaccca gcccttggtg gctagctgag       660
     catgtaaggt acacatcaga cctgggagaa cctgggttcc tccctgcttc cacagaccac       720
     cctctcccct tccttagccc cctgtttctg cctctctcat tctctttcat ccatgaaact       780
     acttccttga atttagtacc cagggcataa gaatccctaa aggtcatggt gtcccattga       840
     cacgtggaca gcttcccagc agtgtctcta ctgggcagga ggagcagtag gtttctaatg       900
     gtttttagct acagcttctg cccaccgctc acccactttt caaagtcaca cagaaaacaa       960
     cctttccctc ctttacaacc agtccttgtg tagctgctga tagtggtcgg tgcccaccat      1020
     gttcttcctc cgaggcccag cagcctacat cttcagccat ttcctcagat gtatctaagc      1080
     tatgtgcata tcaccatatc tgcttcccat ctgcaagatc ttaggccagt tctccggtgg      1140
     gttttaaacc ttgattttac catcttgatg agggagacat catatcatat caccaagttg      1200
     ttctaaggct taaatggggt gtagtgaaag actttcttgt ggagccatct ggagactact      1260
     atgtctcctg accagtgtgc gtgtctcaca gtgtggcctt ggcagctagg agaagtcaga      1320
     tattcagaat caagggacag cttaatataa gagacttatg cggagaaagt tctcatcatc      1380
     tctcgacaag agtcatcagg gctgcacatg gagaggccca actacccaaa tgtgggtgga      1440
     aatgagagga agccagtggg gaaagccctt cctggtaacc agactcagca gagtgggggg      1500
     ggggggcacg gctttgaccc taatgaggga gaaccacaga agagtatgac taggagggag      1560
     agatctgata agggcaggag gctagagaga atataaggaa taaagagcta tggctggttc      1620
     ttcacggata tcattggaga aaggaattac tcaagactaa tcagaagtga agggtggagt      1680
     gactcggaat gatcagaaag tccgggagac cagctccgtg gcttccagtc agctgatgac      1740
     aggaagtaag gacctggacc aggaaggtga gaaggaagga ggtagcccag gttcacagat      1800
     gtaatgtaga gctctggagc ccgatgctcc ctgccacttg ctaaaacacc tcttgttctt      1860
     cttcctcctc catatcagtc ggtatccgcc aagcagaggg tcactggctt ggacttcatt      1920
     cctgggcttc accccattct gagtttgtcc aagatggacc agactctggc agtctatcaa      1980
     caggtcctca ccagcctgcc ttcccaaaat gtgctgcaga tagccaatga cctggagaat      2040
     ctccgagacc tcctccatct gctggccttc tccaagagct gctccctgcc tcagaccagt      2100
     ggcctgcaga agccagagag cctggatggc gtcctggaag cctcactcta ctccacagag      2160
     gtggtggctt tgagcaggct gcagggctct ctgcaggaca ttcttcaaca gttggatgtt      2220
     agccctgaat gctga                                                       2235
//
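The SQ header and the sequence data lines of an entry like the one above can be checked against each other. The following Python sketch reads the length declared in the SQ line and rebuilds the sequence from the data lines; the function name read_sequence is hypothetical, and the code assumes the SQ line has the form "Sequence <n> BP;" as in the example.

```python
import re
from typing import Iterable, Optional, Tuple


def read_sequence(entry_lines: Iterable[str]) -> Tuple[Optional[int], str]:
    """Return (length declared in the SQ line, sequence from the data lines)."""
    declared: Optional[int] = None
    chunks = []
    in_sequence = False
    for line in entry_lines:
        if line.startswith("SQ"):
            match = re.search(r"Sequence\s+(\d+)\s+BP", line)
            if match:
                declared = int(match.group(1))
            in_sequence = True
        elif line.startswith("//"):
            break  # terminator: end of the entry
        elif in_sequence and line.startswith("  "):
            # Keep letters only: this drops the blanks between the groups
            # of 10 bases and the position counter at the end of the line.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return declared, "".join(chunks)
```

For the entry shown above, the declared length is 2235, and the rebuilt sequence should have the same length; a mismatch would point to a truncated or damaged entry.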