Bielefeld University, Center of Biotechnology, Institute of Bioinformatics, BiBiServ
   
XML Data Formats
The wide variety of data resources we have seen in the previous chapters has been developed to support biological research. Unfortunately, different databases use different data formats, which makes it difficult to interconnect their information computationally. The Extensible Markup Language (XML) offers a way to describe and serve data in a uniform, automatically parseable format.
XML is widely used to present data on the web. Its basic representation is plain text, so the same document can be read and written by programs in different programming languages, such as Java, Perl, or C/C++. This makes XML an open, vendor-neutral solution for moving data between applications.
"XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure." [W3C] For a detailed specification see http://www.w3.org/TR/REC-xml.
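To make the distinction between markup and character data concrete, here is a minimal Python sketch using the standard library module xml.etree.ElementTree. The tiny document and its element names (gene, name, organism) are invented for illustration and do not belong to any real bioinformatics format. Note that this parser only checks well-formedness; it does not validate a document against the constraints of a DTD.

```python
import xml.etree.ElementTree as ET

# Hypothetical mini document: the tags are markup, the text between
# them is character data.
DOC = "<gene><name>leptin</name><organism>Mus musculus</organism></gene>"

root = ET.fromstring(DOC)

# Pair each child's tag (markup) with its text (character data).
pairs = [(child.tag, child.text) for child in root]


def is_well_formed(text):
    """Return True if text parses as XML, False otherwise.

    This checks well-formedness only (properly nested, closed tags),
    not validity against a DTD.
    """
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


for tag, text in pairs:
    print(tag, "->", text)

# The second document closes <gene> before <name>, so it is rejected.
print(is_well_formed(DOC))
print(is_well_formed("<gene><name>leptin</gene>"))
```

A parser rejects a document whose tags are not properly nested, which is what makes XML reliably machine-readable.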
   
XML and Bioinformatics
Different databases and programs use different XML representations of their data. See Paul Gordon's XML for Molecular Biology web page for an overview of bioinformatics-specific XML definitions.
Here is the GenBank leptin entry (accession U22421) in TinySeq XML format:
<?xml version="1.0"?>
<!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN" 
 "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_gi>726296</TSeq_gi>
  <TSeq_accver>U22421.1</TSeq_accver>
  <TSeq_taxid>10090</TSeq_taxid>

  <TSeq_orgname>Mus musculus</TSeq_orgname>
  <TSeq_defline>Mus musculus obesity protein...</TSeq_defline>
  <TSeq_length>2235</TSeq_length>
  <TSeq_sequence>ATGTGCTGGAGACCCCTGTGTCGGTTC...</TSeq_sequence>
  
</TSeq>
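A record like this can be read with a few lines of code. The following Python sketch uses the standard library module xml.etree.ElementTree; the function name parse_tseq and the dictionary keys are our own choices, not part of any NCBI tool. The record is copied from the listing above, including its truncated defline and sequence, and the DOCTYPE line is omitted so that the parser never has to deal with the external DTD.

```python
import xml.etree.ElementTree as ET

# TinySeq record from the listing above (DOCTYPE omitted; the "..."
# truncations in the defline and sequence are kept as shown).
TSEQ_XML = """<?xml version="1.0"?>
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_gi>726296</TSeq_gi>
  <TSeq_accver>U22421.1</TSeq_accver>
  <TSeq_taxid>10090</TSeq_taxid>
  <TSeq_orgname>Mus musculus</TSeq_orgname>
  <TSeq_defline>Mus musculus obesity protein...</TSeq_defline>
  <TSeq_length>2235</TSeq_length>
  <TSeq_sequence>ATGTGCTGGAGACCCCTGTGTCGGTTC...</TSeq_sequence>
</TSeq>"""


def parse_tseq(xml_text):
    """Extract selected fields of a single TSeq record into a dict.

    parse_tseq is a hypothetical helper written for this example.
    """
    root = ET.fromstring(xml_text)
    return {
        # The sequence type is stored in an attribute, not in element text.
        "seqtype": root.find("TSeq_seqtype").get("value"),
        "accver": root.findtext("TSeq_accver"),
        "taxid": int(root.findtext("TSeq_taxid")),
        "orgname": root.findtext("TSeq_orgname"),
        "length": int(root.findtext("TSeq_length")),
        "sequence": root.findtext("TSeq_sequence"),
    }


record = parse_tseq(TSEQ_XML)
print(record["accver"], record["orgname"], record["length"])
```

Because every field sits in its own named element, no format-specific column or keyword parsing is needed, in contrast to flat-file formats such as GenBank or EMBL.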