Sequence Analysis with Distributed Resources - Data Formats

		XML Data Formats
		The wide variety of data resources we have seen in the previous chapters has been developed to support biological research. Unfortunately, different databases use different data formats, which makes it difficult to interconnect their information computationally. The Extensible Markup Language (XML) offers a way to serve and describe data on the web in a uniform, automatically parseable format. Its basic representation is plain text, so it provides an open, platform-independent means of moving data between programs written in different languages, such as Java, Perl, and C/C++.
		"XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document's storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure." [W3C] For a detailed specification see http://www.w3.org/TR/REC-xml.
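		The distinction between markup and character data in the quotation above can be made concrete with a small sketch. The document below is a hypothetical example (its element names are invented for illustration and do not belong to any real database schema); the parser is the one in Python's standard library.

```python
import xml.etree.ElementTree as ET

# Hypothetical toy document: the tags are markup, the text between them
# is character data. These element names are assumptions for illustration.
doc = "<gene><name>leptin</name><organism>Mus musculus</organism></gene>"

root = ET.fromstring(doc)

# Markup: the element names that describe the logical structure.
markup = [element.tag for element in root.iter()]

# Character data: the text content carried inside the elements.
chardata = [element.text for element in root.iter()
            if element.text and element.text.strip()]

print("markup:        ", markup)
print("character data:", chardata)
```

		The parser separates the two kinds of content on its own, which is what makes XML convenient for automated processing.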
		XML and Bioinformatics
		Different databases and programs use different XML representations of their data. See Paul Gordon's XML for Molecular Biology web page for an overview of bioinformatics-specific XML definitions.
Here is the GenBank leptin entry (accession: U22421) in TinySeq XML format:

<?xml version="1.0"?>
<!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_gi>726296</TSeq_gi>
  <TSeq_accver>U22421.1</TSeq_accver>
  <TSeq_taxid>10090</TSeq_taxid>
  <TSeq_orgname>Mus musculus</TSeq_orgname>
  <TSeq_defline>Mus musculus obesity protein...</TSeq_defline>
  <TSeq_length>2235</TSeq_length>
  <TSeq_sequence>ATGTGCTGGAGACCCCTGTGTCGGTTC...</TSeq_sequence>
</TSeq>
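
As a minimal sketch of how a TinySeq record like this one might be read programmatically, the following Python example parses the entry with the standard library's xml.etree.ElementTree module. The helper name parse_tinyseq and the dictionary layout are assumptions made for illustration, not part of any NCBI tool. The definition line and the sequence are kept truncated ("...") exactly as shown above, so the sequence string here is shorter than the declared length.

```python
import xml.etree.ElementTree as ET

# The TinySeq entry from the text, with the same truncated ("...") fields.
TINYSEQ_XML = """<?xml version="1.0"?>
<!DOCTYPE TSeq PUBLIC "-//NCBI//NCBI TSeq/EN" "http://www.ncbi.nlm.nih.gov/dtd/NCBI_TSeq.dtd">
<TSeq>
  <TSeq_seqtype value="nucleotide"/>
  <TSeq_gi>726296</TSeq_gi>
  <TSeq_accver>U22421.1</TSeq_accver>
  <TSeq_taxid>10090</TSeq_taxid>
  <TSeq_orgname>Mus musculus</TSeq_orgname>
  <TSeq_defline>Mus musculus obesity protein...</TSeq_defline>
  <TSeq_length>2235</TSeq_length>
  <TSeq_sequence>ATGTGCTGGAGACCCCTGTGTCGGTTC...</TSeq_sequence>
</TSeq>"""


def parse_tinyseq(xml_text):
    """Hypothetical helper: turn one <TSeq> record into a plain dict."""
    root = ET.fromstring(xml_text)  # the DOCTYPE is read but the DTD is not fetched
    # Most fields are simple elements whose value is their text content.
    record = {child.tag: child.text for child in root}
    # The sequence type is stored in an attribute, not as element text.
    record["TSeq_seqtype"] = root.find("TSeq_seqtype").get("value")
    # Numeric fields arrive as strings and are converted explicitly.
    record["TSeq_length"] = int(record["TSeq_length"])
    record["TSeq_taxid"] = int(record["TSeq_taxid"])
    return record


record = parse_tinyseq(TINYSEQ_XML)
print(record["TSeq_accver"], record["TSeq_orgname"], record["TSeq_length"])
```

Because every field is tagged by name, the same few lines work for any TinySeq record, with no format-specific line parsing.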