BiBiServ - Bielefeld University Bioinformatic Service

E2G - Manual

Upload of genomic sequence

The genomic sequence must be in FASTA format; otherwise E2G returns an error. Only a single sequence can be uploaded. By default, the sequence is masked against simple repeats. If the user wants to upload an already masked sequence, repetitive regions must be masked as lowercase characters. Only uppercase characters are used during the matching procedure.

Note: not masking your sequence can result in many thousands or even millions of hits, depending on the length of your sequence and the matching parameters! Please make sure that your sequence does not include repeats if you choose not to mask your genomic sequence.

Because of potentially long upload times, the maximum size for uploaded sequence data is currently 5 MB.
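The upload rules above (FASTA format, exactly one sequence, lowercase meaning masked, 5 MB limit) can be checked locally before submitting. The following is a minimal sketch, not part of E2G: the function name check_genomic_upload is hypothetical, and reading "5 MB" as 5 * 1024 * 1024 bytes is an assumption.

```python
# Assumption: "5 MB" is interpreted as 5 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024


def check_genomic_upload(fasta_text: str) -> dict:
    """Check a genomic upload against the rules described in this manual.

    Raises ValueError if the text is too large, is not FASTA, or does not
    contain exactly one sequence. Returns basic masking statistics.
    """
    size = len(fasta_text.encode("utf-8"))
    if size > MAX_UPLOAD_BYTES:
        raise ValueError(f"upload is {size} bytes, limit is {MAX_UPLOAD_BYTES}")

    lines = [line.strip() for line in fasta_text.splitlines() if line.strip()]
    if not lines or not lines[0].startswith(">"):
        raise ValueError("input does not look like FASTA (no '>' header line)")

    headers = [line for line in lines if line.startswith(">")]
    if len(headers) != 1:
        raise ValueError(f"expected exactly one sequence, found {len(headers)}")

    sequence = "".join(lines[1:])
    masked = sum(1 for base in sequence if base.islower())
    return {
        "length": len(sequence),
        "masked": masked,                    # lowercase characters (masked)
        "unmasked": len(sequence) - masked,  # characters used for matching
    }
```

A sequence that reports zero masked characters has probably not been masked yet, which is the situation the note above warns about.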

RepeatMasker

By default, the uploaded genomic sequence is scanned for simple repeats and low-complexity regions using RepeatMasker.

EST/cDNA Set

The user has two options for matching the uploaded genomic sequence against an EST set:
  • the sequence can be matched against a pre-indexed EST set (currently Mouse or Human)
  • a set of ESTs can be uploaded (in multiple FASTA format) and used for matching


Annotation format

The start and end positions of each gene, followed by its name, should be listed on one line. A greater than (>) or less than (<) sign should be placed at the beginning of this line to indicate the plus strand or the minus strand; the coordinates are always given with respect to the plus strand. The exons should then be listed individually, one per line, with the word "exon" after the start and end positions of each exon. UTRs are annotated in the same way as exons, with the word "utr" replacing "exon". For example:
< 106481 116661 unknown 
106481 106497 utr 
107983 108069 exon 
109884 110033 exon 
111865 112023 exon 
114352 114562 exon 
116587 116661 utr

> 39424 42368 GDF 9 
39424 39820 exon 
41401 42368 exon

< 77817 81088 hypothetical 
77817 78820 utr 
79538 80107 exon 
80193 80334 exon 
80435 80707 exon 
80829 81088 exon 
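
To check an annotation file before uploading it, the format can be parsed with a few lines of Python. This is a minimal sketch and not the parser used by E2G; the names Gene and parse_annotation are hypothetical. It assumes that the strand sign is separated from the coordinates by whitespace, as in the example above, and it reads ">" as the plus strand and "<" as the minus strand.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    strand: str   # "+" for a '>' line, "-" for a '<' line (assumed mapping)
    start: int
    end: int
    name: str
    features: list = field(default_factory=list)  # (start, end, kind) tuples


def parse_annotation(text: str) -> list:
    """Parse gene lines and their exon/utr lines into Gene objects."""
    genes = []
    for raw in text.splitlines():
        parts = raw.split()
        if not parts:
            continue  # blank lines separate genes in the example
        if parts[0] in (">", "<"):
            if len(parts) < 3:
                raise ValueError(f"gene line needs start and end: {raw!r}")
            strand = "+" if parts[0] == ">" else "-"
            name = " ".join(parts[3:])  # names may contain spaces, e.g. "GDF 9"
            genes.append(Gene(strand, int(parts[1]), int(parts[2]), name))
        else:
            if not genes:
                raise ValueError(f"feature line before any gene line: {raw!r}")
            if len(parts) != 3 or parts[2].lower() not in ("exon", "utr"):
                raise ValueError(f"expected 'start end exon|utr': {raw!r}")
            genes[-1].features.append(
                (int(parts[0]), int(parts[1]), parts[2].lower())
            )
    return genes
```

Each returned Gene carries its features in file order, so a quick sanity check is to confirm that every feature lies within the start and end of its gene.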


Match parameters

  • leastscore - specify the minimum score of a match. Matching bases are scored 2, mismatches -1, and indels -2.
  • identity - specify the minimum identity of a match, in the range 80% to 100%.
  • exdrop - specify the Xdrop score used when extending a seed in both directions, allowing for matches, mismatches, insertions, and deletions.
  • length - specify the minimum length of a hit.
  • seedlength - specify the seed length, i.e. the length of the perfect match that seeds the alignment.
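
To get a feeling for how leastscore, identity, and length interact, the scoring scheme can be worked through by hand. The sketch below uses the scores stated above (match 2, mismatch -1, indel -2). Two points are assumptions rather than documented behaviour: identity is computed as matches divided by alignment columns, and hit length is taken to be the number of alignment columns. All function names are hypothetical.

```python
MATCH_SCORE = 2      # from the manual
MISMATCH_SCORE = -1  # from the manual
INDEL_SCORE = -2     # from the manual


def alignment_score(matches: int, mismatches: int, indels: int) -> int:
    """Score of an alignment under the scheme described for leastscore."""
    return (MATCH_SCORE * matches
            + MISMATCH_SCORE * mismatches
            + INDEL_SCORE * indels)


def percent_identity(matches: int, mismatches: int, indels: int) -> float:
    """Assumed definition: matching columns / all alignment columns, in %."""
    columns = matches + mismatches + indels
    if columns == 0:
        raise ValueError("empty alignment")
    return 100.0 * matches / columns


def passes_thresholds(matches: int, mismatches: int, indels: int,
                      leastscore: int, identity: float, length: int) -> bool:
    """True if a hit satisfies all three minimum thresholds."""
    columns = matches + mismatches + indels  # assumed meaning of hit length
    return (alignment_score(matches, mismatches, indels) >= leastscore
            and percent_identity(matches, mismatches, indels) >= identity
            and columns >= length)
```

For example, a hit with 95 matches, 3 mismatches, and 2 indels scores 2*95 - 3 - 2*2 = 183 and has 95% identity under the assumed definition.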


Filter options

By default, all ESTs with only a single hit of length less than 200 are discarded. Such hits are usually caused by repeats or low-complexity regions missed by RepeatMasker, or by contamination in the cDNA libraries. The maximum number of matches shown for one query can be hard-limited separately for cDNAs and ESTs. The user can turn off these features or decrease or increase the values; however, depending on the matching parameters, this can generate many thousands of spurious hits, which will all be displayed in the web interface.
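
The default filter can be illustrated with a short sketch. This is not E2G's implementation: the function name filter_hits and the (est_id, hit_length) tuple layout are hypothetical, and keeping the first max_hits entries in input order is an assumption, since the manual does not say which matches survive the hard limit.

```python
from collections import defaultdict


def filter_hits(hits, min_single_hit_length=200, max_hits=None):
    """Drop ESTs whose only hit is shorter than min_single_hit_length.

    hits is a list of (est_id, hit_length) tuples. If max_hits is given,
    at most that many hits are returned (assumed: first in input order).
    Call once for ESTs and once for cDNAs to limit them separately.
    """
    lengths_by_est = defaultdict(list)
    for est_id, hit_length in hits:
        lengths_by_est[est_id].append(hit_length)

    kept = [
        (est_id, hit_length)
        for est_id, hit_length in hits
        if not (len(lengths_by_est[est_id]) == 1
                and hit_length < min_single_hit_length)
    ]
    if max_hits is not None:
        kept = kept[:max_hits]
    return kept
```

Note that an EST with several short hits is kept, because the rule only applies to ESTs with a single hit.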

FASTA format

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data.
  • The description line starts with a greater than symbol (">").
  • The word immediately following the greater than symbol (">") is the "ID" (name) of the sequence; the rest of the line is the description.
  • The "ID" and the description are optional.
  • All lines of text should be shorter than 80 characters (normally 60).
  • A sequence ends when another greater than symbol (">") appears at the beginning of a line, where the next sequence begins.
The following example contains one sequence (sequence_1):
>sequence_1
caagcacagaaacctatggcataaatccctctgagacgcgttgtactatggttatctaat
tctccggcgacacaagttgtctaaccgtgatcaccttaaagggcaagccgcccaatagat
gttagttaatactacgtaccaagtatgcctgcgcttggtaaagccgcctgtccatagttc
tactagggtagagcttcaggatgctctatagttcgagcggttctttgatcaactcgacta
gctaccaccatgtctgtgttttattgcacgcaaagtcgtaagtttaaacggaccaagaag
ccttcttcggtcagtagcaggttaagggccaagtacaagcctctccaggaatgcttaacg
gcatcgatgcaacttggacaagtaaacatcctgaagctta
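
The rules above translate directly into a small reader for single or multiple FASTA input. The following is a minimal sketch for checking your own files; the function names are hypothetical, and the code is not taken from E2G.

```python
def parse_fasta(text: str) -> list:
    """Return a list of (seq_id, description, sequence) tuples."""
    records = []
    header = None
    chunks = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new '>' line ends the previous sequence, if there is one.
            if header is not None:
                records.append(_make_record(header, chunks))
            header = line[1:]
            chunks = []
        elif header is None:
            raise ValueError("sequence data before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def _make_record(header: str, chunks: list) -> tuple:
    # The word directly after '>' is the ID; the rest is the description.
    seq_id, _, description = header.partition(" ")
    return seq_id, description.strip(), "".join(chunks)
```

Calling parse_fasta on the example above yields one record whose ID is sequence_1 and whose description is empty.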


E2G web interface

Screenshot of the E2G web interface (full version)
The figure above shows a screenshot of the graphical overview produced by E2G when uploading a 16.5 kbp genomic sequence from M. musculus (GenBank GI: 28515921, bases 60,000-76,500) and comparing it to 4.1 million ESTs from the same species. (Click on the image to see the original output.) The overview is split into five panels, arranged from top to bottom:
  1. General Information Panel: The top of the window provides general information about the current task. The user can zoom into a region of interest within the submitted genomic sequence. The positions of the highlighted matches in the EST and the genomic sequence are displayed. This part of the overview also provides links to download the sequences or GI numbers of matching ESTs.
  2. Annotation Panel: The second section of the window shows gene predictions for the genomic sequence, as uploaded by the user (orange colored) and delivered by GenScan (blue colored). If the prediction refers to the forward strand, then the exons are shown above the line representing the genome, otherwise below.
  3. cDNA Mapping Panel: cDNA matches on the genomic sequence are shown as colored blocks. Forward matches are shown in green, reverse complemented matches are shown in red.
  4. EST Mapping Panel: EST matches on the genomic sequence are shown in the same way as cDNA matches. The two kinds of matches are separated since cDNAs are usually of higher quality and thus matches to the genomic sequence are more reliable.
  5. Mapping Summary Panel: The bottom panel provides a summary of all matches, shown as colored boxes. The color code represents the coverage of a region, i.e. the relative number of matches in the region. For example, in the figure above, regions with high coverage are represented by red boxes and regions with low coverage by blue boxes.
The GenScan and uploaded annotations from the annotation panel can be superimposed on the cDNA and EST matches by dragging a transparent image over the lower part of the window. For web browsers that do not support JavaScript mouse events, the up/down buttons in the upper right corner of the window provide the same functionality. The transparent image allows the user to compare the gene prediction with the matches found. Clicking on a match opens a popup window showing an alignment (computed by Vmatch) between this individual region of the EST and the genomic sequence. The alignment is supplemented by additional information such as positions in the genomic sequence and in the EST, scores, identity values, and E-values (see Figure, bottom). Additionally, whenever the button on the right is clicked, sim4 is run to produce a spliced alignment over the whole EST sequence.
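
The coverage shown in the Mapping Summary Panel, described above as the relative number of matches in a region, can be reproduced from a list of match intervals. The sketch below is an illustration only: the function names are hypothetical, coordinates are assumed to be 1-based and inclusive, and scaling by the maximum coverage is an assumed way of making the values relative, not E2G's documented color mapping.

```python
def coverage_per_position(length: int, matches: list) -> list:
    """Count how many matches cover each genomic position.

    matches is a list of (start, end) pairs, 1-based and inclusive
    (assumed). Returns a list of `length` counts for positions 1..length.
    """
    diff = [0] * (length + 2)  # difference array, indexed by position
    for start, end in matches:
        start = max(start, 1)
        end = min(end, length)
        if start > end:
            continue  # match lies entirely outside the sequence
        diff[start] += 1
        diff[end + 1] -= 1

    coverage = []
    running = 0
    for position in range(1, length + 1):
        running += diff[position]
        coverage.append(running)
    return coverage


def relative_coverage(coverage: list) -> list:
    """Scale counts to the range 0..1 by the maximum count (assumed)."""
    peak = max(coverage, default=0)
    if peak == 0:
        return [0.0] * len(coverage)
    return [count / peak for count in coverage]
```

Values near 1 would correspond to the high-coverage (red) regions and values near 0 to the low-coverage (blue) regions in the figure.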

Tue Dec 22 11:13:47 2009