E2G - Manual
The genomic sequence must be in FASTA format; otherwise E2G
returns an error. Only a single sequence can be uploaded. By
default, the sequence is masked against simple repeats. If the
user wants to upload an already masked sequence, repetitive
regions have to be masked as lowercase characters. Only
uppercase characters are used during the matching
procedure.
Note: not masking your sequence can result in many
thousands or even millions of hits, depending on the length of
your sequence and the matching parameters! If you choose not to
mask your genomic sequence, please make sure that it does not
contain repeats.
Because upload times can be long, the maximum size for
uploaded sequence data is currently 5 MB.
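The upload requirements above (a single FASTA sequence, lowercase for masked regions, at most 5 MB) can be checked locally before submitting. The following is a minimal sketch, not part of E2G: the function name `check_genomic_upload` is hypothetical, and reading "5 MB" as 5 * 1024 * 1024 bytes is an assumption.

```python
# Hypothetical pre-upload check; E2G itself may validate input differently.
MAX_UPLOAD_BYTES = 5 * 1024 * 1024  # assumption: "5 MB" interpreted as 5 MiB


def check_genomic_upload(fasta_text: str) -> dict:
    """Summarise a FASTA text against the upload rules described in the manual."""
    lines = [ln.strip() for ln in fasta_text.splitlines() if ln.strip()]
    headers = [ln for ln in lines if ln.startswith(">")]
    sequence = "".join(ln for ln in lines if not ln.startswith(">"))

    # Uppercase characters are the ones used for matching; lowercase is masked.
    unmasked = sum(1 for base in sequence if base.isupper())

    return {
        "single_sequence": len(headers) == 1 and lines[0].startswith(">"),
        "within_size_limit": len(fasta_text.encode("utf-8")) <= MAX_UPLOAD_BYTES,
        "length": len(sequence),
        "unmasked_bases": unmasked,
        "masked_bases": len(sequence) - unmasked,
    }


if __name__ == "__main__":
    report = check_genomic_upload(">g1\nACGTacgtNN\nacGT\n")
    print(report)
```

A report with a very low masked fraction on a long sequence is a hint that the sequence has not been masked yet.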
By default, the uploaded
genomic sequence is scanned for simple repeats and
low-complexity regions using RepeatMasker.
The user has two options for matching the uploaded genomic
sequence against an EST set:
- the sequence can be matched against a pre-indexed EST set
(currently Mouse or Human)
- a set of ESTs can be uploaded (in multi-FASTA format)
and used for matching
The start and end positions of each gene, together with its
name, should be listed on one line. This line should begin with
a greater-than (>) or less-than (<) sign to indicate the plus
strand or minus strand, although the numbering
should always be according to the plus strand. The exons should
then be listed individually, each with its start and end
followed by the word "exon". UTRs are annotated the same way as
exons, with the word "utr" replacing "exon". For example:
< 106481 116661 unknown
106481 106497 utr
107983 108069 exon
109884 110033 exon
111865 112023 exon
114352 114562 exon
116587 116661 utr
> 39424 42368 GDF 9
39424 39820 exon
41401 42368 exon
< 77817 81088 hypothetical
77817 78820 utr
79538 80107 exon
80193 80334 exon
80435 80707 exon
80829 81088 exon
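The annotation format shown above can be read with a few lines of code, which is handy for checking a file before uploading it. This is a minimal sketch rather than the parser E2G uses: the names `Gene` and `parse_annotation` are hypothetical, and it assumes that ">" marks the plus strand and "<" the minus strand.

```python
from dataclasses import dataclass, field


@dataclass
class Gene:
    strand: str          # "+" or "-" (assumption: ">" is plus, "<" is minus)
    start: int
    end: int
    name: str
    features: list = field(default_factory=list)  # (start, end, kind) tuples


def parse_annotation(text: str) -> list:
    """Parse gene lines ('>' or '<' prefix) followed by exon/utr lines."""
    genes = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line[0] in "<>":
            # Gene line: sign, start, end, then a free-text name.
            parts = line[1:].split()
            strand = "+" if line[0] == ">" else "-"
            genes.append(Gene(strand, int(parts[0]), int(parts[1]),
                              " ".join(parts[2:])))
        else:
            # Feature line: start, end, then "exon" or "utr".
            if not genes:
                raise ValueError("feature line before any gene line: " + line)
            parts = line.split()
            kind = parts[2].lower()
            if kind not in ("exon", "utr"):
                raise ValueError("unknown feature type: " + parts[2])
            genes[-1].features.append((int(parts[0]), int(parts[1]), kind))
    return genes


if __name__ == "__main__":
    example = "> 39424 42368 GDF 9\n39424 39820 exon\n41401 42368 exon\n"
    for gene in parse_annotation(example):
        print(gene.name, gene.strand, gene.features)
```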
- leastscore - specifies the minimum score of a match.
Matching bases are scored 2, mismatches are scored -1, and
indels are scored -2.
- identity - specifies the minimum identity of a match,
in the range 80%-100%
- exdrop - specifies the X-drop score used when extending a
seed in both directions, allowing for matches, mismatches,
insertions, and deletions
- length - specifies the minimum length of a hit
- seedlength - specifies the seed length, i.e. the length
of the perfect match that seeds the alignment
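To get a feeling for the leastscore and identity thresholds, the scoring scheme listed above (match 2, mismatch -1, indel -2) can be applied to a small gapped alignment by hand or with the sketch below. The function name `score_alignment` is hypothetical, gaps are assumed to be written as "-", and identity is assumed here to be the fraction of matching alignment columns; the matching program used by E2G may define identity differently.

```python
MATCH, MISMATCH, INDEL = 2, -1, -2  # scores as listed in the manual


def score_alignment(aligned_genomic: str, aligned_est: str):
    """Return (score, identity in percent) for two gapped, equal-length rows."""
    if len(aligned_genomic) != len(aligned_est):
        raise ValueError("aligned rows must have the same length")

    score = 0
    matches = 0
    for g, e in zip(aligned_genomic, aligned_est):
        if g == "-" or e == "-":
            score += INDEL          # insertion or deletion
        elif g.upper() == e.upper():
            score += MATCH
            matches += 1
        else:
            score += MISMATCH

    # Assumption: identity = matching columns / all alignment columns.
    columns = len(aligned_genomic)
    identity = 100.0 * matches / columns if columns else 0.0
    return score, identity


if __name__ == "__main__":
    print(score_alignment("ACGT-A", "ACCTTA"))
```

In this example the six columns contain four matches, one mismatch and one indel, so the score is 4 * 2 - 1 - 2 = 5.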
By default, all ESTs with only a single hit of
length less than 200 are discarded. Such hits are usually caused
by repeats or low-complexity regions missed by RepeatMasker, or
by contamination in the cDNA libraries. The maximum number of
matches shown for one query can be limited
separately for cDNAs and ESTs. The user can turn off these
features or decrease/increase the values; depending on
the matching parameters, however, this can generate many
thousands of spurious hits, all of which will be displayed in
the web interface.
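The default discard rule described above can be illustrated as follows. This is a sketch under stated assumptions, not E2G's implementation: the function name `discard_single_short_hits` and the `(est_id, hit_length)` input layout are hypothetical.

```python
from collections import defaultdict


def discard_single_short_hits(hits, min_length=200):
    """Drop ESTs whose only hit is shorter than min_length.

    hits: iterable of (est_id, hit_length) pairs (hypothetical layout).
    Returns a dict mapping each kept EST id to its list of hit lengths.
    """
    by_est = defaultdict(list)
    for est_id, hit_length in hits:
        by_est[est_id].append(hit_length)

    kept = {}
    for est_id, lengths in by_est.items():
        single_short = len(lengths) == 1 and lengths[0] < min_length
        if not single_short:
            kept[est_id] = lengths
    return kept


if __name__ == "__main__":
    hits = [("est1", 150), ("est2", 120), ("est2", 90), ("est3", 450)]
    print(discard_single_short_hits(hits))
```

Here est1 is dropped because its only hit is shorter than 200, while est2 is kept because it has more than one hit, even though each of them is short.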
A sequence in
FASTA format begins with a single-line description, followed by
lines of sequence data.
- The description line starts with a greater-than symbol
(">").
- The word immediately following the greater-than symbol
is the "ID" (name) of the sequence; the rest of the
line is the description.
- The "ID" and the description are optional.
- All lines of text should be shorter than 80 (normally 60)
characters.
- The sequence ends when another greater-than symbol (">")
appears at the beginning of a line, where the next sequence
begins.
The following example contains one sequence (sequence_1):
>sequence_1
caagcacagaaacctatggcataaatccctctgagacgcgttgtactatggttatctaat
tctccggcgacacaagttgtctaaccgtgatcaccttaaagggcaagccgcccaatagat
gttagttaatactacgtaccaagtatgcctgcgcttggtaaagccgcctgtccatagttc
tactagggtagagcttcaggatgctctatagttcgagcggttctttgatcaactcgacta
gctaccaccatgtctgtgttttattgcacgcaaagtcgtaagtttaaacggaccaagaag
ccttcttcggtcagtagcaggttaagggccaagtacaagcctctccaggaatgcttaacg
gcatcgatgcaacttggacaagtaaacatcctgaagctta
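The FASTA rules listed above translate directly into a small reader, which can be used to check that a file contains the expected number of sequences. This is a minimal sketch; the function name `parse_fasta` and the dictionary layout of the records are hypothetical choices for illustration.

```python
def parse_fasta(text: str) -> list:
    """Split FASTA text into records with 'id', 'description' and 'sequence'."""
    records = []
    for raw in text.splitlines():
        line = raw.rstrip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            if not header or header[:1].isspace():
                # No word directly after ">": the ID is empty (it is optional).
                seq_id, description = "", header.strip()
            else:
                parts = header.split(None, 1)
                seq_id = parts[0]
                description = parts[1].strip() if len(parts) > 1 else ""
            records.append({"id": seq_id, "description": description,
                            "sequence": ""})
        else:
            if not records:
                raise ValueError("sequence data before the first '>' line")
            records[-1]["sequence"] += line.strip()
    return records


if __name__ == "__main__":
    example = ">sequence_1 example entry\ncaagcacagaaacc\ntatggcataaatcc\n"
    for rec in parse_fasta(example):
        print(rec["id"], len(rec["sequence"]))
```

For the genomic upload the reader should return exactly one record; for an uploaded EST set it may return many.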
The figure above shows a screenshot of the graphical
overview produced by E2G when uploading a 16.5 kbp genomic
sequence from M. musculus (GenBank GI: 28515921, bases
60,000-76,500) and comparing it to 4.1 million ESTs from the
same species. The
overview is split into five panels, arranged from top to bottom:
- General Information Panel: The top of the window
provides general information about the current task. The user
can zoom into a region of interest within the submitted genomic
sequence. The positions of the highlighted matches in the EST
and the genomic sequence are displayed. This part of the
overview also provides links to download the sequences or GI
numbers of matching ESTs.
- Annotation Panel: The second section of the window
shows gene predictions for the genomic sequence, as uploaded by
the user (orange colored) and delivered by GenScan (blue
colored). If the prediction refers to the forward strand, then
the exons are shown above the line representing the genome,
otherwise below.
- cDNA Mapping Panel: cDNA matches on the genomic
sequence are shown as colored blocks. Forward matches are shown
in green; reverse-complemented matches are shown in red.
- EST Mapping Panel: EST matches on the genomic
sequence are shown in the same way as cDNA matches. The two
kinds of matches are separated since cDNAs are usually of
higher quality and thus matches to the genomic sequence are
more reliable.
- Mapping Summary Panel: The bottom panel provides a
summary of all matches, shown as colored boxes. The color code
represents the coverage of a region, i.e. the relative number
of matches in the region. For example, in the figure above,
regions with high coverage are represented by red boxes and
regions with low coverage by blue boxes.
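The idea behind the mapping summary, coverage as the relative number of matches per region, can be sketched as follows. How E2G actually bins the sequence and assigns colors is not documented here, so the function name `relative_coverage`, the fixed number of bins, and the normalisation by the busiest bin are all assumptions made for illustration.

```python
def relative_coverage(matches, seq_length, bins=10):
    """Relative number of matches per region of the genomic sequence.

    matches: list of (start, end) positions, 1-based and inclusive.
    Returns one value per bin, scaled so the most covered bin is 1.0.
    """
    counts = [0] * bins
    for start, end in matches:
        # Integer arithmetic maps positions 1..seq_length onto bins 0..bins-1.
        first = (start - 1) * bins // seq_length
        last = min((end - 1) * bins // seq_length, bins - 1)
        for b in range(first, last + 1):
            counts[b] += 1

    peak = max(counts)
    return [c / peak if peak else 0.0 for c in counts]


if __name__ == "__main__":
    print(relative_coverage([(1, 25), (20, 60), (30, 40)], seq_length=100, bins=4))
```

Values close to 1.0 would correspond to the high-coverage end of the color scale and values close to 0.0 to the low-coverage end.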
The GenScan and uploaded annotations from the annotation
panel can be superimposed on the cDNA and EST matches by
dragging a transparent image over the lower part of the window.
For web browsers that do not support JavaScript mouse events, the
up/down buttons in the upper right corner of the window provide
the same functionality. The transparent image allows
the user to compare the gene prediction with the matches found.
Clicking on a match opens a popup window showing an alignment
(computed by Vmatch) between this individual region of the EST
and the genomic sequence. The alignment is supplemented by
additional information such as positions in the genomic sequence
and in the EST, scores, identity values, and E-values (see
Figure, bottom). Additionally, sim4 is run to produce a spliced
alignment over the whole EST sequence whenever the button on the
right is clicked.