DIALIGN - Manual
DIALIGN requires a single ASCII file containing the sequences
to be aligned. Four different file formats are supported: IG,
FASTA, EMBL and GCG-RSF format. The following is an example of
the FASTA sequence file format:
>HTL2
LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
STFNEYTDSLILAPL
>MMLV
PDADHTWYTDGSSLLQEGQRKAGAAVTTETEVIWAKALDAGTSAQRAELIALTQALKMAE
GKKLNVYTDSRYAFATAHIHGEIYRRRGLLTSEGKEIKNKDEILALLKALFLPKRLSIIH
CPGHQKGHSAEARGNRMADQAARKAAITETPDTSTLL
>HEPB
RPGLCQVFADATPTGWGLVMGHQRMRGTFSAPLPIHTAELLAACFARSRSGANIIGTDNS
VVLSRKYTSFPWLLGCAANWILRGTSFVYVPSALNPADDPSRGRLGLSRPLLRLPFRPTT
GRTSLYADSPSVPSHLPDRVH
>ECOL
MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
EHCEVILSTDSQYVRQGITQWIHNWKKRGWKTADKKPVKNVDLWQRLDAALGQHQIKWEW
VKGHAGHPENERCDELARAAAMNPTLEDTGYQVEV
For each sequence, the first line starts with ">" and
contains the name of the sequence.
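The FASTA layout described above can be read with a few lines of Python. This is a minimal sketch for illustration only, not part of DIALIGN; the function name parse_fasta is an assumption, and the example input is a shortened copy of the file shown above.

```python
def parse_fasta(text):
    """Return a dict mapping sequence names to sequences, in file order."""
    parts = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A header line: everything after ">" is the sequence name.
            name = line[1:].strip()
            parts[name] = []
        elif name is not None:
            # A sequence line belonging to the most recent header.
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


example = """>HTL2
LDTAPCLFSDGSPQKAAYVLWDQTILQQDITPLPSHETHSAQKGELLALICGLRAAKPWP
SLNIFLDSKYLIKYLHSLAIGAFLGTSAHQTLQAALPPLLQGKTIYLHHVRSHTNLPDPI
STFNEYTDSLILAPL
>ECOL
MLKQVEIFTDGSCLGNPGPGGYGAILRYRGREKTFSAGYTRTTNNRMELMAAIVALEALK
"""

seqs = parse_fasta(example)
for seq_name, seq in seqs.items():
    print(seq_name, len(seq))
```

Sequence lines are simply concatenated, so the line length used in the input file does not matter.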
- Sequence Type:
The user can decide if nucleic acid or protein sequences
are to be aligned.
- Threshold T:
As described in our papers,
DIALIGN constructs alignments from gapfree pairs of
segments of the sequences. Such segment pairs are
referred to as `diagonals'.
Every possible diagonal is given a so-called weight
reflecting the degree of similarity between the two segments
involved. The overall score of an alignment is then defined
as the sum of weights of the diagonals it consists of. The
program tries to find an alignment with maximum score; in
other words, it tries to find a consistent
collection of diagonals with maximum sum of weights. This
novel scoring scheme for alignments is the basic
difference between DIALIGN and other global or local
alignment methods. Note that DIALIGN does not employ any
kind of gap penalty.
It is possible to use a threshold T for the quality
of the diagonals. In this case, diagonals are considered only
if their `weights' exceed this threshold, and regions of
lower similarity are ignored.
In the first version of the program (DIALIGN 1), this
threshold was in many situations absolutely necessary to
obtain meaningful alignments. By contrast, DIALIGN 2 should
produce reasonable alignments without a threshold,
i.e. with T = 0. This is the most important
difference between DIALIGN 2 and the first version of the
program.
Nevertheless, it is still possible to use a threshold
T, so it is up to the user to experiment with this
option.
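The effect of the threshold on the alignment score can be sketched as follows. This is an illustration under stated assumptions, not DIALIGN's implementation: the Diagonal record, its field names and the weights are hypothetical, and the sketch neither computes weights from sequence data nor selects a consistent collection of diagonals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Diagonal:
    """Hypothetical record for a gap-free segment pair."""
    seq_a: str
    start_a: int
    seq_b: str
    start_b: int
    length: int
    weight: float


def apply_threshold(diagonals, threshold=0.0):
    """Keep only diagonals whose weight exceeds the threshold T."""
    return [d for d in diagonals if d.weight > threshold]


def alignment_score(diagonals):
    """Score of an alignment: the sum of the weights of its diagonals."""
    return sum(d.weight for d in diagonals)


# Made-up weights, chosen only to show the arithmetic.
candidates = [
    Diagonal("HTL2", 7, "MMLV", 8, 6, 12.5),
    Diagonal("HTL2", 42, "ECOL", 45, 10, 3.0),
    Diagonal("MMLV", 94, "ECOL", 95, 4, 0.5),
]

print(alignment_score(apply_threshold(candidates, 0.0)))  # 16.0
print(alignment_score(apply_threshold(candidates, 2.0)))  # 15.5
```

With T = 0 all three hypothetical diagonals contribute to the score; with T = 2 the weakest one is ignored.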
- Translation of `nucleotide diagonals' into `peptide diagonals':
If (possibly) coding nucleic acid sequences are to be
aligned, DIALIGN optionally translates the compared `nucleic
acid segments' to `peptide segments' according to the genetic
code -- without (necessarily) presupposing any of the three
possible reading frames, so all three of them get checked for
significant similarity. In this case, the similarity among
segments will be assessed on the `peptide level' rather than
on the `nucleic acid level'. We strongly recommend this
option if nucleic acid sequences are expected to contain
protein coding regions, as it will significantly increase the
sensitivity of the alignment procedure in such cases.
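Translating a nucleic acid segment in all three reading frames can be sketched as follows. This only illustrates the idea of translation by the standard genetic code; how DIALIGN itself compares the translated segments is not shown here. The function names translate and three_frames are assumptions, and unknown codons are rendered as "X" by choice.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(segment):
    """Translate a nucleotide segment codon by codon ('*' marks a stop)."""
    segment = segment.upper().replace("U", "T")
    codons = (segment[i:i + 3] for i in range(0, len(segment) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def three_frames(sequence):
    """Return the translations starting at offsets 0, 1 and 2."""
    return [translate(sequence[offset:]) for offset in range(3)]


print(three_frames("ATGGCCAAATAA"))  # ['MAK*', 'WPN', 'GQI']
```

Trailing bases that do not fill a complete codon are dropped in each frame.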
- `*' characters:
The user can specify the maximum number of `*' characters indicating the degree of local
similarity among sequences.
- Similarity Matrix:
DIALIGN 2 employs the BLOSUM62 amino acid substitution
matrix.
DIALIGN creates a file containing:
- An alignment of the input sequences in DIALIGN format.
- The same alignment in FASTA format.
- A sequence tree in PHYLIP format. This tree is constructed
by applying the UPGMA clustering method to the DIALIGN
similarity scores. It roughly reflects the different degrees
of similarity among sequences. For detailed phylogenetic
analysis, we recommend the usual methods for phylogenetic
reconstruction.
The following is an example of an alignment in DIALIGN format:
HTL2 1 ldtapcLFSD GS------PQ KAAYVLWDQT IL---QQDIT PLPSHethSA
MMLV 1 pdadhtwYTD GSSLLQEGQR KAGAAVTTET eviwaKALDA G---T---SA
HEPB 1 rpglcQVFAD AT------PT GWGLVMGHQR MR---GTFSA PLPIHt----
ECOL 1 mlkqvEIFTD GSCLGNPGPG GYGAILRYRG RE---KTFSA GytrT---TN
***** ********** ********** ** ***** ***** **
**** ** ** ********** ** ***** ***** **
*** ** ** ********** ** *****
** ******
HTL2 42 QKGELLALIC GLRAAKPWPS LNIFLDSKYL IKYLHslaig aflgtsah--
MMLV 45 QRAELIALTQ ALKMAEgkk- LNVYTDSRYA FATAHIHGEI YRRRGLLTSE
HEPB 38 --AELLAACF Arsrsgan-- -IIGTDN--- ---------- ----------
ECOL 45 NRMELMAAIV ALEALKEHCE VILSTDSQYV RQGITQWIHN WKKRGWKTAD
********** ********** ********** ********** **********
********** ********** ********** ********** **********
******* ****** ********** *****
******* ****** ********** *****
********
HTL2 90 -------QT- --LQAALPPL LQGKTIYLHH VRSHT----- -NLPDPISTF
MMLV 94 GKEIKNKDE- --ILALLKAL FLPKRLSIIH CPGHQ----- -KGHSAEARG
HEPB 60 ---------- ---SVVLSR- ---------- ---KYTSFPW LLGCAANWI-
ECOL 95 KKPVKNVDlw qrLDAALGQ- ---------- ---HQIKWEW VKGHAGHPE-
********* ******** ********** ********** **********
********
*
HTL2 124 NEYTDSLILA pl-------- ---------- ---------- ----------
MMLV 135 NRMADQAARK AAITETPDTS tll------- ---------- ----------
HEPB 82 LRGTSFVYVP SALNPADDPS rgrlglsrpl lrlpfrpttg rtslyadsps
ECOL 130 NERCDELARA AAMNPTledt gyqvev---- ---------- ----------
********** **********
********** ******
HTL2 136 ----------
MMLV ----------
HEPB 132 vpshlpdrvh
ECOL 156 ----------
- Names of the aligned sequences are shown on the left hand
side of the alignment.
- Numbers on the left hand side of the alignment denote the
position of the first residue in a line within the respective
sequence.
- Capital letters denote aligned residues, i.e. residues
involved in at least one of the `diagonals' the alignment
consists of. Lower-case letters denote residues that do not
belong to any of these selected `diagonals'. They are
not considered to be aligned by DIALIGN. Thus, if a
lower-case letter stands in the same column as other
letters, this is purely by chance; these residues are not
considered to be homologous.
- The number of `*' characters
below the alignment reflects the degree of local similarity
among sequences. More precisely, it represents the sum of
`weights' of diagonals connecting residues at the respective
position. The number of `*' characters is normalized such that
regions of maximum similarity have N `*' characters per
column. N can be specified by the user. By default,
N = 5. Note that the number of `*' characters depicts
only the relative degree of similarity within an alignment,
since in every alignment, the region of maximum
similarity gets N `*' characters.
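The normalization of the `*' characters can be sketched as follows. This is an illustration under assumptions, not DIALIGN's code: the per-column weight sums are made up, the function names are hypothetical, and rounding up (so that every column with a positive weight gets at least one `*') is a choice made for this sketch, since the exact rounding rule is not stated here.

```python
import math


def star_counts(column_weights, max_stars=5):
    """Scale per-column weight sums so that the maximum gets max_stars."""
    peak = max(column_weights, default=0.0)
    if peak <= 0:
        return [0] * len(column_weights)
    counts = []
    for weight in column_weights:
        if weight <= 0:
            counts.append(0)
        else:
            # Assumed rounding rule: round up, capped at max_stars.
            stars = math.ceil(max_stars * (weight / peak))
            counts.append(min(max_stars, stars))
    return counts


def star_rows(counts, max_stars=5):
    """Render the counts as text rows, one row per level of similarity."""
    rows = []
    for level in range(1, max_stars + 1):
        row = "".join("*" if c >= level else " " for c in counts)
        if row.strip():
            rows.append(row.rstrip())
    return rows


weights = [0.0, 1.0, 4.5, 10.0]   # made-up sums of diagonal weights
counts = star_counts(weights)      # [0, 1, 3, 5]
for row in star_rows(counts):
    print(row)
```

Because the counts are scaled by the maximum within one alignment, the same number of `*' characters in two different alignments does not imply the same absolute similarity.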
The same alignment in FASTA format:
>HTL2
ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA
QKGELLALICGLRAAKPWPSLNIFLDSKYLIKYLHslaigaflgtsah--
-------QT---LQAALPPLLQGKTIYLHHVRSHT------NLPDPISTF
NEYTDSLILApl--------------------------------------
----------
>MMLV
pdadhtwYTDGSSLLQEGQRKAGAAVTTETeviwaKALDAG---T---SA
QRAELIALTQALKMAEgkk-LNVYTDSRYAFATAHIHGEIYRRRGLLTSE
GKEIKNKDE---ILALLKALFLPKRLSIIHCPGHQ------KGHSAEARG
NRMADQAARKAAITETPDTStll---------------------------
----------
>HEPB
rpglcQVFADAT------PTGWGLVMGHQRMR---GTFSAPLPIHt----
--AELLAACFArsrsgan---IIGTDN-----------------------
-------------SVVLSR--------------KYTSFPWLLGCAANWI-
LRGTSFVYVPSALNPADDPSrgrlglsrpllrlpfrpttgrtslyadsps
vpshlpdrvh
>ECOL
mlkqvEIFTDGSCLGNPGPGGYGAILRYRGRE---KTFSAGytrT---TN
NRMELMAAIVALEALKEHCEVILSTDSQYVRQGITQWIHNWKKRGWKTAD
KKPVKNVDlwqrLDAALGQ--------------HQIKWEWVKGHAGHPE-
NERCDELARAAAMNPTledtgyqvev------------------------
----------
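Since capital letters mark aligned residues and lower-case letters mark residues outside the selected diagonals, an aligned row can be summarized with a short Python sketch. The function name residue_summary is an assumption; the row is the first line of the HTL2 entry above.

```python
def residue_summary(aligned_row):
    """Count aligned (upper-case), unaligned (lower-case) and gap characters."""
    return {
        "aligned": sum(ch.isupper() for ch in aligned_row),
        "unaligned": sum(ch.islower() for ch in aligned_row),
        "gaps": aligned_row.count("-"),
    }


row = "ldtapcLFSDGS------PQKAAYVLWDQTIL---QQDITPLPSHethSA"
summary = residue_summary(row)
print(summary)  # {'aligned': 32, 'unaligned': 9, 'gaps': 9}

# Removing gaps and upper-casing recovers the original residues.
original = row.replace("-", "").upper()
print(original)
```

The recovered string has 41 residues, which agrees with the next HTL2 line of the DIALIGN-format alignment starting at position 42.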
The sequence tree in PHYLIP format:
((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);
Trees can be visualized using the drawtree program
contained in the PHYLIP software package.
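For a quick look at the tree without PHYLIP, the leaf names and their branch lengths can be extracted with a regular expression. This is a minimal sketch that assumes simple alphanumeric sequence names, as in the tree above; it is not a general parser for the tree format, and the function name leaf_branch_lengths is an assumption.

```python
import re

tree_text = """((HTL2:0.111024,
(MMLV:0.078471,
ECOL:0.078471):0.032554):0.121218,
HEPB:0.232242);"""


def leaf_branch_lengths(text):
    """Map each leaf name to the length of the branch leading to it."""
    # A leaf is a name directly followed by ':' and a number. Internal
    # branches follow a ')' instead of a name, so they are not matched.
    pairs = re.findall(r"([A-Za-z0-9_]+):([0-9.]+)", text)
    return {name: float(length) for name, length in pairs}


for leaf, length in leaf_branch_lengths(tree_text).items():
    print(f"{leaf}\t{length}")
```

In this example MMLV and ECOL have equal branch lengths because UPGMA joins them first and places the new node at equal distance from both.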