How to Identify and to Evaluate Diagnostic Sites Computationally
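Given an alphabet $\mathcal{A}$
(including the ``gap letter''), an indexed family of aligned sequences
$(s_i)_{i \in I}$ ($i$ in some index set $I$, e.g. the numbers from 1 to 91
if we deal with all of the 91 vertebrate serpin sequences), all
consisting, by virtue of the alignment, of exactly $N$ letters from
$\mathcal{A}$, and any subclass of these sequences specified by the
corresponding subset $J$ of the index set $I$ (e.g. the subclass of
certified ovalbumin-type sequences specified by the numbers from 1 to 12),
we can form the profile of the subfamily $(s_j)_{j \in J}$,
which we define, for each site $k$ ($k = 1, \dots, N$), to be the
$|\mathcal{A}|$-tuple
\[ p_J(k) := \bigl( p_J(k,a) \bigr)_{a \in \mathcal{A}} \]
of observed frequencies
\[ p_J(k,a) := \frac{|\{ j \in J : s_j(k) = a \}|}{|J|}, \]
and we can compare this $|\mathcal{A}|$-tuple with the corresponding
$|\mathcal{A}|$-tuple $p_{I \setminus J}(k)$ defined for the complement
$I \setminus J$ of $J$ in $I$, e.g., by forming their $L_1$-distance
\[ d_J(k) := \sum_{a \in \mathcal{A}} \bigl| p_J(k,a) - p_{I \setminus J}(k,a) \bigr|. \]
Clearly, we always have
\[ \sum_{a \in \mathcal{A}} p_J(k,a) = 1 \]
and
\[ \sum_{a \in \mathcal{A}} p_{I \setminus J}(k,a) = 1 \]
and, hence, since all frequencies are non-negative,
\[ d_J(k) \le \sum_{a \in \mathcal{A}} \bigl( p_J(k,a) + p_{I \setminus J}(k,a) \bigr) = 2, \]
with equality if and only if the collection
\[ A_J(k) := \{ s_j(k) : j \in J \} \]
of letters occurring at site $k$ within the subfamily specified by $J$
is disjoint from the collection $A_{I \setminus J}(k)$ defined
correspondingly with the complement $I \setminus J$ replacing $J$.
Consequently, the site ``$k$'' is a diagnostic site for the
subfamily in question if and only if
\[ d_J(k) = 2 \]
holds, because this is clearly
equivalent to asserting that membership of an index $i \in I$ in the
subset $J$ can be checked by considering the $k$-th
letter in the sequence $s_i$: if this letter is in
$A_J(k)$, the sequence belongs to the given subfamily (that
is, $i$ must then belong to $J$);
otherwise it belongs to its complement (that
is, $i$ must then belong to $I \setminus J$). More generally, the site
``$k$'' is almost diagnostic for $J$ if $d_J(k)$
is close to $2$.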
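The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the code used for the serpin analysis: it assumes that the distance between the two profiles is the L1 distance (the sum of absolute frequency differences, whose maximum is 2), that the alignment is given as a list of equal-length strings, and that sites are numbered from 0 rather than from 1. The function names and the toy alignment are hypothetical.

```python
from collections import Counter


def profile(sequences, site):
    """Observed letter frequencies at one (0-based) site of the given sequences."""
    counts = Counter(seq[site] for seq in sequences)
    total = len(sequences)
    return {letter: n / total for letter, n in counts.items()}


def l1_distance(p, q):
    """Sum of absolute differences between two frequency profiles."""
    letters = set(p) | set(q)
    return sum(abs(p.get(a, 0.0) - q.get(a, 0.0)) for a in letters)


def split(sequences, subset):
    """Split the alignment into the subfamily and its complement."""
    inside = [s for i, s in enumerate(sequences) if i in subset]
    outside = [s for i, s in enumerate(sequences) if i not in subset]
    if not inside or not outside:
        raise ValueError("subfamily and complement must both be non-empty")
    return inside, outside


def site_scores(sequences, subset):
    """L1 distance between subfamily and complement profiles, for every site."""
    inside, outside = split(sequences, subset)
    return [
        l1_distance(profile(inside, k), profile(outside, k))
        for k in range(len(sequences[0]))
    ]


def diagnostic_sites(sequences, subset):
    """Sites whose letter sets in subfamily and complement are disjoint.

    Disjointness is equivalent to the distance being exactly 2 and avoids
    comparing floating-point numbers for equality.
    """
    inside, outside = split(sequences, subset)
    return [
        k
        for k in range(len(sequences[0]))
        if {s[k] for s in inside}.isdisjoint({s[k] for s in outside})
    ]


# Toy alignment (hypothetical data); the subfamily consists of sequences 0 and 1.
alignment = ["ACG-", "ACGT", "TCGA", "TCAA"]
print(site_scores(alignment, {0, 1}))
print(diagnostic_sites(alignment, {0, 1}))
```

Almost diagnostic sites could be listed in the same way by keeping every site whose score exceeds a chosen threshold slightly below 2; the value of that threshold is a matter of judgement and is not fixed by the text.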
Using this approach, diagnostic sites have been computed for the six
groups of vertebrate serpins suggested by genomic organisation (see
diagnostic sites for vertebrate serpins for details).