
Basic Notions

We consider a string S of length n. We index the characters of
S from 0 to n - 1, i.e.
S = S0S1...Sn - 1. By
S-1 we denote the reverse of S, i.e. the string u of length
n such that
ui = Sn - 1 - i for
any
i
[0, n - 1]. Let
= {a, c, g, t} be the DNA-alphabet. We define
a function
wcc : 

by the following equations:
| wcc(a) |
= |
t |
|
| wcc(g) |
= |
c |
|
| wcc(c) |
= |
g |
|
| wcc(t) |
= |
a |
|
We extend wcc to any sequence
U = U0U1...Um over
by defining
wcc(U) = wcc(U0)wcc(U1)...wcc(Un - 1)
The Hamming distance of two equal-length strings u and v,
denoted by dH(u, v), is the number of positions where
u and v differ, that is,
dH(u, v) = |{i
[1,| u|] | ui
vi}|.
There are three kinds of edit operations: deletions,
insertions, and mismatches of single characters.
The edit distance or Levenshtein distance of u and
v, denoted by
dE(u, v), is the minimum number of edit operations
needed to transform u into v.
Let
be a relation on pairs of DNA-sequences.
Let l, r > 0,
i
[0, n - l] and
j
[0, n - r]. We define four different
kinds of repeats:
- (l, i, r, j) is a forward or direct repeat if and only if i < j and
SiSi + 1...Si + l - 1 SjSj + 1...Sj + r - 1
|
(1) |
- (l, i, r, j) is a reversed repeat if and only if i
j and
SiSi + 1...Si + l - 1 (SjSj + 1...Sj + r - 1)-1
|
(2) |
- (l, i, r, j) is a complemented repeat if and only if i
j and
SiSi + 1...Si + l - 1 wcc(SjSj + 1...Sj + r - 1)
|
(3) |
- (l, i, r, j) is a palindromic repeat if and only if i
j and
SiSi + 1...Si + l - 1 (wcc(SjSj + 1...Sj + r - 1))-1
|
(4) |
(l, i, r, j) is a repeat, if it either is a forward, reversed,
complemented, or palindromic repeat. Each repeat
(l, i, r, j) specifies two substrings of S:
- the sequence
SiSi + 1...Si + l - 1 occurring on the left-hand
side of (1), (2), (3), or
(4). This is the
first instance of (l, i, r, j).
- the sequence occurring on the right-hand
side of (1), (2),
(3), or (4). This is
the second instance of (l, i, r, j).
By specifying the relation
, we obtain the notion of exact
and degenerate repeats:
- If
is the identity on sequences, then a repeat (l, i, r, j) is
an exact repeat.
- Let
k
IN. For any strings u, v we define the relation
by:
u
v
dH(u, v)
k. Then a repeat
(l, i, r, j) is a k-mismatch repeat.
- Let
k
IN. For any strings u, v we define the relation
by:
u
v
dE(u, v)
k. Then a repeat (l, i, r, j) is
a k-differences repeat.
- A repeat is degenerate if it is a k-mismatch or a k-differences
repeat for some k
0.
To distinguish between the different types of repeats, we assign to them
a distance-value according to the following rules:
- An exact repeat has distance 0.
- A k-mismatch repeat with k > 0 mismatches has distance -k.
- A k-differences repeat with k > 0 differences has distance k.
A repeat (l, i, r, j) is contained in a repeat
(l', i', r', j') if and only if
i'
i
i + l - 1
i' + l' - 1 and
j'
j
j + r - 1
j' + r' - 1.
By requiring repeats to be maximal we can reduce the number of
interesting repeats:
- An exact repeat is maximal if it is not contained in another
exact repeat of the same kind.
- A k-mismatch repeat is maximal if it is not contained in another
k-mismatch repeat of the same kind.
- A k-differences repeat is maximal if it is not contained in another k-differences repeat of the same kind.
See Table 2, for some examples of maximal repeats.
Table 2:
This table shows some maximal repeats in
S = gagctcgagcgctgct,
together with the corresponding repeat instances.
For exact repeats, both instances are
identical, so only one sequence is shown. For the 1-mismatch
repeats, there is one mismatching pair of bases x and y in
the two repeat instances. This is shown as [xy]. For the
1-differences repeat, an optimal alignment of the two repeat instances is
shown.
 |