Basic Notions

We consider a string S of length n. We index the characters of S from 0 to n - 1, i.e. S = S0S1...Sn - 1. By S-1 we denote the reverse of S, i.e. the string u of length n such that ui = Sn - 1 - i for any i $ \in$ [0, n - 1]. Let $ \Sigma$ = {a, c, g, t} be the DNA-alphabet. We define a function wcc : $ \Sigma$$ \to$$ \Sigma$ by the following equations:

wcc(a) = t  
wcc(g) = c  
wcc(c) = g  
wcc(t) = a  

We extend wcc to any sequence U = U0U1...Um over $ \Sigma$ by defining

wcc(U) = wcc(U0)wcc(U1)...wcc(Un - 1)

The Hamming distance of two equal-length strings u and v, denoted by dH(u, v), is the number of positions where u and v differ, that is, dH(u, v) = |{i $ \in$ [1,| u|] | ui $ \neq$ vi}|.

There are three kinds of edit operations: deletions, insertions, and mismatches of single characters. The edit distance or Levenshtein distance of u and v, denoted by dE(u, v), is the minimum number of edit operations needed to transform u into v.

Let $ \approx$ be a relation on pairs of DNA-sequences. Let l, r > 0, i $ \in$ [0, n - l] and j $ \in$ [0, n - r]. We define four different kinds of repeats:

(l, i, r, j) is a repeat, if it either is a forward, reversed, complemented, or palindromic repeat. Each repeat (l, i, r, j) specifies two substrings of S:
  1. the sequence SiSi + 1...Si + l - 1 occurring on the left-hand side of (1), (2), (3), or (4). This is the first instance of (l, i, r, j).
  2. the sequence occurring on the right-hand side of (1), (2), (3), or (4). This is the second instance of (l, i, r, j).

By specifying the relation $ \approx$, we obtain the notion of exact and degenerate repeats:

  1. If $ \approx$ is the identity on sequences, then a repeat (l, i, r, j) is an exact repeat.
  2. Let k $ \in$ IN. For any strings u, v we define the relation $ \approx$ by: u $ \approx$ v$ \iff$dH(u, v) $ \leq$ k. Then a repeat (l, i, r, j) is a k-mismatch repeat.
  3. Let k $ \in$ IN. For any strings u, v we define the relation $ \approx$ by: u $ \approx$ v$ \iff$dE(u, v) $ \leq$ k. Then a repeat (l, i, r, j) is a k-differences repeat.
  4. A repeat is degenerate if it is a k-mismatch or a k-differences repeat for some k $ \geq$ 0.
To distinguish between the different types of repeats, we assign to them a distance-value according to the following rules:

A repeat (l, i, r, j) is contained in a repeat (l', i', r', j') if and only if i' $ \leq$ i $ \leq$ i + l - 1 $ \leq$ i' + l' - 1 and j' $ \leq$ j $ \leq$ j + r - 1 $ \leq$ j' + r' - 1.

By requiring repeats to be maximal we can reduce the number of interesting repeats:

See Table 2, for some examples of maximal repeats.


Table 2: This table shows some maximal repeats in S = gagctcgagcgctgct, together with the corresponding repeat instances. For exact repeats, both instances are identical, so only one sequence is shown. For the 1-mismatch repeats, there is one mismatching pair of bases x and y in the two repeat instances. This is shown as [xy]. For the 1-differences repeat, an optimal alignment of the two repeat instances is shown.
\begin{table}\begin{displaymath}\begin{array}{\vert p{4.5cm}\vert c\vert c\vert ...
...gcgct} \end{array}} \end{array}\\ \hline
\end{array}\end{displaymath}\end{table}