REPfind

Description

repfind [options] filename
repselect [options] filename
repvis [options] filename
repfind computes all maximal repeats in the input sequence contained in the file named filename, and echoes them to standard output. If an error occurs, then the program exits with exit code 1. Otherwise, the exit code is 0. We support two formats for the input file:
fasta-format
If the first line of the file begins with the symbol >, then this line is skipped and the rest of the file is considered to be the input sequence, with white spaces ignored.
plain-format
If the first line of the file does not begin with the symbol >, then the file is taken verbatim, i.e., the entire file is considered to be the input sequence (white spaces are not ignored).

There are no special options necessary to tell the program the sequence format. It automatically detects the appropriate format.

The input sequence is allowed to contain only the base symbols A, C, G, or T and IUB-wildcards R, Y, M, K, W, S, B, D, H, V, N, in upper or lower case. If any other character occurs in the input sequence, then the program returns with exit code 1. By default, each wildcard is replaced by a random base from the set of bases the wildcard stands for, see Table 1.


Table 1: The IUB-wildcards with the set of bases they stand for
\begin{table}\begin{displaymath}\begin{array}{\vert*{6}{l}\vert}\hline
R&=&\{A,G...
...=&\{A,C,G,T\}\\
S&=&\{C,G\}&&&\\ \hline
\end{array}\end{displaymath}\end{table}


The options for repfind are as follows:

-f
report maximal forward repeats.

-p
report maximal palindromic repeats.

-r
report maximal reversed repeats.

-c
report maximal complemented repeats.

-l l
specifies the length parameter l. This must be a positive integer smaller than the length of the input sequence. Only repeats of length at least l are reported.

-h [k]
search for maximal repeats such that the two instances of the repeat contain at most k mismatches. The optional argument k is a positive integer, such that k $ \leq$ 0.5l. If it is omitted, then k = 4 is the default.

-hrate r
search for maximal repeats such that the two instances of the repeat contain at most r . l mismatches. The mismatch ratio r must be a rational number smaller or equal to 0.5.

-e [k]
search for maximal repeats such that the two instances of the repeat contain at most k differences. The optional argument k is a positive integer. If it is omitted, then k = 4 is the default. If the option noevalue is not set, then k is allowed to be at most 20.

-erate r
search for maximal repeats such that the two instances of the repeat contain at most r . l differences. The differences ratio r must be a rational number smaller or equal to 0.5.

-seedsize m
set the size of the exact seeds to be searched to the maximum of m and l. This option does not have an effect if only exact matches are searched. The user should carefully consider the choice for m, since if m is larger than l, then the program may miss some repeats of length at least l with the specified number of errors.

-allmax
report all maximal repeats, in the order in which they are found.

-best m
report at most the best m repeats according to the lexicographic ordering of their E-value, their length, and their start position. That is, if two repeats are compared, the most important criterion is their E-value. The repeat with the smaller E-value is ``better''. If both repeats have the same E-value, then the longer repeat is ``better''. If also the repeats are of the same length, then their first start position matters, and the repeat with the smaller start position is ``better''. The reported repeats satisfy the given length and error constraints.

-o r
delimit the maximal number of reported repeats to r, which is a positive integer. If there are more than r repeats, the program terminates with exit code 1.

-s
additionally report the repeated substrings or an alignment of the two instances of the repeat. For details see section Output Format.

-lw q
set the line width to q. That is, the repeat instances or their alignment, reported by option s, is formatted to q symbols per line. If this option is not used, then the default line width is 60.

-iub
pairs of different residues in a mismatch repeat are shown as a single wildcard in IUB-format.

-nodist
do not show the distance value of a repeat.

-noevalue
do not compute or report the E-value of a repeat. When sorting repeats, the E-value is set to 0.0. This option allows to set the error threshold to values larger than 20.

-i
do not show repeats, but a distribution (preview information) about the length of the different repeats found.

-b
report output in binary format. This is more compact and easier to parse, and it is the input format for repselect. Note that the binary format is portable, i.e. it is the same for all machine architectures. As a consequence, one can for example compute the repeat structure for a large genome on a SUN-workstation and later view the results with repvis on a Linux-Laptop.

-warn
echo warnings about any wildcards in the input sequence on standard error.

-iw
ignore wildcards in the input sequence. This might shorten the input sequence, and the reported positions may not refer to the positions in the original input sequence.

-mem
show the peak memory usage of repfind. If option b is used, then this information goes to standard error, otherwise to standard out.

-v
show the version of the program and terminate.

-help
show a summary of all options and terminate.

Note the following when combining options.

If any of these constraints is not satisfied, then the program gives a corresponding error message and exits with error code 1.

The algorithms implemented in repfind are not heuristic: they find all maximal k-mismatch or k-differences repeats exceeding some given length. However, usually the user only wants to see the most interesting repeats. For this reason, in the default mode (i.e. without option allmax) repfind outputs repeats according to the following rules:

  1. To be selected, the right instance of a forward repeat has to start at least k + 1 positions to the right of the left instance of the repeat.
  2. A seed is an exact repeat of length at least $ \lfloor$$ {\frac{\ell}{k+1}}$$ \rfloor$, see [2]. For each seed only the most significant repeat containing that seed is selected. In this way, ``clumps'' of repeats are represented by only one repeat.

  3. Among all repeats selected according to 1 and 2, repfind reports at most the best b repeats, where b is 50 by default, or the value set by the option best.

There is an upper bound on the length of the input sequence: If the options r, c, and p are not set, then the length is at most 134,217,724. If one of the options r, c, or p is set, then the maximal length is 67,108,860. See also option v. If the input sequence is longer than these limits, then the program exits with error code 1.

Suppose an input sequence of length n. If the options p, r, or c are not used, the program requires about 13n bytes per input character. If one of the options p, r, or c is used, then the space requirement is about 26n bytes per input character. So one should make sure that the computer, repfind runs on, has enough main memory. If it has, then the program basically runs in linear time (for details see [2]) with small constants. If there is not enough memory than expect very long running times due to paging effects.


Important Remark

The user of the program should carefully choose the length parameter l and the maximal number k of mismatches/differences: The number of maximal repeats exponentially grows with k and $ {\frac{1}{\ell}}$. Since repfind computes all maximal repeats according to the specified parameters, it might run for a very long time. This is not due to a poor implementation or an inefficient algorithm, but to the number of repeats to be computed (although not all maximal repeats may be reported, if the user only requests to see the first 50 or so).