
REPfind

Description
repfind [options] filename
repselect [options] filename
repvis [options] filename
repfind computes all maximal repeats in
the input sequence contained in the file named
filename, and echoes them to standard output.
If an error occurs, then the program exits with exit code 1. Otherwise,
the exit code is 0.
We support two formats for the input file:
- fasta-format
- If the first line of the file begins with the symbol >,
then this line is skipped and the rest of the file is considered to be the
input sequence, with white spaces ignored.
- plain-format
- If the first line of the file does not begin with the symbol
>, then the file is taken verbatim, i.e., the
entire file is considered to be the input sequence (white spaces are
not ignored).
There are no special options necessary to tell the program the sequence format.
It automatically detects the appropriate format.
The input sequence is allowed to contain only the base symbols
A, C, G, or T and IUB-wildcards
R, Y, M, K, W, S, B, D, H, V, N,
in upper or lower case. If any other character occurs in the input sequence,
then the program returns with exit code 1.
By default, each wildcard is replaced by a random base from the set of bases
the wildcard stands for, see Table 1.
Table 1:
The IUB-wildcards with the set of bases they stand for
 |
The options for repfind
are as follows:
- -f
- report maximal forward repeats.
- -p
- report maximal palindromic repeats.
- -r
- report maximal reversed repeats.
- -c
- report maximal complemented repeats.
- -l l
- specifies the length parameter l. This must be a positive integer
smaller than the length of the input sequence. Only repeats of length
at least l are reported.
- -h [k]
- search for maximal repeats such that
the two instances of the repeat contain at most k mismatches. The
optional argument k is a positive integer, such that
k
0.5l. If it is omitted, then k = 4 is the default.
- -hrate r
- search for maximal repeats such that
the two instances of the repeat contain at most
r . l mismatches. The mismatch ratio r must be a rational number
smaller or equal to 0.5.
- -e [k]
- search for maximal repeats such that
the two instances of the repeat contain at most k differences. The
optional argument k is a positive integer. If it is omitted, then
k = 4 is the default. If the option noevalue is not set, then
k is allowed to be at most 20.
- -erate r
- search for maximal repeats such that
the two instances of the repeat contain at most
r . l differences. The differences ratio r must be a rational number
smaller or equal to 0.5.
- -seedsize m
- set the size of the exact seeds to be searched to the maximum of m and l.
This option does not have an effect if only exact matches are searched. The
user should carefully consider the choice for m, since if
m is larger than l, then the program may miss some repeats of length
at least l with the specified number of errors.
- -allmax
- report all maximal repeats, in the order in which
they are found.
- -best m
- report at most the best m repeats according to the lexicographic ordering
of their E-value, their length, and their start position. That is, if two
repeats are compared, the most important criterion is their E-value.
The repeat with
the smaller E-value is ``better''. If both repeats have the same E-value,
then the longer repeat is ``better''. If also the repeats are of the
same length, then their first start position matters, and the repeat with the
smaller start position is ``better''.
The reported repeats satisfy the given length and error constraints.
- -o r
- delimit the maximal number of reported repeats to
r, which is a positive integer. If there are more than
r repeats, the program terminates with exit code 1.
- -s
- additionally report the repeated substrings or an alignment of the two
instances of the repeat. For details see section Output Format.
- -lw q
- set the line width to q. That is, the repeat instances or their alignment,
reported by option s, is formatted to q symbols per line.
If this option is not used, then the default line width is 60.
- -iub
- pairs of different residues in a mismatch repeat
are shown as a single wildcard in IUB-format.
- -nodist
- do not show the distance value of a repeat.
- -noevalue
- do not compute or report the E-value of a repeat. When sorting repeats,
the E-value is set to 0.0. This option allows to set the error
threshold to values larger than 20.
- -i
- do not show repeats, but a distribution (preview information)
about the length of the different repeats found.
- -b
- report output in binary format. This is more compact
and easier to parse, and it is the input format for repselect.
Note that the binary format is portable, i.e. it is the same for all
machine architectures. As a consequence, one can for example
compute the repeat structure for a large genome on a SUN-workstation
and later view the results with repvis on a Linux-Laptop.
- -warn
- echo warnings about any wildcards in the input sequence on standard error.
- -iw
- ignore wildcards in the input sequence.
This might shorten the input sequence, and the reported positions
may not refer to the positions in the original input sequence.
- -mem
- show the peak memory usage of repfind. If option b is used, then
this information goes to standard error, otherwise to standard out.
- -v
- show the version of the program and terminate.
- -help
- show a summary of all options and terminate.
Note the following when combining options.
- Each option can be used only once.
- The options f, r, c, p can be combined,
if exact repeats are search, or if the option allmax is not used.
If none of these four options is used, then the default
is to search for forward repeats.
- Only one of the four options e, h, erate, and
hrate can be used.
- The following pairs of options cannot be combined:
i/s, b/s, i/b,
i/o, best/o, best/i,
best/allmax, warn/iw.
- If option l is not used, then option allmax is not allowed.
- If the options l, h, and e are not used, then
repfind automatically determines the length parameter l
such that l is maximal and there are at least
b exact repeats of length at least l,
where b is the number of repeats requested. The determination of l
requires an extra suffix tree traversal, so the program may require more
time.
- If the option l is not used, but the option h or the
option e, then these are not allowed to have an argument.
If option h is used, then 4 mismatches are allowed.
If option e is used, then 4 differences are allowed.
Moreover, l (the length parameter)
is automatically determined according to the repetitiveness of the
input sequence. We have found this default value for the number of
errors and the automatic determination of l to give good initial
results, which can then be refined by the user if necessary.
- If option l is used, and none of the options h,
e, hrate, or erate, then no mismatches/differences
are allowed.
- If option l is used, then the options h and
e must have a positive integer argument.
- If options best and allmax are not used, then the best
50 repeats are reported ordered by their significance.
- If option hrate or erate is used, then option l
must be used.
- If option i is used, then option allmax must be used.
- If option noevalue is not used, then the number of mismatches or
differences is limited to 20. This is due to the fact, that we only have
precomputed E-values for at most 20 mismatches/differences.
If any of these constraints is not satisfied, then the program gives a
corresponding error message and exits with error code 1.
The algorithms implemented in repfind are not heuristic: they find all maximal k-mismatch or
k-differences repeats exceeding some given length. However, usually
the user only wants to see the most interesting repeats. For this reason,
in the default mode (i.e. without option allmax) repfind outputs repeats according to the following rules:
- To be selected, the right instance of a forward repeat has to start at least
k + 1 positions to the right of the left instance of the repeat.
- A seed is an exact repeat of length at least


,
see [2]. For each seed only the most significant repeat
containing that seed is selected. In this way, ``clumps'' of repeats
are represented by only one repeat.
- Among all repeats selected according to 1 and
2, repfind reports at most the best b repeats, where b is 50 by default,
or the value set by the option best.
There is an upper bound on the length of the input sequence:
If the options r, c, and p are not set,
then the length is at most 134,217,724.
If one of the options r, c, or p is set,
then the maximal length is 67,108,860. See also option v.
If the input sequence is longer than these limits, then
the program exits with error code 1.
Suppose an input sequence of length n. If the options p,
r, or c are not used, the program requires about
13n bytes per input character. If one of the options p,
r, or c is used, then the space requirement is about
26n bytes per input character. So one should make sure that the computer,
repfind runs on, has enough main memory. If it has, then the program
basically runs in linear time (for details see [2]) with small
constants. If there is not enough memory than expect very long running times
due to paging effects.
Important Remark
The user of the program should carefully choose the length parameter l
and the maximal number k of mismatches/differences: The number of maximal
repeats exponentially grows with k and
. Since repfind
computes all maximal
repeats according to the specified parameters, it might run for a very long
time. This is not due to a poor implementation or an inefficient algorithm,
but to the number of repeats to be computed (although not
all maximal repeats may be reported, if the user only requests to see
the first 50 or so).