
Introduction

A tool for the systematic study of the repetitive structure of complete
genomes must satisfy the following criteria:
- Efficiency
The genomes to be studied range in size up to 3-4 billion base pairs.
A complete analysis therefore requires algorithms that are practically linear
in both computer memory and execution time.
- Flexibility and Significance
While exact repeats often give a first hint at the overall repetitive
structure, a biologically realistic model must recognize degenerate repeats,
which allow a certain rate of error. Flexibility also requires recognizing
not just direct repeats, but also palindromic repeats
and other closely related sequence features. In the presence of errors,
the significance of a particular pattern is not easily judged,
and a statistical assessment of significance is mandatory.
- Interactive Visualization
Since a large amount of data is generated, interactive
visualization is required. Human
investigators need to obtain an overview on a whole genome or chromosome
basis, but also must be able to zoom in on the details of a particular
repetitive region.
- Compositionality
In the long run, we expect that repeat finding is only a basic step
in explaining genome structure. Further analysis will be built on top
of the repeat finding.
Hence, the repeat finding program must provide a simple interface to enable
composition with such advanced analysis programs.
The REPuter
program family described herein satisfies these requirements
in the following way:
repfind uses an efficient and compact implementation of suffix trees
to locate exact repeats in linear space and time. This makes the
time-critical task feasible for sequences up to the size
of the human genome. These exact repeats are
used as seeds from which significant degenerate repeats are constructed,
allowing for mismatches, insertions, and deletions.
Note that our program is not heuristic: it
is guaranteed to find all degenerate repeats specified by the parameters.
Output size can be controlled via parameters for minimum length and
maximum error. Output is sorted by significance scores (E-values)
calculated according to the distance model used.
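The seed-and-extend idea described above can be illustrated with a minimal sketch. It is not the repfind implementation: it collects exact seeds with a substring dictionary instead of a suffix tree, handles only direct repeats, tolerates mismatches only (no insertions or deletions), and computes no E-values. The function names exact_seeds and extend_right and the parameters k and max_mismatches are hypothetical and chosen for illustration.

```python
from collections import defaultdict


def exact_seeds(seq, k):
    """Return sorted pairs (i, j), i < j, with seq[i:i+k] == seq[j:j+k].

    Assumption: a dictionary of length-k substrings stands in for the
    suffix tree; it is simple but not space-efficient for large k.
    """
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    pairs = []
    for starts in positions.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                pairs.append((starts[a], starts[b]))
    return sorted(pairs)


def extend_right(seq, i, j, k, max_mismatches):
    """Extend an exact seed of length k at positions i < j to the right.

    Up to max_mismatches mismatching positions are tolerated. The result
    (length, mismatches) always ends on a matching position.
    """
    length = k
    mismatches = 0
    best = (k, 0)
    while j + length < len(seq):
        if seq[i + length] != seq[j + length]:
            if mismatches == max_mismatches:
                break
            mismatches += 1
        length += 1
        # Record the extension only when its last position is a match.
        if seq[i + length - 1] == seq[j + length - 1]:
            best = (length, mismatches)
    return best


if __name__ == "__main__":
    dna = "GATTACACCGATTGCA"
    for i, j in exact_seeds(dna, 4):
        print(i, j, extend_right(dna, i, j, 4, max_mismatches=1))
```

A complete tool would also extend seeds to the left, allow insertions and deletions, and score each candidate for significance.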
repselect
allows the user to select interesting repeats
from the output of repfind
according to user-defined criteria.
It passes a list of repeats
of chosen length, degeneracy, or significance to further analysis routines.
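The kind of selection described above can be sketched as a simple filter. This is an illustration only, not the repselect interface: the Repeat record, its field names, and the parameters min_length, max_errors, and max_evalue are assumptions and do not reflect the actual output format of repfind.

```python
from dataclasses import dataclass


@dataclass
class Repeat:
    # Hypothetical record; the field names are assumptions for illustration.
    start1: int
    start2: int
    length: int
    errors: int
    evalue: float


def select_repeats(repeats, min_length=0, max_errors=None, max_evalue=None):
    """Keep repeats meeting all given criteria, most significant first."""
    selected = []
    for r in repeats:
        if r.length < min_length:
            continue
        if max_errors is not None and r.errors > max_errors:
            continue
        if max_evalue is not None and r.evalue > max_evalue:
            continue
        selected.append(r)
    # A smaller E-value means a more significant repeat.
    return sorted(selected, key=lambda r: r.evalue)


if __name__ == "__main__":
    candidates = [
        Repeat(10, 500, 40, 2, 1e-8),
        Repeat(20, 900, 12, 0, 3.5),
        Repeat(70, 300, 25, 1, 1e-3),
    ]
    for r in select_repeats(candidates, min_length=20, max_evalue=0.01):
        print(r)
```

The filtered list can then be handed to whatever analysis routine comes next, which is the compositional use described in the requirements.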
repvis visualizes the output from repfind.
A color code indicates significance scores,
and a scroll bar controls the amount of data displayed.
A zooming function provides whole genome views as well as detailed
presentations of selected regions.
This manual describes the above programs. We postpone the definition of the basic notions to the appendix.

About this Manual
To distinguish the different kinds of information in this manual, we use
the following typographic conventions. Note that the different text
styles require a browser with style sheets enabled.
- Normal Text - Times Roman
- Program Names - Courier, bold
- User Keyboard/Mouse Input - Courier, blue background
- Unvisited Links
- Visited Links
- Program Output
Also note that, due to the limitations of HTML, the length parameter l is sometimes written as
. The second notation is produced by the LaTeX to HTML converter.