Output Format

repfind and repselect report maximal repeats on standard output. Both programs produce the same format. If option b is not used and the input sequence is longer than 100,000, repfind additionally prints 79 dots on standard error to show progress while computing the suffix tree.

The first line of the output shows the arguments repfind was called with. The second line shows the length of the input sequence, the maximal distance allowed (negative for Hamming distance, see Section Basic Notions), the least length, and the input file name.

The rest of the output consists of lines showing a repeat (l, i, r, j) as follows:

l i S r j k e

where

If the option i is set, then instead of the repeats, their distribution is reported: In a first line, the number of all repeats is echoed. Then at most four lines follow, where each line specifies the number of repeats of a certain type. Finally, for each of the four types of repeats, the number of repeats of a certain length is shown, if this is not 0.

Suppose the input sequence S = gagctcgagcgctgct (used in Table 2 of section Basic Notions) is contained in the file sample.seq. Then

repfind -f -r -c -p -l 6 -h 1 -s sample.seq

gives the following output:

# repfind -f -r -c -p -l 6 -h 1 -s sample.seq
# 16 -1 6 sample.seq
9 0 R 9 0 0 2.75e-04 gagctcgag
8 2 P 8 2 0 1.10e-03 gctcgagc
7 0 C 7 3 0 4.39e-03 gagctcg
6 0 P 6 0 0 1.76e-02 gagctc
6 7 P 6 7 0 1.76e-02 agcgct
6 0 F 6 6 -1 3.16e-01 gagc[tg]c

Note that the last repeat is a 1-mismatch repeat, with a mismatch of t and g in the fifth position.
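The bracket notation can be expanded back into the two repeat instances. The following Python sketch does this for the last line of the example; it assumes that the first letter inside the brackets belongs to the first instance and the second letter to the second instance, which agrees with the sample sequence. The function names expand_mismatch_notation and hamming are illustrative and not part of repfind.

```python
import re
from typing import Tuple


def expand_mismatch_notation(shown: str) -> Tuple[str, str]:
    """Expand e.g. 'gagc[tg]c' into the two instances it abbreviates."""
    first, second = [], []
    # Tokens are either a bracketed pair such as "[tg]" or a plain run.
    for token in re.findall(r"\[..\]|[^\[\]]+", shown):
        if token.startswith("["):
            # Assumption: first letter -> first instance, second -> second.
            first.append(token[1])
            second.append(token[2])
        else:
            first.append(token)
            second.append(token)
    return "".join(first), "".join(second)


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


S = "gagctcgagcgctgct"
a, b = expand_mismatch_notation("gagc[tg]c")
print(a, b, hamming(a, b))         # gagctc gagcgc 1
print(S[0:6] == a, S[6:12] == b)   # True True
```

The second line of output confirms that, for this forward repeat, the two instances are the substrings of length 6 starting at positions 0 and 6 of S, counting from 0.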

Instead, we can first produce a file in binary format and then display it with repselect, this time sorting the output according to the first position.

repfind -f -r -c -p -l 6 -h 1 -b sample.seq > sample.bin
repselect -ia -s sample.bin

gives the following output:

# repfind -f -r -c -p -l 6 -h 1 sample.seq
# 16 -1 6 sample.seq
9 0 R 9 0 0 2.75e-04 gagctcgag
7 0 C 7 3 0 4.39e-03 gagctcg
6 0 P 6 0 0 1.76e-02 gagctc
6 0 F 6 6 -1 3.16e-01 gagc[tg]c
8 2 P 8 2 0 1.10e-03 gctcgagc
7 3 R 7 5 -1 9.23e-02 c[tg]cgagc
6 7 P 6 7 0 1.76e-02 agcgct

Alternatively, one can obtain the distribution of the repeats by calling

repfind -f -r -c -p -l 6 -h 1 -i sample.seq

This gives:

# repfind -f -r -c -p -l 6 -h 1 -i sample.seq
# 16 -1 6 sample.seq
# all 6
# allF 1
# allC 1
# allR 1
# allP 3
# F 6 1
# R 9 1
# C 7 1
# P 6 2
# P 8 1
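The same counts can be recomputed from the ordinary repeat lines, which may be useful as a cross-check. The sketch below is a minimal Python illustration, not the method repfind itself uses. It assumes that a repeat line starts with a number, carries the repeat type (F, R, C or P) in the third field, and that the length counted in the distribution is the first field. The name tally is hypothetical.

```python
from collections import Counter

# Repeat lines from the first example above (header lines omitted).
REPEAT_LINES = """\
9 0 R 9 0 0 2.75e-04 gagctcgag
8 2 P 8 2 0 1.10e-03 gctcgagc
7 0 C 7 3 0 4.39e-03 gagctcg
6 0 P 6 0 0 1.76e-02 gagctc
6 7 P 6 7 0 1.76e-02 agcgct
6 0 F 6 6 -1 3.16e-01 gagc[tg]c
""".splitlines()


def tally(lines):
    """Count repeats per type and per (type, length)."""
    by_type = Counter()
    by_type_and_length = Counter()
    for line in lines:
        fields = line.split()
        # Skip '#' lines, alignment rows and anything else that does not
        # look like "l i S r j k e ..." (assumed layout).
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        if fields[2] not in ("F", "R", "C", "P"):
            continue
        length, kind = int(fields[0]), fields[2]
        by_type[kind] += 1
        by_type_and_length[(kind, length)] += 1
    return by_type, by_type_and_length


by_type, by_type_and_length = tally(REPEAT_LINES)
print("all", sum(by_type.values()))
for kind in sorted(by_type):
    print("all" + kind, by_type[kind])
for (kind, length), count in sorted(by_type_and_length.items()):
    print(kind, length, count)
```

For the six lines above this yields one F, one C, one R and three P repeats, with two P repeats of length 6, matching the distribution shown.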

Here is another example, showing how alignments for differences repeats are reported.

# repfind.x -p -l 8 -e 2 -s sample.seq
# 16 2 8 sample.seq
8 2 P 8 2 0 1.10e-03 gctcgagc
9 1 P 11 2 2 6.14e-02
a--gctcgagc                                                        11
 !!        
agcgctcgagc
10 5 P 8 6 2 1.92e-01
cgagcgctgc                                                         10
 !      ! 
c-agcgct-c
9 7 P 9 7 2 6.04e-01
agc-gctgct                                                         10
   !  !   
agcagc-gct

This output shows one exact palindromic repeat and three 2-differences palindromic repeats. The two instances of a 2-differences repeat are shown in an alignment, directly following the remaining repeat information. Columns with insertions, deletions or mismatches are marked by the symbol !. The last position of the alignment in every row is shown on the right. If the repeat instances are longer than the width of a line, the alignment is split across several lines.
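The marker row can be reproduced from the two aligned rows, which gives a simple consistency check when post-processing alignments. The following Python sketch assumes that the two rows have already been stripped of the trailing position number and padding, and that a gap is written as -. The function name marker_line is illustrative only.

```python
def marker_line(top: str, bottom: str) -> str:
    """Return a row with '!' under every column where the rows differ."""
    if len(top) != len(bottom):
        raise ValueError("aligned rows must have equal length")
    # A column differs for a mismatch or when one row has a gap ('-').
    return "".join(" " if a == b else "!" for a, b in zip(top, bottom))


# The three alignments from the example output above.
alignments = [
    ("a--gctcgagc", "agcgctcgagc"),
    ("cgagcgctgc", "c-agcgct-c"),
    ("agc-gctgct", "agcagc-gct"),
]

for top, bottom in alignments:
    marks = marker_line(top, bottom)
    print(top)
    print(marks)
    print(bottom)
    print("differences:", marks.count("!"))
```

Each of the three alignments has exactly two marked columns, in agreement with the value 2 in the sixth field of the corresponding repeat line.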

The output format is designed to contain all necessary information about the found repeats. The kind of information shown on each line can be distinguished according to the first character of the line:

With a simple awk or perl script, it should be easy to transform this output format into the desired one, depending on the application. Another way to change the output format is to write a function selectrepeat that always returns 0 (i.e. rejects the repeat) and prints the repeat in a format specified by the user.
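As an illustration of such a transformation, here is a minimal sketch in Python rather than awk or perl. It converts the repeat lines to tab-separated values and drops everything else. It assumes that repeat lines follow the layout l i S r j k e, optionally followed by the sequence, and that header lines and alignment rows never match this layout. The column names are taken from the format line above, and convert is a hypothetical helper, not part of the distribution.

```python
import csv
import io
import re

# Assumed layout of a repeat line: l i S r j k e [sequence]
REPEAT_RE = re.compile(
    r"^(\d+)\s+(\d+)\s+([FRCP])\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)"
    r"(?:\s+(\S+))?\s*$"
)


def convert(lines, out):
    """Write the repeat lines found in `lines` to `out` as TSV."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["l", "i", "S", "r", "j", "k", "e", "sequence"])
    for line in lines:
        match = REPEAT_RE.match(line)
        if match is None:
            continue  # '#' lines, alignment rows, blank lines
        # The sequence column is empty when no sequence is shown.
        writer.writerow(["" if g is None else g for g in match.groups()])


sample = [
    "# 16 2 8 sample.seq",
    "8 2 P 8 2 0 1.10e-03 gctcgagc",
    "9 1 P 11 2 2 6.14e-02",
    "a--gctcgagc                              11",
    " !!        ",
    "agcgctcgagc",
]
buffer = io.StringIO()
convert(sample, buffer)
print(buffer.getvalue(), end="")
```

To use this as a filter on real output, one could call convert(sys.stdin, sys.stdout) after importing sys and pipe the output of repfind into the script.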