REPselect

repselect selects interesting repeats from the output of repfind according to user-defined criteria. It passes repeats of a chosen length, degeneracy, or significance on to further analysis routines. It can also sort repeats according to different criteria. The input for repselect is a file produced by repfind or repselect using option -b for binary output. The output of repselect goes to standard output.

The options for repselect are as follows:

-la
sort the repeats in ascending order of their length. In case the repeat is a k-differences repeat, the length of the repeat instance which starts first is used.

-ld
sort the repeats in descending order of their length. In case the repeat is a k-differences repeat, the length of the repeat instance which starts first is used.

-ia
sort the repeats in ascending order of their first position. We usually denote this position by i, hence the name of the option.

-id
sort the repeats in descending order of their first position.

-ja
sort the repeats in ascending order of their second position. We usually denote this position by j, hence the name of the option.

-jd
sort the repeats in descending order of their second position.

-ea
sort the repeats in ascending order of their E-value.

-ed
sort the repeats in descending order of their E-value.

-selfun filename.so
access file filename.so, which must be a shared object file containing a selection function selectrepeat. If such a function cannot be accessed, the program terminates with error code 1. For details, see Section Specifying a Selection Function below.

-first m
only show the first m repeats, according to the sorting criteria.

-s [filename]
additionally report the repeated substrings or an alignment of the two instances of the repeat. For details, see Section Output. The name of the file containing the input sequence is encoded in the input file for repselect. This filename may no longer be valid: the file may have been moved or renamed, or it may be in a different directory on the machine where repselect is run. In such a case, the optional argument to option -s tells repselect the correct filename where the input sequence can be found.

-lw q
set the linewidth to q. That is, the substrings or the alignment reported by option -s are formatted to q symbols per line. If this option is not used, the default linewidth is 60.

-iub
pairs of different residues in a mismatch repeat are shown as a single wildcard in IUB-format.

-i
only give statistics (preview information) about the lengths of the different repeats found.

-b
report output in binary format. This makes it possible to refine the output step by step, while keeping track of the intermediate selection steps.

-v
show the version of the program and terminate.

-help
show a summary of all options and terminate.

Note the following when combining the options:



Specifying a Selection Function

A selection function must be declared with the following function header:

int selectrepeat(unsigned char *seq,unsigned int seqlen,Repeat *rep)

The first argument seq of selectrepeat points to the input sequence, possibly after replacing or deleting wildcards according to the options used for repfind. Moreover, instead of the bases A, C, G, T, seq contains integers 0, 1, 2, 3 encoding the bases. That is, 0 stands for A, 1 for C, 2 for G, and 3 for T. To show the base at position i, one can simply write ALPHABET[seq[i]] in a C-statement. To refer to these integer codes, one can use the symbolic constants ACODE, CCODE, GCODE, and TCODE.

The second argument seqlen of selectrepeat is the length of the input sequence. That is, the bases of the input sequence are addressed by an index in the range [0,...,seqlen-1].
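As an illustration of the encoding described above, the following sketch turns a range of integer codes back into a string of bases. The helper decode_bases is hypothetical and not part of select.h. It assumes that positions are 0-based indices into seq and that any code larger than 3 (for example a remaining wildcard) should be treated as an error.

```c
#include <stddef.h>

#define ALPHABET "acgt"   /* as defined in select.h */

/* Hypothetical helper (not part of select.h): decode the len codes
   starting at seq[start] into the NUL-terminated string out, which
   has room for outsize characters. Returns 0 on success and -1 if
   the range lies outside the sequence, the buffer is too small, or
   a code other than 0..3 is found. */
static int decode_bases(const unsigned char *seq, unsigned int seqlen,
                        unsigned int start, unsigned int len,
                        char *out, size_t outsize)
{
  unsigned int i;

  if (start > seqlen || len > seqlen - start || (size_t) len + 1 > outsize)
  {
    return -1;
  }
  for (i = 0; i < len; i++)
  {
    if (seq[start + i] > 3)   /* assumption: only codes 0..3 are bases */
    {
      return -1;
    }
    out[i] = ALPHABET[seq[start + i]];
  }
  out[len] = '\0';
  return 0;
}
```

Inside a selection function, such a helper could be called with rep->start1 and rep->length1 to inspect the first instance of a repeat.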

The third argument rep of selectrepeat refers to a repeat record, containing all necessary information about a repeat. The C-declaration of the type Repeat can be found in the file select.h, which is part of the binary distribution of REPuter, see below.

The function selectrepeat is applied to each repeat. If it returns a value different from 0, the repeat is shown. If the returned value is 0, then it is rejected and not shown.

For example, the following function accepts repeats of length at most 200.

int selectrepeat(unsigned char *seq,unsigned int seqlen,Repeat *rep)
{
  if(rep->length1 <= 200)
  {
    return 1;  /* accept */
  } else
  {
    return 0;  /* reject */
  }
}

Other examples of selection functions can be found in the subdirectory SELECT. This subdirectory also contains the file select.h and a makefile, showing how to compile a shared object for the supported platforms.
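A selection function can also inspect the sequence itself. The following sketch accepts a repeat if at least half of the bases in its first instance are C or G. It is only an illustration: the 50% threshold is arbitrary, start1 is assumed to be a 0-based index into seq, and the declarations are copied here only to keep the example self-contained. In a real build they would come from select.h.

```c
/* These declarations mirror select.h; include select.h instead
   when compiling a real shared object. */
#define CCODE 1
#define GCODE 2

typedef enum { FKIND = 0, PKIND, RKIND, CKIND } Kind;
typedef enum { Eqlenreptype = 0, Difflenreptype } Reptype;

typedef struct
{
  Reptype reptype;
  int distance;
  double evalue;
  Kind kind;
  unsigned int length1, start1, length2, start2;
} Repeat;

int selectrepeat(unsigned char *seq, unsigned int seqlen, Repeat *rep)
{
  unsigned int i, gc = 0;

  /* reject empty instances and instances outside the sequence */
  if (rep->length1 == 0 || rep->start1 > seqlen ||
      rep->length1 > seqlen - rep->start1)
  {
    return 0;
  }
  /* count the bases C and G in the first instance */
  for (i = 0; i < rep->length1; i++)
  {
    unsigned char code = seq[rep->start1 + i];

    if (code == CCODE || code == GCODE)
    {
      gc++;
    }
  }
  /* accept if C and G make up at least half of the instance */
  return (gc >= rep->length1 - gc) ? 1 : 0;
}
```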



The File select.h

The codes for the different bases are defined in the following lines:

#define ACODE           0          /* the integer code for base A */
#define CCODE           1          /* the integer code for base C */
#define GCODE           2          /* the integer code for base G */
#define TCODE           3          /* the integer code for base T */
#define ALPHABET        "acgt"     /* transform codes into bases  */

To distinguish forward, palindromic, reverse, and complemented repeats, we use the following type:

typedef enum
{
  FKIND = 0,      /* forward repeat      */
  PKIND,          /* palindromic repeat  */
  RKIND,          /* reversed repeat     */
  CKIND           /* complemented repeat */
} Kind;
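When debugging a selection function, it can help to print the kind of a repeat in readable form. The following sketch maps each value of Kind to a string. The function kind_name is hypothetical and not part of select.h. The enumeration is repeated only to keep the example self-contained.

```c
/* Mirrors the declaration in select.h. */
typedef enum { FKIND = 0, PKIND, RKIND, CKIND } Kind;

/* Hypothetical helper: return a readable name for a repeat kind. */
static const char *kind_name(Kind kind)
{
  switch (kind)
  {
    case FKIND: return "forward";
    case PKIND: return "palindromic";
    case RKIND: return "reversed";
    case CKIND: return "complemented";
  }
  return "unknown";   /* value outside the enumeration */
}
```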

There are basically two types of repeats. Exact and mismatch repeats refer to sequences of equal length. Differences repeats may refer to sequences of different length. We use the following enumeration to distinguish these types of repeats.

typedef enum
{
  Eqlenreptype = 0,    /* exact and mismatch repeat */
  Difflenreptype       /* differences repeat        */
} Reptype;

A repeat is represented by a record of the following type. If the type of the repeat is Eqlenreptype, then the component length2 is not defined, and both instances of the repeat have length length1. For the distance value, the rules specified in Section Basic Notions hold. If the E-value of a repeat is smaller than 10^-300, then evalue=0.00.

typedef struct
{
  Reptype reptype;      /* type of the repeat                           */
  int distance;         /* distance between the two repeat instances    */
  double evalue;        /* E-value                                      */
  Kind kind;            /* one of the values FKIND, PKIND, CKIND, RKIND */
  unsigned int length1, /* the length of the first instance             */
               start1,  /* the starting position of the first instance  */
               length2, /* the length of the second instance            */
               start2;  /* the starting position of the second instance,*/
                        /* if defined                                   */
} Repeat;
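Because length2 is not defined for repeats of type Eqlenreptype, code that needs the length of the second instance has to check reptype first. The following sketch shows one way to do this. The helper second_length and the minimum length of 20 are hypothetical and chosen only for illustration. The declarations repeat those shown above so that the example is self-contained.

```c
/* These declarations mirror select.h; include select.h instead
   when compiling a real shared object. */
typedef enum { FKIND = 0, PKIND, RKIND, CKIND } Kind;
typedef enum { Eqlenreptype = 0, Difflenreptype } Reptype;

typedef struct
{
  Reptype reptype;
  int distance;
  double evalue;
  Kind kind;
  unsigned int length1, start1, length2, start2;
} Repeat;

/* Hypothetical helper: length of the second instance. For exact
   and mismatch repeats, length2 is undefined and both instances
   have length length1. */
static unsigned int second_length(const Repeat *rep)
{
  return (rep->reptype == Eqlenreptype) ? rep->length1 : rep->length2;
}

/* Accept a repeat only if both instances have at least 20 bases. */
int selectrepeat(unsigned char *seq, unsigned int seqlen, Repeat *rep)
{
  (void) seq;      /* the sequence itself is not needed here */
  (void) seqlen;
  return (rep->length1 >= 20 && second_length(rep) >= 20) ? 1 : 0;
}
```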