Alphabet handling


Detailed Description

Conversion between printable characters and binary symbols.

Data Structures

struct  fid_ArraySymbol
 An array of symbols, i.e., a sequence of dynamic size.
struct  fid_Alphabet
 Definition of an alphabet.

Defines

#define fid_SYMFMT   "%hhu"
 Format string for printing the numeric value of a fid_Symbol.
#define fid_SEPARATOR   ((fid_Symbol)UCHAR_MAX)
 Special symbol: sequence separator.
#define fid_WILDCARD   ((fid_Symbol)(UCHAR_MAX-1))
 Special symbol: wildcard character.
#define fid_UNDEF   ((fid_Symbol)(UCHAR_MAX-2))
 Special symbol: undefined symbol.
#define fid_SYMBOLMAX   ((fid_Symbol)(UCHAR_MAX-3))
 Maximum allowed value for a symbol.
#define fid_REGULARSYMBOL(S)   ((S) <= fid_UNDEF)
 Check whether symbol S is a sequence symbol or not.
#define fid_SPECIALSYMBOL(S)   ((S) > fid_UNDEF)
 The opposite of fid_REGULARSYMBOL().
#define fid_PRINT_SYMBOL(ALPHA, S)
 Transform binary symbol into its printable form, honoring specials.
#define fid_CHAR_AS_INDEX(C)   ((size_t)((unsigned char)(C)))
 Type cast printable character into unsigned array index.

Typedefs

typedef unsigned char fid_Symbol
 Use this type to denote a binary transformed sequence symbol.

Enumerations

enum  fid_Alphabettype { fid_ALPHABET_DNA, fid_ALPHABET_RNA, fid_ALPHABET_DNARNA, fid_ALPHABET_PROTEIN }
 Identifiers of built-in alphabets.

Functions

int fid_alphabet_init_from_speclines (fid_Alphabet *alpha, const char *str, size_t len, fid_Error *error)
 Parse alphabet definition and fill alphabet structure.
int fid_alphabet_init_from_specfile (fid_Alphabet *alpha, const char *filename, fid_Error *error)
 Parse alphabet definition file and fill alphabet structure.
int fid_alphabet_init_from_string (fid_Alphabet *alpha, const char *string, size_t length, fid_Error *error)
 Determine alphabet from ASCII text.
void fid_alphabet_init_standard (fid_Alphabet *alpha, fid_Alphabettype type)
 Assign standard alphabet to alphabet structure.
int fid_alphabet_add_wildcard (fid_Alphabet *alpha, char wcchar, fid_Error *error)
 Add wildcard character to alphabet.
size_t fid_alphabet_transform_string (const fid_Alphabet *alpha, const char *string, size_t length, fid_Symbol *transformed, int no_special_symbols)
 Transform string according to given alphabet.
size_t fid_alphabet_transform_string_inplace (const fid_Alphabet *alpha, char *string, size_t length, int no_special_symbols)
 Transform string according to given alphabet.
fid_Symbol * fid_alphabet_transform_string_new (const fid_Alphabet *alpha, const char *string, size_t length, int no_special_symbols, fid_Error *error)
 Transform string according to given alphabet into new buffer.
int fid_alphabet_write_to_file (const fid_Alphabet *alpha, const char *basefilename, fid_Error *error)
 Write textual representation of alphabet to file.
void fid_alphabet_dump (const fid_Alphabet *alpha, FILE *stream)
 Print alphabet to output stream.

Define Documentation

#define fid_SYMFMT   "%hhu"

Format string for printing the numeric value of a fid_Symbol.

Definition at line 48 of file alphabet.h.

Referenced by fid_alphabet_add_wildcard(), fid_alphabet_dump(), and fid_alphabet_init_from_string().

#define fid_SEPARATOR   ((fid_Symbol)UCHAR_MAX)

Special symbol: sequence separator.

Definition at line 58 of file alphabet.h.

Referenced by fid_alphabet_dump(), and fid_sequences_dump_range().

#define fid_WILDCARD   ((fid_Symbol)(UCHAR_MAX-1))

Special symbol: wildcard character.

#define fid_UNDEF   ((fid_Symbol)(UCHAR_MAX-2))

Special symbol: undefined symbol.

Definition at line 68 of file alphabet.h.

Referenced by fid_alphabet_add_wildcard(), fid_alphabet_dump(), fid_alphabet_init_from_speclines(), and fid_alphabet_init_from_string().

#define fid_SYMBOLMAX   ((fid_Symbol)(UCHAR_MAX-3))

Maximum allowed value for a symbol.

Note that this is not the maximum number of symbols, but the maximum allowed value of a symbol.

Definition at line 76 of file alphabet.h.

Referenced by fid_alphabet_add_wildcard(), fid_alphabet_dump(), fid_alphabet_init_from_string(), fid_alphabet_transform_string(), and fid_alphabet_write_to_file().

#define fid_REGULARSYMBOL ( S )     ((S) <= fid_UNDEF)

Check whether symbol S is a sequence symbol or not.

Note that undefined characters are considered regular symbols, but wildcards and sequence separators are not.

Parameters:
S A symbol of type fid_Symbol.

Definition at line 86 of file alphabet.h.

Referenced by fid_suffixarray_find_embedded_interval(), fid_suffixarray_get_intervals(), fid_suffixarray_traverse(), and fid_suffixinterval_lcpvalue().
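
As a quick illustration of the note above, the following sketch repeats the macro definitions from this page (so that it compiles without the library headers) and counts the regular symbols in an encoded buffer. The helper name count_regular is hypothetical and not part of the library.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Definitions repeated from this page so the sketch is self-contained. */
typedef unsigned char fid_Symbol;
#define fid_SEPARATOR ((fid_Symbol)UCHAR_MAX)
#define fid_WILDCARD  ((fid_Symbol)(UCHAR_MAX-1))
#define fid_UNDEF     ((fid_Symbol)(UCHAR_MAX-2))
#define fid_REGULARSYMBOL(S) ((S) <= fid_UNDEF)
#define fid_SPECIALSYMBOL(S) ((S) > fid_UNDEF)

/* Hypothetical helper: count the regular symbols in an encoded buffer.
 * fid_UNDEF counts as regular; wildcards and separators do not. */
static size_t count_regular(const fid_Symbol *buf, size_t n)
{
    size_t i, count = 0;

    assert(buf != NULL || n == 0);
    for (i = 0; i < n; ++i)
        if (fid_REGULARSYMBOL(buf[i]))
            ++count;
    return count;
}
```

For a buffer holding two ordinary symbols followed by fid_UNDEF, fid_WILDCARD and fid_SEPARATOR, the helper counts three regular symbols, because only the last two values exceed fid_UNDEF.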

#define fid_SPECIALSYMBOL ( S )     ((S) > fid_UNDEF)

The opposite of fid_REGULARSYMBOL().

#define fid_PRINT_SYMBOL ( ALPHA,
S 
)

Value:

((S) == fid_UNDEF\
     ?'~'\
     :((S) == fid_SEPARATOR\
         ?'|'\
         :(ALPHA)->sym_to_char[(size_t)(S)]))
Transform binary symbol into its printable form, honoring specials.

Use this macro whenever alphabet-encoded sequences are presented to human readers. The given symbol is transformed into its printable form, so that undefined symbols and sequence separators are also printed correctly.

Parameters:
ALPHA A pointer to a fid_Alphabet structure.
S A symbol encoded by and to be decoded via alphabet ALPHA.
Returns:
A printable character.

Definition at line 141 of file alphabet.h.

Referenced by fid_sequences_dump_range(), fid_suffixarray_dump_intervals(), and fid_suffixinterval_dump().
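
The following sketch shows how the macro might be used to decode a whole buffer. It repeats the macro from the Value section above. The structure demo_Alphabet is only a stand-in for fid_Alphabet: it models the sym_to_char member alone, and the element type and array size chosen here are assumptions. The helper demo_decode is hypothetical as well.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned char fid_Symbol;
#define fid_SEPARATOR ((fid_Symbol)UCHAR_MAX)
#define fid_UNDEF     ((fid_Symbol)(UCHAR_MAX-2))
#define fid_PRINT_SYMBOL(ALPHA, S)\
    ((S) == fid_UNDEF\
     ?'~'\
     :((S) == fid_SEPARATOR\
         ?'|'\
         :(ALPHA)->sym_to_char[(size_t)(S)]))

/* Stand-in for fid_Alphabet; only the member used by the macro is
 * modelled, and its element type and size are assumptions. */
struct demo_Alphabet {
    char sym_to_char[UCHAR_MAX + 1];
};

/* Hypothetical helper: decode n symbols into a NUL-terminated string.
 * out must provide room for n + 1 characters. */
static void demo_decode(const struct demo_Alphabet *alpha,
                        const fid_Symbol *syms, size_t n, char *out)
{
    size_t i;

    assert(alpha != NULL && syms != NULL && out != NULL);
    for (i = 0; i < n; ++i)
        out[i] = fid_PRINT_SYMBOL(alpha, syms[i]);
    out[n] = '\0';
}
```

Note that the macro, as shown in the Value section, does not special-case fid_WILDCARD; a wildcard symbol is looked up in sym_to_char like any other value.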

#define fid_CHAR_AS_INDEX ( C )     ((size_t)((unsigned char)(C)))

Type cast printable character into unsigned array index.

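Converting a signed char directly into a larger unsigned type can go very wrong: a negative value is converted to a huge index. This macro casts to unsigned char first, so the result always lies in the range 0 to UCHAR_MAX. Use it for accessing fid_Alphabet::char_to_sym.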

Parameters:
C A printable character, type char.
Returns:
An array index.

Definition at line 159 of file alphabet.h.

Referenced by fid_alphabet_add_wildcard(), fid_alphabet_init_from_speclines(), fid_alphabet_init_from_string(), and fid_alphabet_transform_string().
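
To make the pitfall concrete: on platforms where char is signed, a character such as (char)0xE4 is negative, and casting it straight to size_t produces a value far beyond any table size. The sketch below uses the macro to index a byte histogram safely. The helper demo_histogram is hypothetical and only serves as an example of a table indexed by character.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define fid_CHAR_AS_INDEX(C) ((size_t)((unsigned char)(C)))

/* Hypothetical helper: build a byte histogram of the first n characters.
 * counts must have UCHAR_MAX + 1 entries. */
static void demo_histogram(const char *s, size_t n, size_t *counts)
{
    size_t i;

    assert(s != NULL && counts != NULL);
    for (i = 0; i <= UCHAR_MAX; ++i)
        counts[i] = 0;
    for (i = 0; i < n; ++i)
        ++counts[fid_CHAR_AS_INDEX(s[i])];   /* index is 0..UCHAR_MAX */
}
```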


Typedef Documentation

typedef unsigned char fid_Symbol

Use this type to denote a binary transformed sequence symbol.

This type has been introduced purely for documentation purposes. There is no, and probably never will be, any form of wide character or Unicode support. It would be safe to use unsigned char everywhere, but using fid_Symbol instead makes the code much more readable and understandable.

Definition at line 43 of file alphabet.h.


Enumeration Type Documentation

enum fid_Alphabettype

Identifiers of built-in alphabets.

An alphabet structure can be initialized by a library function to define one of the standard alphabets supported by the library.

Enumerator:
fid_ALPHABET_DNA  Standard DNA alphabet with common wildcards.
fid_ALPHABET_RNA  Standard RNA alphabet with common wildcards.
fid_ALPHABET_DNARNA  Mixed DNA and RNA alphabet, with T and U being defined equivalent.
fid_ALPHABET_PROTEIN  Standard amino acid alphabet.

Definition at line 167 of file alphabet.h.


Function Documentation

int fid_alphabet_init_from_speclines ( fid_Alphabet *  alpha,
const char *  str,
size_t  len,
fid_Error *  error 
)

Parse alphabet definition and fill alphabet structure.

The alphabet definition consists of multiple lines, each containing characters to be considered as equal. Thus, each line defines a character class. Symbols are assigned to character classes in increasing order. Lines starting with a '#' character are treated as comments.

Parameters:
alpha The structure to be filled according to the passed definition.
str Alphabet definition file content.
len Length of str. If 0, then the length will be determined using strlen(3).
error Error messages go here.
Returns:
0 on success, -1 on error.

Definition at line 54 of file alphabet.c.

References fid_Alphabet::char_to_sym, fid_CHAR_AS_INDEX, fid_error_throw(), fid_UNDEF, fid_WILDCARD, fid_Alphabet::num_of_chars, fid_Alphabet::num_of_syms, and fid_Alphabet::sym_to_char.

Referenced by fid_alphabet_init_from_specfile(), and fid_alphabet_init_standard().
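
The rules described above can be sketched as follows. This is not the library's implementation; it only covers what the description states (one character class per line, symbols assigned in increasing order, '#' lines skipped). The structure demo_Alphabet and the function demo_parse_speclines are hypothetical stand-ins. It is further assumed that empty lines are skipped and that the first character of a line is the printable representative of its class; wildcard handling is left out entirely.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char fid_Symbol;
#define fid_UNDEF     ((fid_Symbol)(UCHAR_MAX-2))
#define fid_SYMBOLMAX ((fid_Symbol)(UCHAR_MAX-3))
#define fid_CHAR_AS_INDEX(C) ((size_t)((unsigned char)(C)))

/* Stand-in for fid_Alphabet; member types and sizes are assumptions. */
struct demo_Alphabet {
    fid_Symbol char_to_sym[UCHAR_MAX + 1];
    char sym_to_char[UCHAR_MAX + 1];
    size_t num_of_syms;
};

/* Sketch of the documented rules only. Returns 0 on success, -1 if more
 * classes are given than there are regular symbol values. */
static int demo_parse_speclines(struct demo_Alphabet *alpha,
                                const char *str, size_t len)
{
    size_t pos = 0;

    assert(alpha != NULL && str != NULL);
    if (len == 0)
        len = strlen(str);
    memset(alpha->char_to_sym, fid_UNDEF, sizeof alpha->char_to_sym);
    memset(alpha->sym_to_char, 0, sizeof alpha->sym_to_char);
    alpha->num_of_syms = 0;

    while (pos < len) {
        size_t end = pos, i;

        while (end < len && str[end] != '\n')
            ++end;                        /* [pos, end) is one line */
        if (end > pos && str[pos] != '#') {
            if (alpha->num_of_syms > fid_SYMBOLMAX)
                return -1;                /* values 0..fid_SYMBOLMAX used up */
            for (i = pos; i < end; ++i)   /* whole line maps to one symbol */
                alpha->char_to_sym[fid_CHAR_AS_INDEX(str[i])] =
                    (fid_Symbol)alpha->num_of_syms;
            alpha->sym_to_char[alpha->num_of_syms] = str[pos];
            ++alpha->num_of_syms;
        }
        pos = end + 1;
    }
    return 0;
}
```

With the definition "# DNA\naA\ncC\ngG\ntTuU\n", the sketch yields four symbols: 'a' and 'A' map to 0, 'c' and 'C' to 1, 'g' and 'G' to 2, and 't', 'T', 'u', 'U' to 3. All other characters stay at fid_UNDEF.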

int fid_alphabet_init_from_specfile ( fid_Alphabet *  alpha,
const char *  filename,
fid_Error *  error 
)

Parse alphabet definition file and fill alphabet structure.

The alphabet definition is read from file. See fid_alphabet_init_from_speclines() for details.

Parameters:
alpha The structure to be filled according to the alphabet definition from the file.
filename Name of the file containing an alphabet definition.
error Error messages go here.
Returns:
0 on success, -1 on error.

Definition at line 170 of file alphabet.c.

References fid_Mappedfile::content, fid_alphabet_init_from_speclines(), fid_error_throw(), fid_file_map(), fid_file_unmap(), and fid_Mappedfile::occupied.

Referenced by fid_suffixarray_load_from_files().

int fid_alphabet_init_from_string ( fid_Alphabet *  alpha,
const char *  str,
size_t  len,
fid_Error *  error 
)

Determine alphabet from ASCII text.

Every character in the passed string is put into the alphabet as a regular symbol. No attempt is made to decode any text format other than ASCII (such as UTF-8), and no wildcards are added by this function.

Parameters:
alpha The structure to be filled according to the text string.
str Arbitrary text string.
len Length of str. If 0, then the length will be determined using strlen(3).
error Error messages go here.
Returns:
0 on success, -1 on error.

Definition at line 206 of file alphabet.c.

References fid_Alphabet::char_to_sym, fid_CHAR_AS_INDEX, fid_error_throw(), fid_SYMBOLMAX, fid_SYMFMT, fid_UNDEF, fid_Alphabet::num_of_chars, fid_Alphabet::num_of_syms, and fid_Alphabet::sym_to_char.
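
A minimal sketch of the idea, not the library's code: every distinct character of the text receives its own regular symbol. The function demo_alphabet_from_text is hypothetical, and numbering the symbols in order of first appearance is an assumption; the description above does not say which order the library uses.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char fid_Symbol;
#define fid_UNDEF     ((fid_Symbol)(UCHAR_MAX-2))
#define fid_SYMBOLMAX ((fid_Symbol)(UCHAR_MAX-3))
#define fid_CHAR_AS_INDEX(C) ((size_t)((unsigned char)(C)))

/* Sketch: give every distinct character of str its own regular symbol.
 * Assumption: symbols are numbered in order of first appearance.
 * char_to_sym must have UCHAR_MAX + 1 entries.
 * Returns the number of symbols assigned, or -1 if the text contains
 * more distinct characters than there are regular symbol values. */
static int demo_alphabet_from_text(fid_Symbol *char_to_sym,
                                   const char *str, size_t len)
{
    size_t i;
    int num_of_syms = 0;

    assert(char_to_sym != NULL && str != NULL);
    if (len == 0)
        len = strlen(str);
    memset(char_to_sym, fid_UNDEF, UCHAR_MAX + 1);
    for (i = 0; i < len; ++i) {
        size_t idx = fid_CHAR_AS_INDEX(str[i]);

        if (char_to_sym[idx] != fid_UNDEF)
            continue;                     /* character already known */
        if (num_of_syms > fid_SYMBOLMAX)
            return -1;                    /* values 0..fid_SYMBOLMAX used up */
        char_to_sym[idx] = (fid_Symbol)num_of_syms++;
    }
    return num_of_syms;
}
```

Under these assumptions the text "banana" produces three symbols, with 'b', 'a' and 'n' mapped to 0, 1 and 2.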

void fid_alphabet_init_standard ( fid_Alphabet *  alpha,
fid_Alphabettype  type 
)

Assign standard alphabet to alphabet structure.

Several commonly used alphabets are defined within this library. The type of the desired alphabet is selected by an alphabet identifier.

Parameters:
alpha The structure to be filled.
type Identifier of a standard alphabet.

Definition at line 265 of file alphabet.c.

References fid_ALPHABET_DNA, fid_ALPHABET_DNARNA, fid_alphabet_init_from_speclines(), fid_ALPHABET_PROTEIN, and fid_ALPHABET_RNA.

int fid_alphabet_add_wildcard ( fid_Alphabet *  alpha,
char  wcchar,
fid_Error *  error 
)

Add wildcard character to alphabet.

This function is particularly useful for alphabets initialized via fid_alphabet_init_from_string(), which never adds wildcards itself.

Parameters:
alpha The alphabet the wildcard should be added to.
wcchar ASCII representation of the wildcard. If wcchar is already mapped to the wildcard symbol by the alphabet, then this function does not change the alphabet. If wcchar is already mapped to some regular symbol by the alphabet, then this function returns an error. Note that wcchar must not be 0.
error Error messages go here.
Returns:
0 on success, -1 on error.

Definition at line 297 of file alphabet.c.

References fid_Alphabet::char_to_sym, fid_CHAR_AS_INDEX, fid_error_throw(), fid_SYMBOLMAX, fid_SYMFMT, fid_UNDEF, fid_WILDCARD, fid_Alphabet::num_of_chars, fid_Alphabet::num_of_syms, and fid_Alphabet::sym_to_char.
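
The rules stated for wcchar can be summarized in a small sketch. It operates on a bare lookup table instead of a fid_Alphabet, and the function demo_add_wildcard is hypothetical. Two points are assumptions: a character counts as unmapped when its table entry is fid_UNDEF, and a wcchar of 0 is rejected with an error. The bookkeeping the real function presumably does on the other alphabet members is not modelled.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

typedef unsigned char fid_Symbol;
#define fid_WILDCARD ((fid_Symbol)(UCHAR_MAX-1))
#define fid_UNDEF    ((fid_Symbol)(UCHAR_MAX-2))
#define fid_CHAR_AS_INDEX(C) ((size_t)((unsigned char)(C)))

/* Sketch of the documented rules; char_to_sym has UCHAR_MAX + 1 entries.
 * Returns 0 on success, -1 on error. */
static int demo_add_wildcard(fid_Symbol *char_to_sym, char wcchar)
{
    size_t idx = fid_CHAR_AS_INDEX(wcchar);

    assert(char_to_sym != NULL);
    if (wcchar == '\0')
        return -1;                  /* assumption: 0 is rejected */
    if (char_to_sym[idx] == fid_WILDCARD)
        return 0;                   /* already a wildcard, nothing to do */
    if (char_to_sym[idx] != fid_UNDEF)
        return -1;                  /* already mapped to another symbol */
    char_to_sym[idx] = fid_WILDCARD;
    return 0;
}
```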

size_t fid_alphabet_transform_string ( const fid_Alphabet *  alpha,
const char *  string,
size_t  length,
fid_Symbol *  transformed,
int  no_special_symbols 
)

Transform string according to given alphabet.

Parameters:
alpha Alphabet used for transformation.
string The input string to be transformed.
length The number of characters in string. If 0, then the length of string will be determined within this function. In this case the string must be zero terminated.
transformed Buffer the transformed string is written to. This must be of size at least the length of string.
no_special_symbols If unequal to 0, then stop transformation and return an error if a character in string is transformed into a special symbol.
Returns:
0 on success, or the index+1 of the first special symbol when special symbols were not allowed. Note that the transformation will be incomplete if a positive value is returned.

Definition at line 372 of file alphabet.c.

References fid_Alphabet::char_to_sym, fid_CHAR_AS_INDEX, and fid_SYMBOLMAX.

Referenced by fid_alphabet_transform_string_inplace(), and fid_alphabet_transform_string_new().
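
The return convention is easiest to see in code. The sketch below mimics the documented behaviour on a bare lookup table; demo_transform is a hypothetical stand-in, not the library function. Treating every value above fid_SYMBOLMAX as "special" is an assumption, inferred only from the fact that the function references fid_SYMBOLMAX.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

typedef unsigned char fid_Symbol;
#define fid_UNDEF     ((fid_Symbol)(UCHAR_MAX-2))
#define fid_SYMBOLMAX ((fid_Symbol)(UCHAR_MAX-3))
#define fid_CHAR_AS_INDEX(C) ((size_t)((unsigned char)(C)))

/* Sketch: char_to_sym stands in for fid_Alphabet::char_to_sym and must
 * have UCHAR_MAX + 1 entries. Returns 0 on success, or index + 1 of the
 * first offending character when no_special_symbols is nonzero. */
static size_t demo_transform(const fid_Symbol *char_to_sym,
                             const char *string, size_t length,
                             fid_Symbol *transformed,
                             int no_special_symbols)
{
    size_t i;

    assert(char_to_sym != NULL && string != NULL && transformed != NULL);
    if (length == 0)
        length = strlen(string);    /* string must be zero-terminated */
    for (i = 0; i < length; ++i) {
        fid_Symbol sym = char_to_sym[fid_CHAR_AS_INDEX(string[i])];

        if (no_special_symbols && sym > fid_SYMBOLMAX)
            return i + 1;           /* output is incomplete from here on */
        transformed[i] = sym;
    }
    return 0;
}
```

Returning index+1 rather than the index itself keeps 0 free to signal success, so a caller can subtract 1 from any positive result to locate the offending character.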

size_t fid_alphabet_transform_string_inplace ( const fid_Alphabet *  alpha,
char *  string,
size_t  length,
int  no_special_symbols 
)

Transform string according to given alphabet.

This function replaces the original string by the transformed string in the same buffer.

Parameters:
alpha Alphabet used for transformation.
string The string to be transformed in place.
length The number of characters in string. If 0, then the length of string will be determined within this function. In this case the string must be zero terminated.
no_special_symbols If unequal to 0, then stop transformation and return an error if a character in string is transformed into a special symbol.
Returns:
0 on success, or the index+1 of the first special symbol when special symbols were not allowed. Note that the transformation will be incomplete if a positive value is returned.

Definition at line 434 of file alphabet.c.

References fid_alphabet_transform_string().

fid_Symbol* fid_alphabet_transform_string_new ( const fid_Alphabet *  alpha,
const char *  string,
size_t  length,
int  no_special_symbols,
fid_Error *  error 
)

Transform string according to given alphabet into new buffer.

This function allocates a buffer large enough to hold the transformed string and returns that to the caller.

Parameters:
alpha Alphabet used for transformation.
string The input string to be transformed.
length The number of characters in string. If 0, then the length of string will be determined within this function. In this case the string must be zero terminated.
no_special_symbols If unequal to 0, then stop transformation and return an error if a character in string is transformed into a special symbol.
error Error messages go here.
Returns:
A pointer to an allocated buffer holding the transformed string. A NULL pointer may be returned in the following cases:
  • The length of string was found to be 0.
  • The input string contains at least one character that is transformed into a special symbol by the alphabet. An appropriate error message will be added to error in this case.
  • No memory could be allocated. An appropriate error message will be added to error in this case (if possible at all in this condition).

Definition at line 469 of file alphabet.c.

References fid_alphabet_transform_string(), fid_error_throw(), and fid_OUTOFMEM.

int fid_alphabet_write_to_file ( const fid_Alphabet *  alpha,
const char *  basefilename,
fid_Error *  error 
)

Write textual representation of alphabet to file.

The function creates a new file and writes an alphabet definition file based on alpha to that file. The filename will have the extension "al1" appended to basefilename.

Parameters:
alpha Alphabet structure whose content shall be written to file.
basefilename The base filename of the enhanced suffix array.
error Error messages go here.
Returns:
0 on success, -1 on error.

Definition at line 523 of file alphabet.c.

References fid_Mappedfile::allocated, fid_Alphabet::char_to_sym, fid_Mappedfile::content, fid_file_allocate(), fid_file_unmap(), fid_filename_create(), fid_SYMBOLMAX, fid_WILDCARD, fid_Alphabet::num_of_chars, fid_Alphabet::num_of_syms, and fid_Mappedfile::occupied.

void fid_alphabet_dump ( const fid_Alphabet *  alpha,
FILE *  stream 
)

Print alphabet to output stream.

Parameters:
alpha The alphabet structure to be printed out.
stream An output stream to which the alphabet is printed. If NULL, nothing will be printed.

Definition at line 586 of file alphabet.c.

References fid_Alphabet::char_to_sym, fid_SEPARATOR, fid_SYMBOLMAX, fid_SYMFMT, fid_UNDEF, fid_WILDCARD, fid_Alphabet::num_of_chars, fid_Alphabet::num_of_syms, and fid_Alphabet::sym_to_char.


Generated on Wed Jul 8 17:21:16 2009 for Full-text Index Data structure library by  doxygen 1.5.9