libfid documentation

0.4.1

Preamble

This is the API documentation for the Full-text Index Data structure library (libfid).

The library libfid is published under the terms of the GNU Lesser General Public License Version 2.1 (or any later version), GNU LGPL for short. See file COPYING, shipped as part of the library sources, for the exact terms of licensing. Also included are a few programs, all published under the terms of the GNU General Public License Version 2 (or any later version), GNU GPL for short. See file COPYING, shipped as part of the library sources in directory tools, for the exact terms of licensing of these programs. See http://www.gnu.org/copyleft/ to learn more about these licenses.

Introduction

Searching for patterns in textual data is a task frequently assigned to computers, and many algorithms have been developed for solving it for all kinds of patterns. There are many classic pattern matching algorithms for efficiently finding exact occurrences of string patterns, such as simple strings or regular expressions, in a text, or approximate occurrences thereof under some model of similarity. String patterns, as well as other kinds of patterns, are also of interest in the field of computational biology. This field, however, faces an exponential increase in the amount of sequence data that needs to be searched. Thus, the time spent on searches is also increasing dramatically, since the running times of the search algorithms employed usually depend at least linearly on the size of the search space. Sometimes big cluster systems are the only practical way to tackle this problem.

With the decreasing costs of computer memory, be it RAM or hard disk, and the broad availability of 64-bit CPUs, a feasible alternative or complementary solution for searching in large data sets is the use of full-text index data structures such as the enhanced suffix array. Search algorithms operating on enhanced suffix arrays can often achieve sublinear running times with respect to the database size, at the cost of preprocessing the sequence data and storing an index for it on hard disk. For an introduction to suffix arrays in general, see [1]; enhanced suffix arrays are described in detail in [2].
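To illustrate why an index allows sublinear search, here is a minimal sketch of a binary search over a plain suffix array. It does not use the libfid API: the function name sa_find and its signature are hypothetical, the text is assumed to be a NUL-terminated C string, and the suffix array is assumed to list all suffix start positions in strcmp() order. Enhanced suffix arrays [2] carry additional tables that make such searches faster still; this sketch only shows the basic idea from [1].

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of libfid.
 * Looks for a suffix of text that starts with pattern by binary search over
 * the suffix array sa[0..n-1]. Returns an index into sa[] of one matching
 * suffix, or -1 if the pattern does not occur in the text.
 * Cost: O(m log n) character comparisons for a pattern of length m.
 */
long sa_find(const char *text, const size_t *sa, size_t n, const char *pattern)
{
    assert(text != NULL && sa != NULL && pattern != NULL);

    size_t m = strlen(pattern);
    size_t lo = 0, hi = n;                 /* search window is [lo, hi) */

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        /* Compare only the first m characters of the suffix. */
        int cmp = strncmp(text + sa[mid], pattern, m);
        if (cmp == 0)
            return (long)mid;              /* suffix starts with pattern */
        if (cmp < 0)
            lo = mid + 1;                  /* suffix sorts before pattern */
        else
            hi = mid;                      /* suffix sorts after pattern */
    }
    return -1;
}
```

For the text "banana" the sorted suffix start positions are 5, 3, 1, 0, 4, 2, so a call with the pattern "nan" returns index 5, and sa[5] = 2 is the position of the match in the text.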

This software library provides data structures for representing enhanced suffix arrays, and implements many operations frequently performed on these. Sequence data is generally transformed into binary representation using freely definable alphabets. The library expects enhanced suffix arrays to be stored in the format generated by mkvtree from the Vmatch package written by Stefan Kurtz. The slowbuildesa program that comes with libfid can also be used to construct enhanced suffix arrays, but be warned that slowbuildesa implements a naive suffix sorter based on quicksort and simple string comparisons, and thus is much slower than mkvtree. (The primary use of slowbuildesa is for running tests via make check, which is also the reason why it is not installed by make install.)
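To give an idea of what a naive suffix sorter based on a general-purpose sort and plain string comparisons looks like, here is a minimal sketch. It is not the slowbuildesa source: the names cmp_suffix and naive_suffix_sort are hypothetical, the input is assumed to be a NUL-terminated C string rather than an alphabet-transformed sequence, and the standard qsort() stands in for quicksort (the C standard does not prescribe which algorithm qsort() uses).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Compare two suffixes, given as pointers into the same text, with strcmp(). */
static int cmp_suffix(const void *a, const void *b)
{
    const char *x = *(const char *const *)a;
    const char *y = *(const char *const *)b;
    return strcmp(x, y);
}

/*
 * Hypothetical helper, not part of libfid.
 * Fills sa[0..n-1] with the start positions of all suffixes of text
 * (n = strlen(text)) in lexicographic order.
 * Returns 0 on success, -1 if memory allocation fails.
 * Each comparison may scan many characters, hence the poor running time
 * on long or repetitive inputs.
 */
int naive_suffix_sort(const char *text, size_t n, size_t *sa)
{
    assert(text != NULL && sa != NULL);
    if (n == 0)
        return 0;

    const char **ptrs = malloc(n * sizeof *ptrs);
    if (ptrs == NULL)
        return -1;

    for (size_t i = 0; i < n; i++)
        ptrs[i] = text + i;                /* one pointer per suffix */

    qsort(ptrs, n, sizeof *ptrs, cmp_suffix);

    for (size_t i = 0; i < n; i++)
        sa[i] = (size_t)(ptrs[i] - text);  /* pointer back to offset */

    free(ptrs);
    return 0;
}
```

Sorting pointers to the suffixes avoids a global variable for the text inside the comparison function. Specialized suffix sorters, such as the Deep-Shallow algorithm [4], avoid the repeated rescanning of shared prefixes that makes this approach slow.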

Please consider using our advanced tool mkesa [3] for enhanced suffix array construction, a vastly improved version of slowbuildesa based on a multithreaded Deep-Shallow [4] implementation. It is typically faster than mkvtree and also uses less space. mkesa is available as source code under the terms of the GNU General Public License Version 2 (or any later version), and is distributed as a separate package at http://bibiserv.techfak.uni-bielefeld.de/mkesa/.

Installation

Simple installation:

./configure [options] && make && make install

Optionally, run make check before installation to run the test suite. All tests should pass without errors, except for the first one, which will be skipped unless you have downloaded the extra test suite data (libfid-testdata.tar.gz) and extracted it below the source path of libfid.

By default, make install will install the library below /usr/local/. This can be changed by passing the appropriate options to configure, for example --prefix=DIR. Use ./configure --help to learn about the configuration options.

Hint: create a config.site file and let the environment variable CONFIG_SITE point to it. You can put your default compiler flags (optimization, paths, etc.) in there and avoid retyping them each time you need to invoke configure. All configure scripts generated by GNU Autoconf 2.x honor the CONFIG_SITE variable, so its use is not limited to libfid.

How to use libfid in your programs

Incorporating the functionalities implemented in libfid in other programs is easy as described next.

Optionally:

For a minimal source code example, see the code of program exactsearch.c from the test suite (though its command line handling is ugly).

Note that it is important to link against the correct library version since some data structures are augmented by extra data fields in debug mode. Hence, linking a program compiled without defining symbol DEBUG against the debug version of libfid will produce a program that is likely to crash, or to behave somewhat "funny" otherwise.
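The following sketch shows why such a mismatch is dangerous. It does not reproduce any real libfid data structure: both struct types and the function debug_field_shift are hypothetical and merely model one record as seen by code compiled without and with DEBUG defined.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record as seen by code compiled WITHOUT DEBUG. */
struct record_release {
    size_t length;
    const char *data;
};

/* The same hypothetical record as seen by code compiled WITH DEBUG:
 * one extra field shifts every following member. */
struct record_debug {
    unsigned long magic;   /* debug-only field */
    size_t length;
    const char *data;
};

/*
 * Returns by how many bytes the member "length" moves between the two
 * layouts. Any nonzero value means that a program and a library which
 * disagree about DEBUG read and write different memory for the same field.
 */
size_t debug_field_shift(void)
{
    size_t rel = offsetof(struct record_release, length);
    size_t dbg = offsetof(struct record_debug, length);
    assert(dbg >= rel);
    return dbg - rel;
}
```

In this model, a program built without DEBUG that passes such a record to a library built with DEBUG would have its length field interpreted as the debug-only field, which is the kind of silent corruption that leads to crashes.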

Runtime behavior

The environment variable FID_OPTIONS (see FID_OPTIONS_VARNAME and fid_options_parse()) can be set to influence how the library behaves in certain situations.

References

[1] U. Manber and E.W. Myers. Suffix Arrays: A New Method for On-Line String Searches. SIAM Journal on Computing, 22(5):935-948, 1993.

[2] M.I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing Suffix Trees with Enhanced Suffix Arrays. Journal of Discrete Algorithms, 2:53-86, 2004.

[3] R. Homann, D. Fleer, R. Giegerich, M. Rehmsmeier. mkESA: enhanced suffix array construction tool. Bioinformatics, 25(8):1084-1085, 2009.

[4] G. Manzini, P. Ferragina. Engineering a Lightweight Suffix Array Construction Algorithm. Algorithmica, 40(1):33-50, 2004.


Generated on Wed Jul 8 17:21:15 2009 for Full-text Index Data structure library by  doxygen 1.5.9