The newdist software package
provides implementations of three similarity measures of different
expression power to quantify genomic relationships on the basis of
the gene connection model.
All core methods are implemented in Python while some
helper scripts are written in BASH; Some of
the core methods will output integer linear programs
(ILPs) in a format suitable for IBM's solver
CPLEX.
newdist has the following system
requirements:
- Python 2.7
- NetworkX ≥ 1.10
- IBM ILOG CPLEX Optimization Studio
- GNU/Unix (optional)
- Biopython ≥ 1.6 (optional)
In the gene connection model, genomes are compared based on
their bipartite gene connection graph: Given
two genomes S and T, a
gene connection graph G(S,
T) of S and
T is a bipartite graph with one vertex for
each gene of S and one vertex for each gene of
T. An edge between two vertices, one from
S and one from T,
indicates that there is some connection between the two genes
represented by these vertices. The input to the methods in this
software package are gene connection graphs that can be constructed
from BLAST tables with the provided script
pairwise_similarities.py.
The provided methods facilitate the inter-species comparison
of gene orders between two genomes by means of counting
conserved adjacencies, which are defined as
follows: Given an integer θ ≥ 1, a pair of
index positions (i, i') with i' ≤ i
+ θ in a string is a
(θ-) adjacency.
Further, a pair of adjacencies between two genomes
S and T is
conserved if
- their corresponding genes are connected in their gene
connection graph G(S, T)
and
- their corresponding genes' relative orientation is
identical
The different similarity measures provided by
newdist are expressed by the following
three problem statements:
Problem 1 (total adjacency
model). Given two genomes S
and T and a gene connection graph
G(S,T), count the number of pairs of index
positions (i,i') in S and
(j,j') in T that form a
conserved adjacency. In other words, compute adj(S,T) =
|{(i, i',j, j')) | 1 ≤ i < i' ≤ |S|, 1 ≤ j
< j' ≤ |T| and (i,i' || j,j')}|.
Problem 2 (gene matching model).
Given two genomes S and
T, a gene connection graph
G(S,T) and a real-valued parameter α
∈ [0, 1], find a bipartite matching
M in G(S, T) such that
the induced sequences
SM and
TM maximize the measure
Fα (M ) = α · adj (S M , T M ) +
(1 − α) · edg (M), where edg(M) =
|M| is the size of matching M.
(The induced sequences
SM and
TMare the subsequences
of S and T, respectively, that contain those characters incident to
edges of M.)
Solving Problem 2 is NP-hard even for 1-adjacencies.
Therefore we provide a third, intermediate measure, which is more
efficient to compute in practice, while still producing one-to-one
correspondences between gene extremities. It is defined as the size
of the largest subset of non-conflicting conserved
adjacencies found in a pair of genomes, where two
conserved adjacencies are denoted
conflicting if their intervals in
either genome are overlapping.
Problem 3 (adjacency matching
model). Given two genomes S
and T and a gene connection graph
G(S,T), let C be the set
of conserved adjacencies between S and
T. Compute the size
|C*| of a maximum
cardinality set of non-conflicting conserved adjacencies
C* ⊆ C.
newdist provides the following
core scripts:
- enumerate_adjs.py - script to solve
Problem 1 and first part of Problem 3 (see paper);
- write_ffadj_ilp.py - script to construct an ILP
in CPLEX format solving Problem 2;
- matching_simple_adjs.py - script to
solve the second part of Problem 3 for 1-adjacencies;
- write_generalized_adjs_matching_ilp.py -
script to construct an ILP in CPLEX format solving Problem 3 for
θ-adjacencies with θ > 1;
- identify_anchors.py - script to
presolve simple subgraphs as preprocessing for ILPs constructed
by write_ffadj_ilp.py and
write_generalized_adjs_matching_ilp.py;
- sol_to_matching.py - a script to
extract a matching solution from '*.sol' files produced by
CPLEX.