All Downloads are FREE. Search and download functionalities are using the official Maven repository.

net.maizegenetics.analysis.gbs.neobio.CrochemoreLandauZivUkelson Maven / Gradle / Ivy

/*
 * CrochemoreLandauZivUkelson.java
 *
 * Copyright 2003 Sergio Anibal de Carvalho Junior
 *
 * This file is part of NeoBio.
 *
 * NeoBio is free software; you can redistribute it and/or modify it under the terms of
 * the GNU General Public License as published by the Free Software Foundation; either
 * version 2 of the License, or (at your option) any later version.
 *
 * NeoBio is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY;
 * without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR
 * PURPOSE. See the GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License along with NeoBio;
 * if not, write to the Free Software Foundation, Inc., 59 Temple Place, Suite 330,
 * Boston, MA 02111-1307, USA.
 *
 * Proper attribution of the author as the source of the software would be appreciated.
 *
 * Sergio Anibal de Carvalho Junior		mailto:[email protected]
 * Department of Computer Science		http://www.dcs.kcl.ac.uk
 * King's College London, UK			http://www.kcl.ac.uk
 *
 * Please visit http://neobio.sourceforge.net
 *
 * This project was supervised by Professor Maxime Crochemore.
 *
 */

package net.maizegenetics.analysis.gbs.neobio;

import java.io.Reader;
import java.io.IOException;

/**
 * This abstract class is the superclass of both global and local sequence alignment
 * algorithms (with linear gap penalty function) due to Maxime Crochemore, Gad Landau and
 * Michal Ziv-Ukelson (2002).
 *
 * 

This implementation derives from the paper of M.Crochemore, G.Landau and * M.Ziv-Ukelson, A Sub-quadratic Sequence Alignment Algorithm for Unrestricted Scoring * Matrices (available here as * PDF or * Postscript).

* *

It employs Lempel-Ziv compression techniques to speed up the classic dynamic * programmin approach to sequence alignment (see {@linkplain NeedlemanWunsch} and * {@linkplain SmithWaterman} classes). It reduces the time and space complexity from * O(n2) to O(n2/log n). In fact, the complexity is actually O(h * n2/log n), where 0 <= h <= 1 is a real number denoting the entropy of the * text (a measure of how compressible it is). This means that, the more compressible a * sequence is, the less memory the algorithm will require, and the faster it will * run.

* *

The idea behind this improvement is to identify repetitions in the sequences and * reuse the computation of their alignments. The first step is, therefore, to parse the * sequences into LZ78-like factors. LZ78 is a popular compression algorithm of the * Lempel-Ziv familiy due to J.Ziv and A.Lempel (1978). This factorisation is accomplished * by the {@linkplain FactorSequence} class (for more information about this * factorisation, please refer to the specification of that class) that builds a * doubly-linked list of factors. Each factor is an instance of the {@linkplain Factor} * class (refer to the specification of this class for more information).

* *

Once the sequences have been parsed, the algorithm builds a matrix of blocks, called * block table, that is vaguely similar to the dynamic programming matrix used by both * NeedlemanWunsch and SmithWaterman. Each block contains an * instance of an {@linkplain AlignmentBlock} (please refer to its specification for more * information on what information is stored) and represents the alignment beteween one * factor of each sequence. This block table is, in fact, a partition of the alignment * graph.

* *

Consider a block B which corresponds to the alignment of factor F1 = xa from * sequence S1 and factor F2 = yb from sequence S2. Here, F1 extends a previous factor of * S1 with character a, while F2 extends a previous factor of S2 with character b. We can * define the input border of B as the set of values at the left and top borders of block * B, and the output border as the set of values at the right and bottom borders of B. * Moreover, we can define the following prefix blocks of B:

* *
    *
  • Left prefix - is the block that contains the alignment of factor F1 with * a factor F2' = y (a prefix of factor F2). *
  • Diagonal prefix - is the block that contains the alignment of factor F1' = x * (a prefix of factor F1) with a factor F2' = y (a prefix of factor F2). *
  • Top prefix - is the block that contains the alignment of factor F1' = x (a * prefix of factor F1) with factor F2. *
* *

Note that each factor has a pointer to its prefix factor, called ancestor (see the * specification of the Factor class). This pointer makes it easy to retrieve * any of the three prefix blocks of B in constant time.

* *

Rather than computing each value of the alignment block B, the algorithm will only * compute the values on its input and output borders. This is precisely what makes it * more efficient.

* *

In this class there is a general specification of how the block table is computed * (see the {@link #computeBlockTable computeBlockTable} method for details). The actual * method depends on the subclasses. In general, there are two phases:

* *
    *
  • Encoding - the structure of a block B is studied and represented in an * efficient way by computing weights of optimal paths connecting each entry of its input * border to each entry of its output border. This information is encoded in a DIST matrix * where DIST[i,j] stores the weight of an optimal paths connecting entry i of the input * border to entry j the output border of B. *
  • Propagation - the input border of a block B is retrieved (from the left and * top blocks of B) and its output border is computed with the help of the DIST matrix. *
* *

In fact, for each block, only one column of the DIST matrix needs to be computed, * all other columns are actually retrieved from its prefix blocks. This is precisely what * is accomplished by the {@link #assembleDistMatrix assembleDistMatrix} method found in * this class (it is general enough for both global and local alignment versions of the * algorithm.

* *

From the DIST matrix, we obtain the OUT matrix defined as * OUT[i,j] = I[i] + DIST[i,j] where I is the input border array. This means * that the OUT matrix is the DIST matrix updated by the input border of a block. The * output border is then computed from the OUT matrix by taking the maximum value of each * column. This class also have a general method for assembling the input border (see * {@link #assembleInputBorder assembleInputBorder}

* *

The OUT matrix is encoded by the {@linkplain OutMatrix} class that takes as * both a DIST matrix and an input border array. Note that it does not compute the OUT * matrix, it just stores the necessary information to retrieve a value at any * position of the matrix.

* *

A naive approach to compute the output border of a block from the OUT matrix of size * n x n would take a time proportional to n2. However, it happens that, due to * the nature of the DIST matrix, both DIST and OUT matrices are Monge arrays, which * implies that they are also totally monotone. This property allows the * computation of the output border of B in linear time with the SMAWK algorithm (see the * specification of the {@linkplain Smawk} class for more information on SMAWK).

* *

This class contains a general specification that is pertinent to both global and * local versions of the algorithm. For more information on each version of, please refer * to the appropriate subclass.

* *

A note about the performance of these algorithms. Although theoretical * results suggest that these algorithms are faster and consume less memory than the * classical methods, in practice it is hard to realise their performance gains. * *

These algorithms are extremely complex and require the storage of many extra * pointers and other auxiliary data for each block (see the AlignmentBlock * class for more details). Hence, even though the space requirement is * O(n2/log n), which is less than O(n2), in practice, for most of * the cases these algorithms will take more time and memory space than their clasical * counterparts (we have to keep in mind that the Big Oh notation ignores all constants * involved).

* *

Therefore, in order to realise the full power of these algorithms, they have to be * used with extremly large and redundant sequences. This will allow a proper * reutilisation of the computations and, maybe, provide an improvement in terms of space * and run time. For instance, it is easy to devise such a sequence if we use a * one-character alphabet because, in this case, a sequence is factorised into a series * of factors that are a prefix of the next.

* * @author Sergio A. de Carvalho Jr. * @see CrochemoreLandauZivUkelsonGlobalAlignment * @see CrochemoreLandauZivUkelsonLocalAlignment * @see NeedlemanWunsch * @see SmithWaterman * @see FactorSequence * @see AlignmentBlock * @see OutMatrix * @see Smawk * @see #computeBlockTable * @see #assembleDistMatrix */ public abstract class CrochemoreLandauZivUkelson extends PairwiseAlignmentAlgorithm { /** * A constant that indicates that the source of an optimal path has been reached in a * block and that the trace back procedure to retrieve a high scoring alignment can * stop. */ protected static final byte STOP_DIRECTION = 0; /** * A constant that indicates that the left direction must be followed to reach the * source of an optimal path in a block during the trace back procedure to retrieve a * high scoring alignment. */ protected static final byte LEFT_DIRECTION = 1; /** * A constant that indicates that the diagonal direction must be followed to reach the * source of an optimal path in a block during the trace back procedure to retrieve a * high scoring alignment. */ protected static final byte DIAGONAL_DIRECTION = 2; /** * A constant that indicates that the top direction must be followed to reach the * source of an optimal path in a block during the trace back procedure to retrieve a * high scoring alignment. */ protected static final byte TOP_DIRECTION = 3; /** * The first factorised sequence being aligned. */ protected FactorSequence seq1; /** * The second factorised sequence being aligned. */ protected FactorSequence seq2; /** * The block table, which is a matrix of alignment blocks where each block represents * the alignment between one factor of each sequence. */ protected AlignmentBlock[][] block_table; /** * Number of rows of the block table. It is determined by the number of factors of the * first sequence. */ protected int num_rows; /** * Number of columns of the block table. It is determined by the number of factors of * the second sequence. */ protected int num_cols; /** * An instance of the Smawk class that implements the SMAWK algorithm to * compute the column maxima of a totally monotone matrix. It is used to speed up the * computation of the output border of a block. */ protected Smawk smawk = new Smawk(); /** * An instance of the OutMatrix class that encodes the OUT matrix of a * block when supplied with the DIST matrix and the input border array of a block. * Note that it does not compute the OUT matrix itselft, it just stores the necessary * information to retrieve a value at any position of the matrix. * *

This object is then used to compute the output border of a block with the * Smawk class. Note that the OutMatrix class implements the * Matrix interface as required by the Smawk class. * * @see Matrix * @see Smawk */ protected OutMatrix out_matrix = new OutMatrix (); /** * Loads sequences into FactorSequence instances. In case of any error, * an exception is raised by the constructor of FactorSequence (please * check the specification of that class for specific requirements). * *

A FactorSequence is an LZ78-like factorisation of the sequences * being aligned. * * @param input1 Input for first sequence * @param input2 Input for second sequence * @throws IOException If an I/O error occurs when reading the sequences * @throws InvalidSequenceException If the sequences are not valid * @see FactorSequence */ protected void loadSequencesInternal (Reader input1, Reader input2) throws IOException, InvalidSequenceException { // load sequences into instances of CharSequence this.seq1 = new FactorSequence(input1); this.seq2 = new FactorSequence(input2); // determine the block table's dimensions this.num_rows = seq1.numFactors(); this.num_cols = seq2.numFactors(); } /** * Frees pointers to loaded sequences and the the block table so that their data can * be garbage collected. */ protected void unloadSequencesInternal () { this.seq1 = null; this.seq2 = null; this.block_table = null; } /** * Computes the block table (the result depends on subclasses, see * computeBlockTable for details) and requests subclasses to retrieve an * optimal alignment between the loaded sequences. The actual product depends on the * subclasses which can produce a global (see * CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment (see * CrochemoreLandauZivUkelsonLocalAlignment). * *

Subclasses are required to implement the buildOptimalAlignment * abstract method defined by this class according to their own method.

* * @return an optimal alignment between the loaded sequences * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. * @see CrochemoreLandauZivUkelsonGlobalAlignment * @see CrochemoreLandauZivUkelsonLocalAlignment * @see #computeBlockTable * @see #buildOptimalAlignment */ protected PairwiseAlignment computePairwiseAlignment () throws IncompatibleScoringSchemeException { // compute block table computeBlockTable (); // build and return an optimal global alignment PairwiseAlignment alignment = buildOptimalAlignment (); // allow the block table to be garbage collected block_table = null; return alignment; } /** * Computes the block table (the result depends on subclasses, see * computeBlockTable for details) and requests subclasses to locate the * score of the highest scoring alignment between the two sequences in the block * table. The result depends on the subclasses, and either a global alignment * (see CrochemoreLandauZivUkelsonGlobalAlignment) or local alignment * score (see CrochemoreLandauZivUkelsonLocalAlignment) will be produced. * *

Subclasses are required to implement the locateScore abstract * method defined by this class according to their own method.

* *

Note that this method calculates the similarity value only (it doesn't trace * back into the block table to retrieve the alignment itself).

* * @return the score of the highest scoring alignment between the loaded sequences * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. * @see CrochemoreLandauZivUkelsonGlobalAlignment * @see CrochemoreLandauZivUkelsonLocalAlignment * @see #locateScore */ protected int computeScore () throws IncompatibleScoringSchemeException { // compute block table computeBlockTable (); // get score int score = locateScore (); // allow the block table to be garbage collected block_table = null; return score; } /** * Computes the block table. This method is a general specification of how the block * table should be computed. It creates the block table according to the number of * factors of each sequence. It then goes over each position of the block table, * retrieves the corresponding factors from each sequence, and repasses the * information to the subclasses that will do the actual computation of each block * using the scoring scheme previously set. * *

There are four different cases that defines four abstract methods in this class, * which subclasses must implement:

* *
    *
  • createRootBlock - creates and computes block at row 0, column 0; *
  • createFirstColumnBlock - creates and computes blocks at column 0 which * corresponds to alignment blocks between factors of sequence 1 and an empty string; *
  • createFirstRowBlock - creates and computes blocks at row 0 which * corresponds to alignment blocks between factors of sequence 2 and an empty string; *
  • createBlock - creates and computes blocks at row > 0 and column > 0 * which corresponds to alignment blocks between one factor of sequence 1 and one * factor of sequence 2; *
* *

Note that each factor has a serial number which indicates its order in the list * of factors of a sequence. This number will match with the row and column index of * a block in the block table. For instance, if a block has factors F1 and F2 with * serial numbers 12 and 53, respectively, this means that this block is found at row * 12, column 53 of the block table.

* * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. * @see #createRootBlock * @see #createFirstColumnBlock * @see #createFirstRowBlock * @see #createBlock */ protected void computeBlockTable () throws IncompatibleScoringSchemeException { Factor factor1, factor2; int r, c, max_length; // create block table block_table = new AlignmentBlock[num_rows][num_cols]; // find the length of the longest sequence (number of characters) max_length = Math.max(seq1.numChars(), seq2.numChars()); // prepares the OUT matrix object out_matrix.init (max_length, scoring.maxAbsoluteScore()); // start at the root of each trie factor1 = seq1.getRootFactor(); factor2 = seq2.getRootFactor(); // check if roots' indexes are both zero if (factor1.getSerialNumber() != 0 || factor2.getSerialNumber() != 0) throw new IndexOutOfBoundsException ("Unexpected factor index."); // initiate first cell of block table block_table[0][0] = createRootBlock (factor1, factor2); // compute first row for (c = 1; c < num_cols; c++) { factor2 = factor2.getNext(); // check if factor2's index equals its column number (except for // the last factor that can be a repetition of a previous one) if (c < num_cols - 1 && factor2.getSerialNumber() != c) throw new IndexOutOfBoundsException ("Unexpected factor index."); block_table[0][c] = createFirstRowBlock (factor1, factor2, c); } for (r = 1; r < num_rows; r++) { factor1 = factor1.getNext(); // check if factor1's index equals its row number (except for // the last factor that can be a repetition of a previous one) if (r < num_rows - 1 && factor1.getSerialNumber() != r) throw new IndexOutOfBoundsException ("Unexpected factor index."); // go back to the root of sequence 2 factor2 = seq2.getRootFactor(); if (factor2.getSerialNumber() != 0) throw new IndexOutOfBoundsException ("Unexpected factor index."); // compute first column of current row block_table[r][0] = createFirstColumnBlock (factor1, factor2, r); for (c = 1; c < num_cols; c++) { factor2 = factor2.getNext(); // check if factor2's index equals its column number (except for // the last factor that can be a repetition of a previous one) if (c < num_cols - 1 && factor2.getSerialNumber() != c) throw new IndexOutOfBoundsException ("Unexpected factor index."); // compute row r, col c block_table[r][c] = createBlock (factor1, factor2, r, c); } } } /** * Assembles the DIST matrix of a block by retrieving the DIST columns of its prefix * blocks. In fact, it also stores pointers to the owner block for each column * retrieved. These pointers are later used during the trace back procedure that * builds an optimal alignment from the information computed in the block table. This * method is general enough to suit both global and local alignment versions of the * algorithm. * * @param block the block for which the DIST matrix is needed * @param dim the dimension of the DIST matrix * @param r the row index of this block in the block table * @param c the column index of this block in the block table * @param lc the number of columns of the alignment block * @return the DIST matrix */ protected int[][] assembleDistMatrix (AlignmentBlock block, int dim, int r, int c, int lc) { AlignmentBlock ancestor; Factor parent; int[][] dist; int i; dist = new int[dim][]; // columns to the left of lc parent = block.factor2.getAncestor(); for (i = lc - 1; i >= 0; i--) { ancestor = block_table[r][parent.getSerialNumber()]; block.ancestor[i] = ancestor; dist[i] = ancestor.dist_column; parent = parent.getAncestor(); } // column lc dist[lc] = block.dist_column; block.ancestor[lc] = block; // columns to the right of lc parent = block.factor1.getAncestor(); for (i = lc + 1; i < dim; i++) { ancestor = block_table[parent.getSerialNumber()][c]; block.ancestor[i] = ancestor; dist[i] = ancestor.dist_column; parent = parent.getAncestor(); } return dist; } /** * Assembles the input border of a block by retrieving the values at the output * borders of the left and top blocks. This method is general enough to suit both * global and local alignment versions of the algorithm. Note that it can be used to * assemble the input border of any block but the root one (it will cause an * ArrayIndexOutOfBoundsException. * * @param dim dimension of the input border * @param r row index of the block in the block table * @param c column index of the block in the block table * @param lr number of row of the block * @return the block's input border */ protected int[] assembleInputBorder (int dim, int r, int c, int lr) { AlignmentBlock left = null, top = null; int[] input; int i; input = new int [dim]; // set up pointers to the left and top blocks (if applicable) if (c > 0) left = block_table[r][c-1]; if (r > 0) top = block_table[r-1][c]; for (i = 0; i < dim; i++) { if (i < lr) { if (left != null) input[i] = left.output_border[left.factor2.length() + i]; else // there is no block to the left, so set a big negative value // to make sure it will not be used (unfortunately, MIN_VALUE // can overflow to a positive value when substracted by any // number, so we use half of it as a workaround) input[i] = Integer.MIN_VALUE / 2; } else if (i == lr) { if (left != null) input[i] = left.output_border[left.factor2.length() + i]; else // no need to check if top is not null // (because we assume this is not the root block) input[i] = top.output_border[i - lr]; } else { if (top != null) input[i] = top.output_border[i - lr]; else // there is no top block (see note for the left case) input[i] = Integer.MIN_VALUE / 2; } } return input; } /** * Traverses a block to retrieve a part of an optimal alignment from the specified * source in the output border to an entry in the input border. * * @param block the block to be traversed * @param source the source of the path in the output border * @param gapped_seq1 the StringBuffer to where the gapped sequence 1 is written to * @param tag_line the StringBuffer to where the tag_line is written to * @param gapped_seq2 the StringBuffer to where the gapped sequence 2 is written to * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. */ protected void traverseBlock (AlignmentBlock block, int source, StringBuffer gapped_seq1, StringBuffer tag_line, StringBuffer gapped_seq2) throws IncompatibleScoringSchemeException { char char1, char2; while (block.direction[source] != STOP_DIRECTION) { char1 = block.factor1.getNewChar(); char2 = block.factor2.getNewChar(); switch (block.direction[source]) { case LEFT_DIRECTION: gapped_seq1.insert (0, GAP_CHARACTER); tag_line.insert (0, GAP_TAG); gapped_seq2.insert (0, char2); block = getLeftPrefix (block); break; case DIAGONAL_DIRECTION: gapped_seq1.insert (0, char1); if (char1 == char2) if (useMatchTag()) tag_line.insert (0, MATCH_TAG); else tag_line.insert (0, char1); else if (scoreSubstitution(char1, char2) > 0) tag_line.insert (0, APPROXIMATE_MATCH_TAG); else tag_line.insert (0, MISMATCH_TAG); gapped_seq2.insert (0, char2); block = getDiagonalPrefix (block); source --; break; case TOP_DIRECTION: gapped_seq1.insert (0, char1); tag_line.insert (0, GAP_TAG); gapped_seq2.insert (0, GAP_CHARACTER); block = getTopPrefix (block); source --; break; } } } /** * This method is a shorthand to retrieve the left prefix of a block from the block * table. * * @param block the block * @return the block's left prefix */ protected AlignmentBlock getLeftPrefix (AlignmentBlock block) { int prefix_row = block.factor1.getSerialNumber(); int prefix_col = block.factor2.getAncestorSerialNumber(); return block_table[prefix_row][prefix_col]; } /** * This method is a shorthand to retrieve the diagonal prefix of a block from the * block table. * * @param block the block * @return the block's diagonal prefix */ protected AlignmentBlock getDiagonalPrefix (AlignmentBlock block) { int prefix_row = block.factor1.getAncestorSerialNumber(); int prefix_col = block.factor2.getAncestorSerialNumber(); return block_table[prefix_row][prefix_col]; } /** * This method is a shorthand to retrieve the top prefix of a block from the block * table. * * @param block the block * @return the block's top prefix */ protected AlignmentBlock getTopPrefix (AlignmentBlock block) { int prefix_row = block.factor1.getAncestorSerialNumber(); int prefix_col = block.factor2.getSerialNumber(); return block_table[prefix_row][prefix_col]; } /** * Computes the root block of the block table. See subclasses for actual * implementation. * * @param factor1 the factor of the first sequence being aligned * @param factor2 the factor of the second sequence being aligned * @return the root block * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. */ protected abstract AlignmentBlock createRootBlock (Factor factor1, Factor factor2) throws IncompatibleScoringSchemeException; /** * Computes a block at the first row (row zero) of the block table, which corresponds * to an alignment block between one factor of sequence 2 and an empty string. See * subclasses for actual implementation. * * @param factor1 the factor of the first sequence being aligned * @param factor2 the factor of the second sequence being aligned * @param col the column index of block in the block table * @return the computed block * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. */ protected abstract AlignmentBlock createFirstRowBlock (Factor factor1, Factor factor2, int col) throws IncompatibleScoringSchemeException; /** * Computes a block at the first column (column zero) of the block table, which * corresponds to an alignment block between one factor of sequence 1 and an empty * string. See subclasses for actual implementation. * * @param factor1 the factor of the first sequence being aligned * @param factor2 the factor of the second sequence being aligned * @param row the row index of block in the block table * @return the computed block * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. */ protected abstract AlignmentBlock createFirstColumnBlock (Factor factor1, Factor factor2, int row) throws IncompatibleScoringSchemeException; /** * Computes a block of the block table, which corresponds to an alignment block * between one factor of sequence 1 and one factor of sequence 2. See subclasses for * actual implementation. * * @param factor1 the factor of the first sequence being aligned * @param factor2 the factor of the second sequence being aligned * @param row the row index of block in the block table * @param col the column index of block in the block table * @return the computed block * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. */ protected abstract AlignmentBlock createBlock (Factor factor1, Factor factor2, int row, int col) throws IncompatibleScoringSchemeException; /** * Retrieves an optimal alignment between the loaded sequences. See subclasses for * actual implementation. * * @return the computed block * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. */ protected abstract PairwiseAlignment buildOptimalAlignment () throws IncompatibleScoringSchemeException; /** * Locates the score of the highest scoring alignment between the two sequences in the * block table after is thas been computed. See subclasses for actual implementation. * * @return the score of the highest scoring alignment between the loaded sequences * @throws IncompatibleScoringSchemeException If the scoring scheme is not compatible * with the loaded sequences. */ protected abstract int locateScore (); }




© 2015 - 2025 Weber Informatics LLC | Privacy Policy