com.aliasi.spell.WeightedEditDistance Maven / Gradle / Ivy

Go to download
Show more of this group Show more artifacts with this name
Show all versions of aliasi-lingpipe Show documentation
This is the original Lingpipe: http://alias-i.com/lingpipe/web/download.html There were not made any changes to the source code.
There is a newer version: 4.1.2-JL1.0
Show newest version
/*
 * LingPipe v. 4.1.0
 * Copyright (C) 2003-2011 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 * 
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.spell;

import com.aliasi.util.Distance;
import com.aliasi.util.Proximity;

/**
 * The WeightedEditDistance class implements both the
 * proximity and distance interfaces based on the negative proximity
 * weights assigned to independent atomic edit operations.
 *
 * Weights Scaled as Log Probability
 *
 * Weights on edit operations are scaled as log probabilities.  
 * Practically speaking, this means that the larger the weight, the
 * more likely the edit operation; keep in mind that -1 is larger than
 * -3, representing 2^-1 = 1/2 and
 * 2^-3 = 1/8 respectively on a linear
 * probability scale.
 *
 * 
Proximity and Edit Sequences
 
 * The log probability of a sequence of independent edits is the
 * sum of the log probabilities of the individual edits.  Proximity
 * between strings s1 and s2 is defined as
 * the maximum sum of edit weights over sequences of edits that
 * convert s1 to s2.
 *
 * 
Like the individual edit weights, proximity is scaled as
 * a log probability of the complete edit.  The larger the proximity,
 * the closer the strings; again, keep in mind that -10 is larger than
 * -20, representing roughly 1/1000 and 1/1,000,000 on the linear
 * probability scale.
 *
 * 
Distance is Negative Proximity
 *
 * Distance is just negative proximity.  This scales edit distances
 * in the usual way, with distance of 3 between strings indicating they
 * are further away from each other than strings at distance 1.25.
 *
 * 
Relation to Simple Edit Distance
 * 
 * This class generalizes the behavior of the class
 * spell.EditDistance without extending it in the inheritance
 * sense.  Weighted edit distance agrees with edit distance (up to
 * arithmetic precision) as a distance assuming the following weights:
 * match weight is 0, substitute, insert and delete weights are
 * -1, and the transposition weight is -1 if
 * transpositions are allowed in the edit distance and
 * Double.NEGATIVE_INFINITY otherwise.
 *
 * 
Symmetry
 * 
 * If the substitution and transposition weights are symmetric and
 * the insert and delete costs of a character are equal, then weighted
 * edit distance will be symmetric.  
 * 
 * 
Metricity
 *
 * If the match weight of all
 * characters is zero, then the distance between a character sequence
 * and itself will be zero.  

 * 
If transpose weights are negative infinity so that transposition is
 * not allowed, and if the assignment of substitution weights forms a
 * metric (see {@link Distance} for a definition), and if delete and
 * insert weights are non-negative and equal for all characters, and
 * if match weights are all zero, then weighted edit distance will
 * form a proper metric.  Other values may also form metrics, such as
 * a weight of -1 for all edits other than transpose.
 *
 *
 * 
Probabilistic Channel
 *
 * A probabilistic relational model between strings is defined if
 * the weights are properly scaled as log probabilities.  Because
 * probabilities are between 0 and 1, log probabilities will be
 * between negative infinity and zero.  Proximity between two strings
 * in and out is defined by:
 *
 * 
 * proximity(in,out)
 * = Max_{_edit(in)=out} log2 P(edit)
 * 
 *
 * where the cost of the edit is defined to be:
 *
 * 
 * log2 P(edit) 
 * 
 = log2 P(edit₀,...,edit_n-1)
 * 
 ~ log2 P(edit₀) + ... + log P(edit_n-1)
 * 
 *
 * The last line is an approximation assuming edits are
 * independent.
 * 
 * In order to create a proper probabilistic channel, exponentiated
 * edit weights must sum to 1.0.  This is not technically possible
 * with a local model if transposition is allowed, because of boundary
 * conditions and independence assumptions.
 * 
 * It is possible to define a proper channel if transposition is off,
 * and if all edit weights for a position (including all sequences of
 * arbitrarily long insertions) sum to 1.0.  In particular, if any
 * edits at all are allowed (have finite weights), then there must be
 * a non-zero weight assigned to matching, otherwise exponentiated
 * edit weight sum would exceed 1.0.  It is always possible to add an
 * offset to normalize the values to a probability model (the offset
 * will be negative if the sum exceeds 1.0 and positive if it falls
 * below 1.0 and zero otherwise).

 * 
A fully probabilistic model would have to take the sum over all
 * edits rather than the maximum.  This class makes the so-called
 * Viterbi approximation, assuming the full probability is close to
 * that of the best probability, or at least proportional to it.
 * 
 * 
 * @author  Bob Carpenter
 * @version 3.0
 * @since   LingPipe2.0
 */
public abstract class WeightedEditDistance 
    implements Distance,
               Proximity {

    /**
     * Construct a weighted edit distance.
     */
    public WeightedEditDistance() {
        /* do nothing */
    }

    /**
     * Returns the weighted edit distance between the specified
     * character sequences.  If the edit distances are interpreted as
     * entropies, this distance may be interpreted as the entropy of
     * the best edit path converting the input character sequence to
     * the output sequence.  The first argument is taken to be the
     * input and the second argument the output.
     *
     * 
This method is thread
     * safe and may be accessed concurrently if the abstract weighting
     * methods are thread safe.
     *
     * @param csIn First character sequence.
     * @param csOut Second character sequence.
     * @return The edit distance between the sequences.
     */
    public double distance(CharSequence csIn, CharSequence csOut) {
        return -proximity(csIn,csOut);
    }

    /**
     * Returns the weighted proximity between the specified character
     * sequences. The first argument is taken to be the input and the
     * second argument the output.
     *
     * 
This method is thread safe and may be accessed concurrently
     * if the abstract weighting methods are thread safe.
     *
     * @param csIn First character sequence.
     * @param csOut Second character sequence.
     * @return The edit distance between the sequences.
     */
    public double proximity(CharSequence csIn, CharSequence csOut) {
        return distance(csIn,csOut,true);
    }

    /**
     * Returns the weighted edit distance between the specified
     * character sequences ordering according to the specified
     * similarity ordering.  The first argument is taken to
     * be the input and the second argument the output. 
     * If the boolean flag for similarity is set to true,
     * the distance is treated as a similarity measure, where
     * larger values are closer; if it is false, 
     * smaller values are closer.
     *
     * 
This method is thread safe and may be accessed concurrently
     * if the abstract weighting methods are thread safe.
     *
     * @param csIn First character sequence.
     * @param csOut Second character sequence.
     * @param isSimilarity Set to true if distances are
     * similarities, false if they are dissimilarities.
     */
    double distance(CharSequence csIn, CharSequence csOut,
                    boolean isSimilarity) {

        // can't reverse to make csOut always smallest, because weights
        // may be asymmetric

        if (csOut.length() == 0) {  // all deletes
            double sum = 0.0;
            for (int i = 0; i < csIn.length(); ++i)
                sum += deleteWeight(csIn.charAt(i));
            return sum;
        }
        if (csIn.length() == 0) { // all inserts
            double sum = 0.0;
            for (int j = 0; j < csOut.length(); ++j)
                sum += insertWeight(csOut.charAt(j));
            return sum;
        }
    
        int xsLength = csIn.length() + 1;  // >= 2
        int ysLength = csOut.length() + 1; // >= 2

        // x=0: first slice, all inserts
        double lastSlice[] = new double[ysLength];
        lastSlice[0] = 0.0;  // upper left corner of lattice
        for (int y = 1; y < ysLength; ++y)
            lastSlice[y] = lastSlice[y-1] + insertWeight(csOut.charAt(y-1));

        // x=1: second slice, no transpose
        double[] currentSlice = new double[ysLength];
        currentSlice[0] = insertWeight(csOut.charAt(0));
        char cX = csIn.charAt(0);
        for (int y = 1; y < ysLength; ++y) {
            int yMinus1 = y-1;
            char cY = csOut.charAt(yMinus1);
            double matchSubstWeight 
                = lastSlice[yMinus1]
                +  ((cX == cY) ? matchWeight(cX) : substituteWeight(cX,cY));
            double deleteWeight = lastSlice[y] + deleteWeight(cX);
            double insertWeight = currentSlice[yMinus1] + insertWeight(cY);
            currentSlice[y] = best(isSimilarity,
                                   matchSubstWeight,
                                   deleteWeight,
                                   insertWeight);
        }
    
        // avoid third array allocation if possible
        if (xsLength == 2) return currentSlice[currentSlice.length-1];

        char cYZero = csOut.charAt(0);
        double[] twoLastSlice = new double[ysLength];

        // x>1:transpose after first element
        for (int x = 2; x < xsLength; ++x) {
            char cXMinus1 = cX;
            cX = csIn.charAt(x-1);

            // rotate slices
            double[] tmpSlice = twoLastSlice;
            twoLastSlice = lastSlice;
            lastSlice = currentSlice;
            currentSlice = tmpSlice;

            currentSlice[0] = lastSlice[0] + deleteWeight(cX); 

            // y=1: no transpose here
            currentSlice[1] = best(isSimilarity,
                                   (cX == cYZero)
                                   ? (lastSlice[0] + matchWeight(cX))
                                   : (lastSlice[0] + substituteWeight(cX,cYZero)),
                                   lastSlice[1] + deleteWeight(cX),
                                   currentSlice[0] + insertWeight(cYZero));
        
            // y > 1: transpose
            char cY = cYZero;
            for (int y = 2; y < ysLength; ++y) {
                int yMinus1 = y-1;
                char cYMinus1 = cY;
                cY = csOut.charAt(yMinus1);
                currentSlice[y] = best(isSimilarity,
                                       (cX == cY)
                                       ? (lastSlice[yMinus1] + matchWeight(cX))
                                       : (lastSlice[yMinus1] + substituteWeight(cX,cY)),
                                       lastSlice[y] + deleteWeight(cX),
                                       currentSlice[yMinus1] + insertWeight(cY));
                if (cX == cYMinus1 && cY == cXMinus1)
                    currentSlice[y] = best(isSimilarity,
                                           currentSlice[y],
                                           twoLastSlice[y-2] + transposeWeight(cXMinus1,cX));
            }
        }
        return currentSlice[currentSlice.length-1];
    }

    private double best(boolean isSimilarity, double x, double y, double z) {
        return best(isSimilarity,x,best(isSimilarity,y,z));
    }

    private double best(boolean isSimilarity, double x, double y) {
        return isSimilarity
            ? Math.max(x,y)
            : Math.min(x,y);
    }

    /**
     * Returns the weight of matching the specified character.  For
     * most weighted edit distances, the match weight is zero so that
     * identical strings are total distance zero apart.
     *
     * 
All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probablity 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cMatched Character matched.
     * @return Weight of matching character.
     */
    public abstract double matchWeight(char cMatched);

    /**
     * Returns the weight of deleting the specified character.
     *
     * 
All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probablity 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cDeleted Character deleted.
     * @return Weight of deleting character.
     */
    public abstract double deleteWeight(char cDeleted);

    /**
     * Returns the weight of inserting the specified character.
     *
     * 
All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probablity 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cInserted Character inserted.
     * @return Weight of inserting character.
     */
    public abstract double insertWeight(char cInserted);

    /**
     * Returns the weight of substituting the inserted character for
     * the deleted character.
     *
     * 
All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probablity 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     * 
     * @param cDeleted Deleted character.
     * @param cInserted Inserted character.
     * @return The weight of substituting the inserted character for
     * the deleted character.
     */
    public abstract double substituteWeight(char cDeleted, char cInserted);

    /**
     * Returns the weight of transposing the specified characters.  Note
     * that the order of arguments follows that of the input.
     *
     * All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probablity 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     * 
     * @param cFirst First character in input.
     * @param cSecond Second character in input.
     * @return The weight of transposing the specified characters.
     */
    public abstract double transposeWeight(char cFirst, char cSecond);


}