com.aliasi.spell.WeightedEditDistance

This is the original LingPipe: http://alias-i.com/lingpipe/web/download.html. No changes were made to the source code.

There is a newer version: 4.1.2-JL1.0
/*
 * LingPipe v. 4.1.0
 * Copyright (C) 2003-2011 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 * 
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.spell;

import com.aliasi.util.Distance;
import com.aliasi.util.Proximity;

/**
 * The WeightedEditDistance class implements both the
 * proximity and distance interfaces based on the negative proximity
 * weights assigned to independent atomic edit operations.
 *
 *
 * Weights Scaled as Log Probability
 *
 * Weights on edit operations are scaled as log probabilities.
 * Practically speaking, this means that the larger the weight, the
 * more likely the edit operation; keep in mind that -1 is larger than
 * -3, representing 2^-1 = 1/2 and 2^-3 = 1/8 respectively on a linear
 * probability scale.
 *
 * Proximity and Edit Sequences
 *
 * The log probability of a sequence of independent edits is the
 * sum of the log probabilities of the individual edits.  Proximity
 * between strings s1 and s2 is defined as the maximum sum of edit
 * weights over sequences of edits that convert s1 to s2.
 *
 * Like the individual edit weights, proximity is scaled as
 * a log probability of the complete edit.  The larger the proximity,
 * the closer the strings; again, keep in mind that -10 is larger than
 * -20, representing roughly 1/1000 and 1/1,000,000 on the linear
 * probability scale.
 *
 * Distance is Negative Proximity
 *
 * Distance is just negative proximity.  This scales edit distances
 * in the usual way, with distance of 3 between strings indicating they
 * are further away from each other than strings at distance 1.25.
 *
 * Relation to Simple Edit Distance
 *
 * This class generalizes the behavior of the class
 * spell.EditDistance without extending it in the inheritance
 * sense.  Weighted edit distance agrees with edit distance (up to
 * arithmetic precision) as a distance assuming the following weights:
 * match weight is 0, substitute, insert and delete weights are
 * -1, and the transposition weight is -1 if
 * transpositions are allowed in the edit distance and
 * Double.NEGATIVE_INFINITY otherwise.
 *
 * Symmetry
 *
 * If the substitution and transposition weights are symmetric and
 * the insert and delete costs of a character are equal, then weighted
 * edit distance will be symmetric.
 *
 * Metricity
 *
 * If the match weight of all characters is zero, then the distance
 * between a character sequence and itself will be zero.
 *
 * If transpose weights are negative infinity so that transposition is
 * not allowed, and if the assignment of substitution weights forms a
 * metric (see {@link Distance} for a definition), and if delete and
 * insert costs (negated weights) are non-negative and equal for each
 * character, and if match weights are all zero, then weighted edit
 * distance will form a proper metric.  Other values may also form
 * metrics, such as a weight of -1 for all edits other than transpose.
 *
 * Probabilistic Channel
 *
 * A probabilistic relational model between strings is defined if
 * the weights are properly scaled as log probabilities.  Because
 * probabilities are between 0 and 1, log probabilities will be
 * between negative infinity and zero.  Proximity between two strings
 * in and out is defined by:
 *
 *   proximity(in,out)
 *       = max_{edit(in)=out} log_2 P(edit)
 *
 * where the log probability of the edit is defined to be:
 *
 *   log_2 P(edit)
 *       = log_2 P(edit_0,...,edit_{n-1})
 *       ~ log_2 P(edit_0) + ... + log_2 P(edit_{n-1})
 *
 * The last line is an approximation assuming edits are
 * independent.
 *
 * In order to create a proper probabilistic channel, exponentiated
 * edit weights must sum to 1.0.  This is not technically possible
 * with a local model if transposition is allowed, because of boundary
 * conditions and independence assumptions.
 *
 * It is possible to define a proper channel if transposition is off,
 * and if all edit weights for a position (including all sequences of
 * arbitrarily long insertions) sum to 1.0.  In particular, if any
 * edits at all are allowed (have finite weights), then there must be
 * a non-zero weight assigned to matching; otherwise the sum of
 * exponentiated edit weights would exceed 1.0.  It is always possible
 * to add an offset to normalize the values to a probability model
 * (the offset will be negative if the sum exceeds 1.0, positive if it
 * falls below 1.0, and zero otherwise).
 *
 * A fully probabilistic model would have to take the sum over all
 * edits rather than the maximum.  This class makes the so-called
 * Viterbi approximation, assuming the full probability is close to
 * that of the best probability, or at least proportional to it.
 *
 * @author Bob Carpenter
 * @version 3.0
 * @since LingPipe2.0
 */
public abstract class WeightedEditDistance
    implements Distance<CharSequence>, Proximity<CharSequence> {

    /**
     * Construct a weighted edit distance.
     */
    public WeightedEditDistance() {
        /* do nothing */
    }

    /**
     * Returns the weighted edit distance between the specified
     * character sequences.  If the edit distances are interpreted as
     * entropies, this distance may be interpreted as the entropy of
     * the best edit path converting the input character sequence to
     * the output sequence.  The first argument is taken to be the
     * input and the second argument the output.
     *
     * This method is thread safe and may be accessed concurrently
     * if the abstract weighting methods are thread safe.
     *
     * @param csIn First character sequence.
     * @param csOut Second character sequence.
     * @return The edit distance between the sequences.
     */
    public double distance(CharSequence csIn, CharSequence csOut) {
        return -proximity(csIn,csOut);
    }

    /**
     * Returns the weighted proximity between the specified character
     * sequences.  The first argument is taken to be the input and the
     * second argument the output.
     *
     * This method is thread safe and may be accessed concurrently
     * if the abstract weighting methods are thread safe.
     *
     * @param csIn First character sequence.
     * @param csOut Second character sequence.
     * @return The proximity between the sequences.
     */
    public double proximity(CharSequence csIn, CharSequence csOut) {
        return distance(csIn,csOut,true);
    }

    /**
     * Returns the weighted edit distance between the specified
     * character sequences, ordered according to the specified
     * similarity ordering.  The first argument is taken to
     * be the input and the second argument the output.
     * If the boolean flag for similarity is set to true,
     * the distance is treated as a similarity measure, where
     * larger values are closer; if it is false,
     * smaller values are closer.
     *
     * This method is thread safe and may be accessed concurrently
     * if the abstract weighting methods are thread safe.
     *
     * @param csIn First character sequence.
     * @param csOut Second character sequence.
     * @param isSimilarity Set to true if distances are
     * similarities, false if they are dissimilarities.
     * @return The best total edit weight under the specified ordering.
     */
    double distance(CharSequence csIn, CharSequence csOut,
                    boolean isSimilarity) {

        // can't reverse to make csOut always smallest, because weights
        // may be asymmetric

        if (csOut.length() == 0) {  // all deletes
            double sum = 0.0;
            for (int i = 0; i < csIn.length(); ++i)
                sum += deleteWeight(csIn.charAt(i));
            return sum;
        }
        if (csIn.length() == 0) {  // all inserts
            double sum = 0.0;
            for (int j = 0; j < csOut.length(); ++j)
                sum += insertWeight(csOut.charAt(j));
            return sum;
        }

        int xsLength = csIn.length() + 1;  // >= 2
        int ysLength = csOut.length() + 1; // >= 2

        // x=0: first slice, all inserts
        double[] lastSlice = new double[ysLength];
        lastSlice[0] = 0.0;  // upper left corner of lattice
        for (int y = 1; y < ysLength; ++y)
            lastSlice[y] = lastSlice[y-1] + insertWeight(csOut.charAt(y-1));

        // x=1: second slice, no transpose
        double[] currentSlice = new double[ysLength];
        char cX = csIn.charAt(0);
        // y=0: one input character consumed, no output produced, so
        // the only edit is deleting the first input character
        currentSlice[0] = deleteWeight(cX);
        for (int y = 1; y < ysLength; ++y) {
            int yMinus1 = y-1;
            char cY = csOut.charAt(yMinus1);
            double matchSubstWeight
                = lastSlice[yMinus1]
                + ((cX == cY) ? matchWeight(cX) : substituteWeight(cX,cY));
            double deleteWeight = lastSlice[y] + deleteWeight(cX);
            double insertWeight = currentSlice[yMinus1] + insertWeight(cY);
            currentSlice[y] = best(isSimilarity,
                                   matchSubstWeight,
                                   deleteWeight,
                                   insertWeight);
        }

        // avoid third array allocation if possible
        if (xsLength == 2) return currentSlice[currentSlice.length-1];

        char cYZero = csOut.charAt(0);
        double[] twoLastSlice = new double[ysLength];

        // x>1: transpose after first element
        for (int x = 2; x < xsLength; ++x) {
            char cXMinus1 = cX;
            cX = csIn.charAt(x-1);

            // rotate slices
            double[] tmpSlice = twoLastSlice;
            twoLastSlice = lastSlice;
            lastSlice = currentSlice;
            currentSlice = tmpSlice;

            currentSlice[0] = lastSlice[0] + deleteWeight(cX);

            // y=1: no transpose here
            currentSlice[1]
                = best(isSimilarity,
                       (cX == cYZero)
                       ? (lastSlice[0] + matchWeight(cX))
                       : (lastSlice[0] + substituteWeight(cX,cYZero)),
                       lastSlice[1] + deleteWeight(cX),
                       currentSlice[0] + insertWeight(cYZero));

            // y > 1: transpose
            char cY = cYZero;
            for (int y = 2; y < ysLength; ++y) {
                int yMinus1 = y-1;
                char cYMinus1 = cY;
                cY = csOut.charAt(yMinus1);
                currentSlice[y]
                    = best(isSimilarity,
                           (cX == cY)
                           ? (lastSlice[yMinus1] + matchWeight(cX))
                           : (lastSlice[yMinus1] + substituteWeight(cX,cY)),
                           lastSlice[y] + deleteWeight(cX),
                           currentSlice[yMinus1] + insertWeight(cY));
                if (cX == cYMinus1 && cY == cXMinus1)
                    currentSlice[y]
                        = best(isSimilarity,
                               currentSlice[y],
                               twoLastSlice[y-2]
                               + transposeWeight(cXMinus1,cX));
            }
        }
        return currentSlice[currentSlice.length-1];
    }

    // best of three values under the specified ordering
    private double best(boolean isSimilarity,
                        double x, double y, double z) {
        return best(isSimilarity,x,best(isSimilarity,y,z));
    }

    // maximum for similarities, minimum for dissimilarities
    private double best(boolean isSimilarity, double x, double y) {
        return isSimilarity
            ? Math.max(x,y)
            : Math.min(x,y);
    }

    /**
     * Returns the weight of matching the specified character.  For
     * most weighted edit distances, the match weight is zero so that
     * identical strings are total distance zero apart.
     *
     * All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probability 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cMatched Character matched.
     * @return Weight of matching character.
     */
    public abstract double matchWeight(char cMatched);

    /**
     * Returns the weight of deleting the specified character.
     *
     * All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probability 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cDeleted Character deleted.
     * @return Weight of deleting character.
     */
    public abstract double deleteWeight(char cDeleted);

    /**
     * Returns the weight of inserting the specified character.
     *
     * All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probability 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cInserted Character inserted.
     * @return Weight of inserting character.
     */
    public abstract double insertWeight(char cInserted);

    /**
     * Returns the weight of substituting the inserted character for
     * the deleted character.
     *
     * All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probability 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cDeleted Deleted character.
     * @param cInserted Inserted character.
     * @return The weight of substituting the inserted character for
     * the deleted character.
     */
    public abstract double substituteWeight(char cDeleted, char cInserted);

    /**
     * Returns the weight of transposing the specified characters.  Note
     * that the order of arguments follows that of the input.
     *
     * All weights should be less than or equal to zero, with
     * heavier weights being larger absolute valued negatives.
     * Basically, the weights may be treated as unscaled log
     * probabilities.  Thus valid values will range between 0.0
     * (probability 1) and {@link Double#NEGATIVE_INFINITY}
     * (probability 0).  See the class documentation above for more
     * information.
     *
     * @param cFirst First character in input.
     * @param cSecond Second character in input.
     * @return The weight of transposing the specified characters.
     */
    public abstract double transposeWeight(char cFirst, char cSecond);

}
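
The class documentation states that with match weight 0, substitute, insert and delete weights of -1, and a transposition weight of -1 or negative infinity, weighted edit distance agrees with simple edit distance. The following is a minimal stand-alone sketch of that claim, not LingPipe code: the names `WeightedEditSketch`, `Weights` and `uniform` are hypothetical and introduced only for illustration, and the sketch fills a full lattice rather than the three rotating slices used by the class above.

```java
/** Hypothetical stand-alone sketch; not part of LingPipe. */
final class WeightedEditSketch {

    /** Hypothetical weight table; each value is a log-scale weight <= 0. */
    interface Weights {
        double match(char c);
        double delete(char c);
        double insert(char c);
        double substitute(char from, char to);
        double transpose(char first, char second);
    }

    /** Match weight 0, all other edits -1, transposition optional. */
    static Weights uniform(final boolean allowTranspose) {
        return new Weights() {
            public double match(char c) { return 0.0; }
            public double delete(char c) { return -1.0; }
            public double insert(char c) { return -1.0; }
            public double substitute(char from, char to) { return -1.0; }
            public double transpose(char first, char second) {
                return allowTranspose ? -1.0 : Double.NEGATIVE_INFINITY;
            }
        };
    }

    /** Maximum sum of edit weights over edit sequences turning in into out. */
    static double proximity(CharSequence in, CharSequence out, Weights w) {
        int n = in.length();
        int m = out.length();
        // best[x][y]: best weight converting the first x chars of in
        // into the first y chars of out
        double[][] best = new double[n + 1][m + 1];
        for (int y = 1; y <= m; ++y)
            best[0][y] = best[0][y - 1] + w.insert(out.charAt(y - 1));
        for (int x = 1; x <= n; ++x) {
            char cX = in.charAt(x - 1);
            best[x][0] = best[x - 1][0] + w.delete(cX);
            for (int y = 1; y <= m; ++y) {
                char cY = out.charAt(y - 1);
                double diag = best[x - 1][y - 1]
                    + (cX == cY ? w.match(cX) : w.substitute(cX, cY));
                double del = best[x - 1][y] + w.delete(cX);
                double ins = best[x][y - 1] + w.insert(cY);
                double b = Math.max(diag, Math.max(del, ins));
                // transposition of the last two input characters
                if (x > 1 && y > 1
                    && cX == out.charAt(y - 2) && cY == in.charAt(x - 2)) {
                    b = Math.max(b, best[x - 2][y - 2]
                                    + w.transpose(in.charAt(x - 2), cX));
                }
                best[x][y] = b;
            }
        }
        return best[n][m];
    }

    /** Distance is negative proximity. */
    static double distance(CharSequence in, CharSequence out, Weights w) {
        return -proximity(in, out, w);
    }

    public static void main(String[] args) {
        System.out.println(distance("kitten", "sitting", uniform(false))); // 3.0
        System.out.println(distance("ab", "ba", uniform(false)));          // 2.0
        System.out.println(distance("ab", "ba", uniform(true)));           // 1.0
    }
}
```

With transposition disallowed, swapping two adjacent characters costs two edits; allowing it at weight -1 brings the distance down to one, which matches the behavior described for simple edit distance with and without transposition.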


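
The log-scale arithmetic in the class documentation (weights of -1 and -3 standing for probabilities 1/2 and 1/8, and the weight of a sequence of independent edits being the sum of the individual weights) can be checked with a few lines of plain Java. This is only an illustrative sketch; `LogScaleSketch`, `toProbability` and `sequenceWeight` are hypothetical names, not part of LingPipe.

```java
/** Hypothetical stand-alone sketch; not part of LingPipe. */
final class LogScaleSketch {

    /** Converts a base-2 log weight to a linear-scale probability. */
    static double toProbability(double log2Weight) {
        return Math.pow(2.0, log2Weight);
    }

    /** Sums the log weights of a sequence of independent edits. */
    static double sequenceWeight(double... editWeights) {
        double sum = 0.0;
        for (double w : editWeights)
            sum += w;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(toProbability(-1.0)); // 0.5
        System.out.println(toProbability(-3.0)); // 0.125
        // two independent edits: weights add, probabilities multiply
        double seq = sequenceWeight(-1.0, -3.0); // -4.0
        System.out.println(toProbability(seq));  // 0.0625
    }
}
```

Adding the weights -1 and -3 gives -4, and 2^-4 = 1/16 is the product of 1/2 and 1/8, which is the independence assumption behind summing edit weights.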


