/*
* LingPipe v. 4.1.0
* Copyright (C) 2003-2011 Alias-i
*
* This program is licensed under the Alias-i Royalty Free License
* Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
* Royalty Free License Version 1 for more details.
*
* You should have received a copy of the Alias-i Royalty Free License
* Version 1 along with this program; if not, visit
* http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
* Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
* +1 (718) 290-9170.
*/
package com.aliasi.spell;
import com.aliasi.util.Distance;
import com.aliasi.util.Proximity;
/**
 * The <code>WeightedEditDistance</code> class implements both the
 * proximity and distance interfaces based on the negative proximity
 * weights assigned to independent atomic edit operations.
 *
 * <h3>Weights Scaled as Log Probability</h3>
 *
 * <p>Weights on edit operations are scaled as log probabilities.
 * Practically speaking, this means that the larger the weight, the
 * more likely the edit operation; keep in mind that -1 is larger than
 * -3, representing 2<sup>-1</sup> = 1/2 and 2<sup>-3</sup> = 1/8
 * respectively on a linear probability scale.
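 *
 * <p>For example (an illustrative snippet, not part of this class's
 * API), exponentiating a weight base 2 recovers its linear
 * probability:
 *
 * <pre>{@code
 * double weight = -3.0;
 * double linearProb = Math.pow(2.0, weight); // 0.125 = 1/8
 * }</pre>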
 *
 * <h3>Proximity and Edit Sequences</h3>
 *
 * <p>The log probability of a sequence of independent edits is the
 * sum of the log probabilities of the individual edits. Proximity
 * between strings <code>s1</code> and <code>s2</code> is defined as
 * the maximum sum of edit weights over sequences of edits that
 * convert <code>s1</code> to <code>s2</code>.
 *
 * <p>Like the individual edit weights, proximity is scaled as
 * a log probability of the complete edit. The larger the proximity,
 * the closer the strings; again, keep in mind that -10 is larger than
 * -20, representing roughly 1/1000 and 1/1,000,000 on the linear
 * probability scale.
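 *
 * <p>For instance (an illustrative calculation, not taken from the
 * original documentation), editing "cat" into "cart" takes three
 * matches and one insertion:
 *
 * <pre>{@code
 * // "cat" -> "cart": match 'c', match 'a', insert 'r', match 't'
 * // proximity("cat","cart") = matchWeight('c') + matchWeight('a')
 * //                         + insertWeight('r') + matchWeight('t')
 * // (assuming no other edit sequence has a larger total weight)
 * }</pre>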
 *
 * <h3>Distance is Negative Proximity</h3>
 *
 * <p>Distance is just negative proximity. This scales edit distances
 * in the usual way, with a distance of 3 between strings indicating they
 * are further away from each other than strings at distance 1.25.
 *
 * <h3>Relation to Simple Edit Distance</h3>
 *
 * <p>This class generalizes the behavior of the class
 * {@link EditDistance} without extending it in the inheritance
 * sense. Weighted edit distance agrees with edit distance (up to
 * arithmetic precision) as a distance assuming the following weights:
 * match weight is 0, substitute, insert and delete weights are
 * -1, and the transposition weight is -1 if
 * transpositions are allowed in the edit distance and
 * {@link Double#NEGATIVE_INFINITY} otherwise.
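 *
 * <p>For instance, the following sketch (an illustration, not a class
 * supplied by this package) reproduces simple edit distance with
 * transpositions by fixing the weights as described above:
 *
 * <pre>{@code
 * WeightedEditDistance simpleEd = new WeightedEditDistance() {
 *     public double matchWeight(char c) { return 0.0; }
 *     public double deleteWeight(char c) { return -1.0; }
 *     public double insertWeight(char c) { return -1.0; }
 *     public double substituteWeight(char cDel, char cIns) { return -1.0; }
 *     public double transposeWeight(char c1, char c2) { return -1.0; }
 * };
 * double d = simpleEd.distance("abcd", "abdc"); // 1.0: one transposition
 * }</pre>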
*
 * <h3>Symmetry</h3>
 *
 * <p>If the substitution and transposition weights are symmetric and
 * the insert and delete costs of a character are equal, then weighted
 * edit distance will be symmetric.
*
 * <h3>Metricity</h3>
 *
 * <p>If the match weight of all
 * characters is zero, then the distance between a character sequence
 * and itself will be zero.
 *
 * <p>If transpose weights are negative infinity so that transposition is
 * not allowed, and if the assignment of substitution weights forms a
 * metric (see {@link Distance} for a definition), and if delete and
 * insert weights are non-negative and equal for all characters, and
 * if match weights are all zero, then weighted edit distance will
 * form a proper metric. Other values may also form metrics, such as
 * a weight of -1 for all edits other than transpose.
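 *
 * <p>A quick sanity check of the symmetry and zero-self-distance
 * properties on the unit-cost instance sketched above (illustrative
 * only; the full metric claim additionally requires transposition to
 * be disabled with {@link Double#NEGATIVE_INFINITY}):
 *
 * <pre>{@code
 * // symmetric: equal insert/delete costs, symmetric substitution
 * assert simpleEd.distance("abc", "abd") == simpleEd.distance("abd", "abc");
 * // zero match weights give zero self-distance
 * assert simpleEd.distance("abc", "abc") == 0.0;
 * }</pre>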
 *
 * <h3>Probabilistic Channel</h3>
 *
 * <p>A probabilistic relational model between strings is defined if
 * the weights are properly scaled as log probabilities. Because
 * probabilities are between 0 and 1, log probabilities will be
 * between negative infinity and zero. Proximity between two strings
 * <code>in</code> and <code>out</code> is defined by:
 *
 * <blockquote>
 * proximity(in,out)
 *   = Max<sub>edit(in)=out</sub> log<sub>2</sub> P(edit)
 * </blockquote>
 *
 * where the cost of the edit is defined to be:
 *
 * <blockquote>
 * log<sub>2</sub> P(edit)
 *   = log<sub>2</sub> P(edit<sub>0</sub>,...,edit<sub>n-1</sub>)
 *   ~ log<sub>2</sub> P(edit<sub>0</sub>) + ... + log<sub>2</sub> P(edit<sub>n-1</sub>)
 * </blockquote>
 *
 * <p>The last line is an approximation assuming edits are
 * independent.
 *
 * <p>In order to create a proper probabilistic channel, exponentiated
 * edit weights must sum to 1.0. This is not technically possible
 * with a local model if transposition is allowed, because of boundary
 * conditions and independence assumptions.
 *
 * <p>It is possible to define a proper channel if transposition is off,
 * and if all edit weights for a position (including all sequences of
 * arbitrarily long insertions) sum to 1.0. In particular, if any
 * edits at all are allowed (have finite weights), then there must be
 * a non-zero weight assigned to matching, otherwise the exponentiated
 * edit weight sum would exceed 1.0. It is always possible to add an
 * offset to normalize the values to a probability model (the offset
 * will be negative if the sum exceeds 1.0, positive if it falls
 * below 1.0, and zero otherwise).
 *
 * <p>A fully probabilistic model would have to take the sum over all
 * edits rather than the maximum. This class makes the so-called
 * Viterbi approximation, assuming the full probability is close to
 * that of the best probability, or at least proportional to it.
*
*
* @author Bob Carpenter
* @version 3.0
* @since LingPipe2.0
*/
public abstract class WeightedEditDistance
    implements Distance<CharSequence>,
               Proximity<CharSequence> {
/**
* Construct a weighted edit distance.
*/
public WeightedEditDistance() {
/* do nothing */
}
/**
* Returns the weighted edit distance between the specified
* character sequences. If the edit distances are interpreted as
* entropies, this distance may be interpreted as the entropy of
* the best edit path converting the input character sequence to
* the output sequence. The first argument is taken to be the
* input and the second argument the output.
 *
 * <p>This method is thread safe and may be accessed concurrently
 * if the abstract weighting methods are thread safe.
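 *
 * <p>For example, with a hypothetical instance <code>dist</code> whose
 * match weight is 0 and whose other edit weights are all -1 (a sketch,
 * not an instance supplied by this class):
 *
 * <pre>{@code
 * double d = dist.distance("kitten", "sitting");
 * // d == 3.0: substitute k->s, substitute e->i, insert g
 * }</pre>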
*
* @param csIn First character sequence.
* @param csOut Second character sequence.
* @return The edit distance between the sequences.
*/
public double distance(CharSequence csIn, CharSequence csOut) {
return -proximity(csIn,csOut);
}
/**
* Returns the weighted proximity between the specified character
* sequences. The first argument is taken to be the input and the
* second argument the output.
 *
 * <p>This method is thread safe and may be accessed concurrently
 * if the abstract weighting methods are thread safe.
*
* @param csIn First character sequence.
* @param csOut Second character sequence.
 * @return The proximity between the sequences.
*/
public double proximity(CharSequence csIn, CharSequence csOut) {
return distance(csIn,csOut,true);
}
/**
 * Returns the weighted edit distance between the specified
 * character sequences, ordered according to the specified
 * similarity ordering. The first argument is taken to
 * be the input and the second argument the output.
 * If the boolean flag for similarity is set to <code>true</code>,
 * the distance is treated as a similarity measure, where
 * larger values are closer; if it is <code>false</code>,
 * smaller values are closer.
 *
 * <p>This method is thread safe and may be accessed concurrently
 * if the abstract weighting methods are thread safe.
*
* @param csIn First character sequence.
* @param csOut Second character sequence.
 * @param isSimilarity Set to <code>true</code> if distances are
 * similarities, <code>false</code> if they are dissimilarities.
 * @return The weighted distance or similarity between the sequences.
*/
double distance(CharSequence csIn, CharSequence csOut,
boolean isSimilarity) {
// can't reverse to make csOut always smallest, because weights
// may be asymmetric
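        // Dynamic programming over the edit lattice: cell (x,y) holds the
        // best (max for similarity, min for distance) total weight of editing
        // the length-x prefix of csIn into the length-y prefix of csOut.
        // Only three slices (x-2, x-1, x) are kept, so space is O(|csOut|)
        // while still supporting transposition.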
if (csOut.length() == 0) { // all deletes
double sum = 0.0;
for (int i = 0; i < csIn.length(); ++i)
sum += deleteWeight(csIn.charAt(i));
return sum;
}
if (csIn.length() == 0) { // all inserts
double sum = 0.0;
for (int j = 0; j < csOut.length(); ++j)
sum += insertWeight(csOut.charAt(j));
return sum;
}
int xsLength = csIn.length() + 1; // >= 2
int ysLength = csOut.length() + 1; // >= 2
// x=0: first slice, all inserts
double lastSlice[] = new double[ysLength];
lastSlice[0] = 0.0; // upper left corner of lattice
for (int y = 1; y < ysLength; ++y)
lastSlice[y] = lastSlice[y-1] + insertWeight(csOut.charAt(y-1));
// x=1: second slice, no transpose
double[] currentSlice = new double[ysLength];
        currentSlice[0] = deleteWeight(csIn.charAt(0));
char cX = csIn.charAt(0);
for (int y = 1; y < ysLength; ++y) {
int yMinus1 = y-1;
char cY = csOut.charAt(yMinus1);
double matchSubstWeight
= lastSlice[yMinus1]
+ ((cX == cY) ? matchWeight(cX) : substituteWeight(cX,cY));
double deleteWeight = lastSlice[y] + deleteWeight(cX);
double insertWeight = currentSlice[yMinus1] + insertWeight(cY);
currentSlice[y] = best(isSimilarity,
matchSubstWeight,
deleteWeight,
insertWeight);
}
// avoid third array allocation if possible
if (xsLength == 2) return currentSlice[currentSlice.length-1];
char cYZero = csOut.charAt(0);
double[] twoLastSlice = new double[ysLength];
// x>1:transpose after first element
for (int x = 2; x < xsLength; ++x) {
char cXMinus1 = cX;
cX = csIn.charAt(x-1);
// rotate slices
double[] tmpSlice = twoLastSlice;
twoLastSlice = lastSlice;
lastSlice = currentSlice;
currentSlice = tmpSlice;
currentSlice[0] = lastSlice[0] + deleteWeight(cX);
// y=1: no transpose here
currentSlice[1] = best(isSimilarity,
(cX == cYZero)
? (lastSlice[0] + matchWeight(cX))
: (lastSlice[0] + substituteWeight(cX,cYZero)),
lastSlice[1] + deleteWeight(cX),
currentSlice[0] + insertWeight(cYZero));
// y > 1: transpose
char cY = cYZero;
for (int y = 2; y < ysLength; ++y) {
int yMinus1 = y-1;
char cYMinus1 = cY;
cY = csOut.charAt(yMinus1);
currentSlice[y] = best(isSimilarity,
(cX == cY)
? (lastSlice[yMinus1] + matchWeight(cX))
: (lastSlice[yMinus1] + substituteWeight(cX,cY)),
lastSlice[y] + deleteWeight(cX),
currentSlice[yMinus1] + insertWeight(cY));
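            // Transposition: the last two input characters equal the last two
            // output characters in swapped order, so also consider the path
            // from two slices back plus the transpose weight.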
if (cX == cYMinus1 && cY == cXMinus1)
currentSlice[y] = best(isSimilarity,
currentSlice[y],
twoLastSlice[y-2] + transposeWeight(cXMinus1,cX));
}
}
return currentSlice[currentSlice.length-1];
}
private double best(boolean isSimilarity, double x, double y, double z) {
return best(isSimilarity,x,best(isSimilarity,y,z));
}
private double best(boolean isSimilarity, double x, double y) {
return isSimilarity
? Math.max(x,y)
: Math.min(x,y);
}
/**
* Returns the weight of matching the specified character. For
* most weighted edit distances, the match weight is zero so that
* identical strings are total distance zero apart.
 *
 * <p>All weights should be less than or equal to zero, with
 * heavier weights being negative numbers of larger absolute value.
 * Basically, the weights may be treated as unscaled log
 * probabilities. Thus valid values will range between 0.0
 * (probability 1) and {@link Double#NEGATIVE_INFINITY}
 * (probability 0). See the class documentation above for more
 * information.
*
* @param cMatched Character matched.
* @return Weight of matching character.
*/
public abstract double matchWeight(char cMatched);
/**
* Returns the weight of deleting the specified character.
 *
 * <p>All weights should be less than or equal to zero, with
 * heavier weights being negative numbers of larger absolute value.
 * Basically, the weights may be treated as unscaled log
 * probabilities. Thus valid values will range between 0.0
 * (probability 1) and {@link Double#NEGATIVE_INFINITY}
 * (probability 0). See the class documentation above for more
 * information.
*
* @param cDeleted Character deleted.
* @return Weight of deleting character.
*/
public abstract double deleteWeight(char cDeleted);
/**
* Returns the weight of inserting the specified character.
 *
 * <p>All weights should be less than or equal to zero, with
 * heavier weights being negative numbers of larger absolute value.
 * Basically, the weights may be treated as unscaled log
 * probabilities. Thus valid values will range between 0.0
 * (probability 1) and {@link Double#NEGATIVE_INFINITY}
 * (probability 0). See the class documentation above for more
 * information.
*
* @param cInserted Character inserted.
* @return Weight of inserting character.
*/
public abstract double insertWeight(char cInserted);
/**
* Returns the weight of substituting the inserted character for
* the deleted character.
 *
 * <p>All weights should be less than or equal to zero, with
 * heavier weights being negative numbers of larger absolute value.
 * Basically, the weights may be treated as unscaled log
 * probabilities. Thus valid values will range between 0.0
 * (probability 1) and {@link Double#NEGATIVE_INFINITY}
 * (probability 0). See the class documentation above for more
 * information.
*
* @param cDeleted Deleted character.
* @param cInserted Inserted character.
* @return The weight of substituting the inserted character for
* the deleted character.
*/
public abstract double substituteWeight(char cDeleted, char cInserted);
/**
* Returns the weight of transposing the specified characters. Note
* that the order of arguments follows that of the input.
 *
 * <p>All weights should be less than or equal to zero, with
 * heavier weights being negative numbers of larger absolute value.
 * Basically, the weights may be treated as unscaled log
 * probabilities. Thus valid values will range between 0.0
 * (probability 1) and {@link Double#NEGATIVE_INFINITY}
 * (probability 0). See the class documentation above for more
 * information.
*
* @param cFirst First character in input.
* @param cSecond Second character in input.
* @return The weight of transposing the specified characters.
*/
public abstract double transposeWeight(char cFirst, char cSecond);
}