com.aliasi.spell.JaroWinklerDistance Maven / Gradle / Ivy
Show all versions of aliasi-lingpipe Show documentation
/*
* LingPipe v. 4.1.0
* Copyright (C) 2003-2011 Alias-i
*
* This program is licensed under the Alias-i Royalty Free License
* Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
* Royalty Free License Version 1 for more details.
*
* You should have received a copy of the Alias-i Royalty Free License
* Version 1 along with this program; if not, visit
* http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
* Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
* +1 (718) 290-9170.
*/
package com.aliasi.spell;
import com.aliasi.util.Distance;
import com.aliasi.util.Proximity;
import java.util.Arrays;
/**
* The JaroWinklerDistance
class implements the original
* Jaro string comparison as well as Winkler's modifications. As a
* distance measure, Jaro-Winkler returns values between
* 0
(exact string match) and 1
(no matching
* characters). Note that this is reversed from the original
* definitions of Jaro and Winkler in order to produce a distance-like
* ordering. The original Jaro-Winkler string comparator returned
* 1
for a perfect match and 0
for
* complete mismatch; our method returns one minus the Jaro-Winkler
* measure.
*
* The Jaro-Winkler distance measure was developed for name comparison
* in the U.S. Census. It is designed to compae surnames to surnames
* and given names to given names, not whole names to whole names.
* There is no character-specific information in this implementation,
* but assumptions are made about typical lengths and the significance
* of initial matches that may not apply to all languages.
*
*
The easiest way to understand the Jaro measure and the Winkler
* variants is procedurally. The Jaro measure involves two steps,
* first to compute the number of "matches" and second to
* compute the number of "transpositions". The Winkler
* adjustment involves a final rescoring based on an exact match score
* for the initial characters of both strings.
*
*
Formal Definition of Jaro-Winkler Distance
*
* Suppose we are comparing character sequences cs1
* and cs2
. The Jaro-Winkler distance is defined by
* the following steps. After the definitions, we consider some
* examples.
*
*
Step 1: Matches: The match phase is a greedy alignment
* step of characters in one string against the characters in another
* string. The maximum distance (measured by array index) at which
* characters may be matched is defined by:
*
*
* matchRange = max(cs1.length(), cs2.length()) / 2 - 1
*
* The match phase is a greedy alignment that proceeds character by
* character through the first string, though the distance metric is
* symmetric (that, is reversing the order of arguments does not
* affect the result). For each character encountered in the first
* string, it is matched to the first unaligned character in the
* second string that is an exact character match. If there is no
* such character within the match range window, the character is left
* unaligned.
*
*
Step 2: Transpositions: After matching, the subsequence
* of characters actually matched in both strings is extracted. These
* subsequences will be the same length. The number of characters in
* one string that do not line up (by index in the matched
* subsequence) with identical characters in the other string is the
* number of "half transpositions". The total number of
* transpoisitons is the number of half transpositions divided by two,
* rounding down.
*
*
The Jaro distance is then defined in terms of the number
* of matching characters matches
and the number
* of transpositions, transposes
:
*
*
* jaroProximity(cs1,cs2)
* = ( matches(cs1,cs2) / cs1.length()
* + matches(cs1,cs2) / cs2.length()
* + (matches(cs1,cs2) - transposes(cs1,cs2)) / matches(cs1,cs2) ) / 3
*
* jaroDistance(cs1,cs2) = 1 - jaroProximity(cs1,cs2)
*
* In words, the measure is the average of three values; (a) the
* percentage of the first string matched, (b) the percentage of the
* second string matched, and (c) the percentage of matches that were
* not transposed.
*
*
Step 3: Winkler Modification The Winkler modification to
* the Jaro comparison, resulting in the Jaro-Winkler comparison,
* boosts scores for strings that match character for character
* initially. Let boostThreshold
be the minimum score
* for a string that gets boosted. This value was set to
* 0.7
in Winkler's papers (see references below). If
* the Jaro score is below the boost threshold, the Jaro score is
* returned unadjusted. The second parameter for the Winkler
* modification is the size of the initial prefix considered,
* prefixSize
. The prefix size was set to 4
* in Winkler's papers. Next, let
* prefixMatch(cs1,cs2,prefixSize)
be the number of
* characters in the prefix of cs1
and cs2
* that exactly match (by original index), up to a maximum of
* prefixSize
. The modified distance is then defined to
* be:
*
*
* jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)
* = jaroMeasure(cs1,cs2) <= boostThreshold
* ? jaroMeasure(cs1,cs2)
* : jaroMeasure(cs1,cs2)
* + 0.1 * prefixMatch(cs1,cs2,prefixSize) * (1.0 - jaroDistance(cs1,cs2))
*
* jaroWinklerDistance(cs1,cs2,boostThreshold,prefixSize)
* = 1 - jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)
*
* Examples: We will present the alignment steps in the form
* of tables, with offsets in the second string below the first string
* positions that match. For a simple example, consider comparing the
* given (nick)name AL
to itself. Both strings are of
* length 2. Thus the maximum match distance is max(2,2)/2 - 1
* = 0
, meaning all matches must be exact. The matches are
* illustrated in the following table:
*
*
* cs1
A L
* matches 0 1
* cs2
A L
*
*
* The notation in the matches row is meant to indicate that the
* A
at index 0
in cs1
is
* matched to the A
at index 0
in
* cs2
. Similarlty for the L
at index 1 in
* cs1
, which matches the L
at index 1 in
* cs2
. This results in matches(AL,AL) = 2
.
* There are no transpositions, so the Jaro distance is just:
*
*
* jaroProximity(AL,AL) = 1/3*(2/2 + 2/2 + (2-0)/2) = 1.0
*
* Applying the Winkler modification yields the same result:
*
*
* jaroWinklerProximity(AL,AL) = 1 + 0.1 * 2 * (1.0 - 1) = 1.0
*
* Next consider a more complex case, matching MARTHA
and
* MARHTA
. Here the match distance is max(6,6)/2 -
* 1 = 6/2 - 1 = 2
, allowing matching characters to be up to one
* character away. This yields the following alignment.
*
*
* cs1
* M A R T H A
*
* matches
* 0 1 2 4 3 5
*
* cs2
* M A R H T A
*
*
*
* Note that the T
at index 3 in the first string
* aligns with the T
at index 4 in the second string,
* whereas the H
at index 4 in the first string alings
* with the H
at index 3 in the second string. The
* strings that do not directly align are rendered in bold. This is
* an instance of a transposition. The number of half transpositions
* is determined by comparing the subsequences of the first and second
* string matched, namely MARTHA
and MARHTA
.
* There are two positions with mismatched characters, 3 and 4. This
* results in two half transpositions, or a single transposition, for
* a Jaro distance of:
*
*
* jaroProximity(MARTHA,MARHTA) = 1/3 * (6/6 + 6/6 + (6 - 1)/6) = 0.944
*
* Three initial characters match, MAR
, for a Jaro-Winkler
* distance of:
*
*
* jaroWinklerProximity(MARTHA,MARHTA) = 0.944 + 0.1 * 3 * (1.0 - 0.944) = 0.961
*
* Next, consider matching strings of different lengths, such as
* JONES
and JOHNSON
:
*
*
* cs1
* J O N E S
*
* matches
* 0 1 3 - 5
*
* cs2
* J O H N S O N
*
*
*
* The unmatched characters are rendered in italics. Here the
* subsequence of matched characters for the two strings are JONS
and
* JONS
, so there are no transpositions. Thus the Jaro
* distance is:
*
*
* jaroProximity(JONES,JOHNSON)
* = 1/3 * (4/5 + 4/7 + (4 - 0)/4) = 0.790
*
* The strings JONES
and JOHNSON
only
* match on their first two characters, JO
, so the
* Jaro-Winkler distance is:
*
*
* jaroWinklerProximity(JONES,JOHNSON)
* = .790 + 0.1 * 2 * (1.0 - .790) = 0.832
*
* We will now consider some artificial examples not drawn from
* (Winkler 2006). First, compare ABCVWXYZ
and
* CABVWXYZ
, which are of length 8, allowing alignments
* up to 8/4 - 1 = 3
positions away. This leads
* to the following alignment:
*
*
* cs1
* A B C V W X Y Z
*
* matches
* 1 2 0 3 4 5 6 7
*
* cs2
* C A B V W X Y Z
*
*
*
* Here, there are 8/8 matches in both strings. There are only
* three half-transpositions, in the first three characters,
* because no position of CAB
has an identical character
* to ABC
. This yields a total of one transposition,
* for a Jaro score of:
*
*
* jaroProximity(ABCVWXYZ,CABVWXYZ)
* = 1/3 * (8/8 + 8/8 + (8-1)/8) = .958
*
* There is no initial prefix match, so the Jaro-Winkler comparison
* produces the same result. Now consider matching
* ABCVWXYZ
to CBAWXYZ
. Here, the initial
* alignment is 2, 1, 0
, which yields only two half
* transpositions. Thus under the Jaro distance, ABC
is
* closer to CBA
than to CAB
, though due to
* integer rounding in computing the number of transpositions, this
* will only affect the final result if there is a further
* transposition in the strings.
*
*
Now consider the 10-character string ABCDUVWXYZ
.
* This allows matches up to 10/2 - 1 = 4
positions away.
* If matched against DABCUVWXYZ
, the result is 10
* matches, and 4 half transposes, or 2 transposes. Now consider
* matching ABCDUVWXYZ
against DBCAUVWXYZ
.
* Here, index 0 in the first string (A
) maps to
* index 3 in the second string, and index 3 in the first string
* (D
) maps to index 0 in the second string, but
* positions 1 and 2 (B
and C
) map to
* themselves. Thus when comparing the output, there are only two
* half transpositions, thus making the second example
* DBCAUVWXYZ
closer than DABCUVWXYZ
to the
* first string ABCDUVWXYZ
.
*
*
Note that the transposition count cannot be determined solely by
* the mapping. For instance, the string ABBBUVWXYZ
* matches BBBAUVWXYZ
with alignment 4, 0, 1, 2, 5, 6, 7,
* 8, 9, 1
. But there are only two half-transpositions, because
* only index 0 and index 3 mismatch in the subsequences of matching
* characters. Contrast this with ABCDUVWXYZ
matching
* DABCUVWXYZ
, which has the same alignment, but four
* half transpositions.
*
*
The greedy nature of the alignment phase in the Jaro-Winkler
* algorithm actually prevents the optimal alignments from being found
* in some cases. Consider the alignment of ABCAWXYZ
* with BCAWXYZ
:
*
*
* cs1
* A B C A W X Y Z
*
* matches
* 2 0 1 - 3 4 5 6
*
* cs2
* B C A W X Y Z
*
*
*
* Here the first pair of A
characters are matched,
* leading to three half transposes (the first three matched
* characters). A better scoring, though illegal, alignment would be
* the following, because it has the same number of matches, but no
* transposes:
*
*
* cs1
* A B C A W X Y Z
*
* matches
* - 0 1 2 3 4 5 6
*
* cs2
* B C A W X Y Z
*
*
*
* The illegal links are highlighted in yellow. Note that neither alignment
* matches in the initial character, so the Winkler adjustments do not apply.
*
*
Implementation Notes
*
* This class's implementation is a literal translation of the C
* algorithm used in William E. Winkler's papers and for the 1995
* U.S. Census Deduplication. The algorithm is the work of
* multiple authors and available from the folloiwng link:
*
*
* -
* Winkler, Bill, George McLaughlin, Matt Jaro and Marueen Lynch. 1994.
* strcmp95.c,
* Version 2. United States Census Bureau.
*
*
*
* Unlike the C version, the {@link
* #distance(CharSequence,CharSequence)} and {@link
* #proximity(CharSequence,CharSequence)} methods do not require its
* inputs to be padded with spaces. In addition, spaces are treated
* just like any other characters within the algorithm itself. There
* is also no case normalization in this class's version.
* Furthermore, the boundary conditions are changed so that two empty
* strings return a score of 1.0
rather than zero, as in
* the original algorithm.
*
*
Jaro's origial implementation is described in:
*
*
* - Jaro, Matthew A. 1989. Advances in Record-Linkage Methodology as
Applied to Matching the 1985 Census of Tampa, Florida. Journal of the
American Statistical Association 84(406):414--420.
*
*
* Winkler's modified algorithm, along with applications in record
* linkage, are described in the following highly readable survey
* article:
*
*
* -
* Winkler, William E. 2006.
* Overview of
* Record Linkage and Current Research Directions.
* Statistical Research Division, U.S. Census Bureau.
*
*
*
* This document provides test cases in Table 6, which are the basis
* for the unit tests for this class (though note the three 0.0
* results in the table do not agree with the return results of
* strcmp95.c
or the results of this class, which matches
* strcmp95.c
). The description of the matching
* procedure above is based on the actual strcmp95
code,
* the boundary conditions of which are not obvious from the text
* descriptions in the literature. An additional difference is that
* strcmp95
, but not the algorithms in Winkler's papers
* nor the algorithm in this class, provides the possibility of
* partial matches with similar-sounding characters
* (e.g. c
and k
).
*
* Acknowledgements
*
* We'd like to thank Bill Winkler for helping us understand the
* versions of the algorithm and providing the strcmp95.c
* code as a reference implementation.
*
* @author Bob Carpenter
* @version 3.0
* @since LingPipe2.4
*/
public class JaroWinklerDistance
implements Distance,
Proximity {
private final double mWeightThreshold;
private final int mNumChars;
/**
* Construct a basic Jaro string distance without the Winkler
* modifications. See the class documentation above for more information
* on the exact algorithm and its parameters.
*/
public JaroWinklerDistance() {
this(Double.POSITIVE_INFINITY,0);
}
/**
* Construct a Winkler-modified Jaro string distance with the
* specified weight threshold for refinement and an initial number
* of characters over which to reweight. See the class
* documentation above for more information on the exact algorithm
* and its parameters.
*/
public JaroWinklerDistance(double weightThreshold, int numChars) {
mNumChars = numChars;
mWeightThreshold = weightThreshold;
}
/**
* Returns the Jaro-Winkler distance between the specified character
* sequences. Teh distance is symmetric and will fall in the
* range 0
(perfect match) to 1
(no overlap).
* See the class definition above for formal definitions.
*
* This method is defined to be:
*
*
* distance(cSeq1,cSeq2) = 1 - proximity(cSeq1,cSeq2)
*
* @param cSeq1 First character sequence to compare.
* @param cSeq2 Second character sequence to compare.
* @return The Jaro-Winkler comparison value for the two character
* sequences.
*/
public double distance(CharSequence cSeq1, CharSequence cSeq2) {
return 1.0 - proximity(cSeq1,cSeq2);
}
/**
* Return the Jaro-Winkler comparison value between the specified
* character sequences. The comparison is symmetric and will fall
* in the range 0
(no match) to 1
* (perfect match)inclusive. See the class definition above for
* an exact definition of Jaro-Winkler string comparison.
*
* The method {@link #distance(CharSequence,CharSequence)} returns
* a distance measure that is one minus the comparison value.
*
* @param cSeq1 First character sequence to compare.
* @param cSeq2 Second character sequence to compare.
* @return The Jaro-Winkler comparison value for the two character
* sequences.
*/
public double proximity(CharSequence cSeq1, CharSequence cSeq2) {
int len1 = cSeq1.length();
int len2 = cSeq2.length();
if (len1 == 0)
return len2 == 0 ? 1.0 : 0.0;
int searchRange = Math.max(0,Math.max(len1,len2)/2 - 1);
boolean[] matched1 = new boolean[len1];
Arrays.fill(matched1,false);
boolean[] matched2 = new boolean[len2];
Arrays.fill(matched2,false);
int numCommon = 0;
for (int i = 0; i < len1; ++i) {
int start = Math.max(0,i-searchRange);
int end = Math.min(i+searchRange+1,len2);
for (int j = start; j < end; ++j) {
if (matched2[j]) continue;
if (cSeq1.charAt(i) != cSeq2.charAt(j))
continue;
matched1[i] = true;
matched2[j] = true;
++numCommon;
break;
}
}
if (numCommon == 0) return 0.0;
int numHalfTransposed = 0;
int j = 0;
for (int i = 0; i < len1; ++i) {
if (!matched1[i]) continue;
while (!matched2[j]) ++j;
if (cSeq1.charAt(i) != cSeq2.charAt(j))
++numHalfTransposed;
++j;
}
// System.out.println("numHalfTransposed=" + numHalfTransposed);
int numTransposed = numHalfTransposed/2;
// System.out.println("numCommon=" + numCommon
// + " numTransposed=" + numTransposed);
double numCommonD = numCommon;
double weight = (numCommonD/len1
+ numCommonD/len2
+ (numCommon - numTransposed)/numCommonD)/3.0;
if (weight <= mWeightThreshold) return weight;
int max = Math.min(mNumChars,Math.min(cSeq1.length(),cSeq2.length()));
int pos = 0;
while (pos < max && cSeq1.charAt(pos) == cSeq2.charAt(pos))
++pos;
if (pos == 0) return weight;
return weight + 0.1 * pos * (1.0 - weight);
}
/**
* A constant for the Jaro distance. The value is the same as
* would be returned by the nullary constructor
* JaroWinklerDistance()
.
*
*
Instances are thread safe, so this single distance instance
* may be used for all comparisons within an application.
*/
public static final JaroWinklerDistance JARO_DISTANCE
= new JaroWinklerDistance();
/**
* A constant for the Jaro-Winkler distance with defaults set as
* in Winkler's papers. The value is the same as would be
* returned by the nullary constructor
* JaroWinklerDistance(0.7,4)
.
*
*
Instances are thread safe, so this single distance instance
* may be used for all comparisons within an application.
*/
public static final JaroWinklerDistance JARO_WINKLER_DISTANCE
= new JaroWinklerDistance(0.70,4);
}