All Downloads are FREE. Search and download functionalities are using the official Maven repository.

com.aliasi.spell.JaroWinklerDistance Maven / Gradle / Ivy

Go to download

This is the original Lingpipe: http://alias-i.com/lingpipe/web/download.html There were not made any changes to the source code.

There is a newer version: 4.1.2-JL1.0
Show newest version
/*
 * LingPipe v. 4.1.0
 * Copyright (C) 2003-2011 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.spell;

import com.aliasi.util.Distance;
import com.aliasi.util.Proximity;

import java.util.Arrays;

/**
 * The JaroWinklerDistance class implements the original
 * Jaro string comparison as well as Winkler's modifications.  As a
 * distance measure, Jaro-Winkler returns values between
 * 0 (exact string match) and 1 (no matching
 * characters).  Note that this is reversed from the original
 * definitions of Jaro and Winkler in order to produce a distance-like
 * ordering.  The original Jaro-Winkler string comparator returned
 * 1 for a perfect match and 0 for
 * complete mismatch; our method returns one minus the Jaro-Winkler
 * measure.
 *
 * 

The Jaro-Winkler distance measure was developed for name comparison * in the U.S. Census. It is designed to compae surnames to surnames * and given names to given names, not whole names to whole names. * There is no character-specific information in this implementation, * but assumptions are made about typical lengths and the significance * of initial matches that may not apply to all languages. * *

The easiest way to understand the Jaro measure and the Winkler * variants is procedurally. The Jaro measure involves two steps, * first to compute the number of "matches" and second to * compute the number of "transpositions". The Winkler * adjustment involves a final rescoring based on an exact match score * for the initial characters of both strings. * *

Formal Definition of Jaro-Winkler Distance

* *

Suppose we are comparing character sequences cs1 * and cs2. The Jaro-Winkler distance is defined by * the following steps. After the definitions, we consider some * examples. * *

Step 1: Matches: The match phase is a greedy alignment * step of characters in one string against the characters in another * string. The maximum distance (measured by array index) at which * characters may be matched is defined by: * *

 *   matchRange = max(cs1.length(), cs2.length()) / 2 - 1
* *

The match phase is a greedy alignment that proceeds character by * character through the first string, though the distance metric is * symmetric (that, is reversing the order of arguments does not * affect the result). For each character encountered in the first * string, it is matched to the first unaligned character in the * second string that is an exact character match. If there is no * such character within the match range window, the character is left * unaligned. * *

Step 2: Transpositions: After matching, the subsequence * of characters actually matched in both strings is extracted. These * subsequences will be the same length. The number of characters in * one string that do not line up (by index in the matched * subsequence) with identical characters in the other string is the * number of "half transpositions". The total number of * transpoisitons is the number of half transpositions divided by two, * rounding down. * *

The Jaro distance is then defined in terms of the number * of matching characters matches and the number * of transpositions, transposes: * *

 *   jaroProximity(cs1,cs2)
 *     = ( matches(cs1,cs2) / cs1.length()
 *         + matches(cs1,cs2) / cs2.length()
 *         + (matches(cs1,cs2) - transposes(cs1,cs2)) / matches(cs1,cs2) ) / 3
 *
 *   jaroDistance(cs1,cs2) = 1 - jaroProximity(cs1,cs2)
* *

In words, the measure is the average of three values; (a) the * percentage of the first string matched, (b) the percentage of the * second string matched, and (c) the percentage of matches that were * not transposed. * *

Step 3: Winkler Modification The Winkler modification to * the Jaro comparison, resulting in the Jaro-Winkler comparison, * boosts scores for strings that match character for character * initially. Let boostThreshold be the minimum score * for a string that gets boosted. This value was set to * 0.7 in Winkler's papers (see references below). If * the Jaro score is below the boost threshold, the Jaro score is * returned unadjusted. The second parameter for the Winkler * modification is the size of the initial prefix considered, * prefixSize. The prefix size was set to 4 * in Winkler's papers. Next, let * prefixMatch(cs1,cs2,prefixSize) be the number of * characters in the prefix of cs1 and cs2 * that exactly match (by original index), up to a maximum of * prefixSize. The modified distance is then defined to * be: * *

 *   jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)
 *     = jaroMeasure(cs1,cs2) <= boostThreshold
 *     ? jaroMeasure(cs1,cs2)
 *     : jaroMeasure(cs1,cs2)
 *       + 0.1 * prefixMatch(cs1,cs2,prefixSize) * (1.0 - jaroDistance(cs1,cs2))
 *
 *   jaroWinklerDistance(cs1,cs2,boostThreshold,prefixSize)
 *     = 1 - jaroWinklerProximity(cs1,cs2,boostThreshold,prefixSize)
* *

Examples: We will present the alignment steps in the form * of tables, with offsets in the second string below the first string * positions that match. For a simple example, consider comparing the * given (nick)name AL to itself. Both strings are of * length 2. Thus the maximum match distance is max(2,2)/2 - 1 * = 0, meaning all matches must be exact. The matches are * illustrated in the following table: * *

* * * *
cs1AL
matches01
cs2AL
* *

The notation in the matches row is meant to indicate that the * A at index 0 in cs1 is * matched to the A at index 0 in * cs2. Similarlty for the L at index 1 in * cs1, which matches the L at index 1 in * cs2. This results in matches(AL,AL) = 2. * There are no transpositions, so the Jaro distance is just: * *

 *   jaroProximity(AL,AL) = 1/3*(2/2 + 2/2 + (2-0)/2) = 1.0
* *

Applying the Winkler modification yields the same result: * *

 *   jaroWinklerProximity(AL,AL) = 1 +  0.1 * 2 * (1.0 - 1) = 1.0
* *

Next consider a more complex case, matching MARTHA and * MARHTA. Here the match distance is max(6,6)/2 - * 1 = 6/2 - 1 = 2, allowing matching characters to be up to one * character away. This yields the following alignment. * *

* * * * * * * * * *
cs1MARTHA
matches012435
cs2MARHTA
* *

Note that the T at index 3 in the first string * aligns with the T at index 4 in the second string, * whereas the H at index 4 in the first string alings * with the H at index 3 in the second string. The * strings that do not directly align are rendered in bold. This is * an instance of a transposition. The number of half transpositions * is determined by comparing the subsequences of the first and second * string matched, namely MARTHA and MARHTA. * There are two positions with mismatched characters, 3 and 4. This * results in two half transpositions, or a single transposition, for * a Jaro distance of: * *

 *   jaroProximity(MARTHA,MARHTA) = 1/3 * (6/6 + 6/6 + (6 - 1)/6) = 0.944
* * Three initial characters match, MAR, for a Jaro-Winkler * distance of: * *
 *   jaroWinklerProximity(MARTHA,MARHTA) = 0.944 + 0.1 * 3 * (1.0 - 0.944) = 0.961
* *

Next, consider matching strings of different lengths, such as * JONES and JOHNSON: * *

* * * * * * * * * *
cs1JONES
matches013-5
cs2JOHNSON
* *

The unmatched characters are rendered in italics. Here the * subsequence of matched characters for the two strings are JONS and * JONS, so there are no transpositions. Thus the Jaro * distance is: * *

 *   jaroProximity(JONES,JOHNSON)
 *     = 1/3 * (4/5 + 4/7 + (4 - 0)/4) = 0.790
* *

The strings JONES and JOHNSON only * match on their first two characters, JO, so the * Jaro-Winkler distance is: * *

 *   jaroWinklerProximity(JONES,JOHNSON)
 *     = .790 + 0.1 * 2 * (1.0 - .790) = 0.832
* *

We will now consider some artificial examples not drawn from * (Winkler 2006). First, compare ABCVWXYZ and * CABVWXYZ, which are of length 8, allowing alignments * up to 8/4 - 1 = 3 positions away. This leads * to the following alignment: * *

* * * * * * * * * *
cs1ABCVWXYZ
matches12034567
cs2CABVWXYZ
* *

Here, there are 8/8 matches in both strings. There are only * three half-transpositions, in the first three characters, * because no position of CAB has an identical character * to ABC. This yields a total of one transposition, * for a Jaro score of: * *

 *   jaroProximity(ABCVWXYZ,CABVWXYZ)
 *     = 1/3 * (8/8 + 8/8 + (8-1)/8) = .958
* *

There is no initial prefix match, so the Jaro-Winkler comparison * produces the same result. Now consider matching * ABCVWXYZ to CBAWXYZ. Here, the initial * alignment is 2, 1, 0, which yields only two half * transpositions. Thus under the Jaro distance, ABC is * closer to CBA than to CAB, though due to * integer rounding in computing the number of transpositions, this * will only affect the final result if there is a further * transposition in the strings. * *

Now consider the 10-character string ABCDUVWXYZ. * This allows matches up to 10/2 - 1 = 4 positions away. * If matched against DABCUVWXYZ, the result is 10 * matches, and 4 half transposes, or 2 transposes. Now consider * matching ABCDUVWXYZ against DBCAUVWXYZ. * Here, index 0 in the first string (A) maps to * index 3 in the second string, and index 3 in the first string * (D) maps to index 0 in the second string, but * positions 1 and 2 (B and C) map to * themselves. Thus when comparing the output, there are only two * half transpositions, thus making the second example * DBCAUVWXYZ closer than DABCUVWXYZ to the * first string ABCDUVWXYZ. * *

Note that the transposition count cannot be determined solely by * the mapping. For instance, the string ABBBUVWXYZ * matches BBBAUVWXYZ with alignment 4, 0, 1, 2, 5, 6, 7, * 8, 9, 1. But there are only two half-transpositions, because * only index 0 and index 3 mismatch in the subsequences of matching * characters. Contrast this with ABCDUVWXYZ matching * DABCUVWXYZ, which has the same alignment, but four * half transpositions. * *

The greedy nature of the alignment phase in the Jaro-Winkler * algorithm actually prevents the optimal alignments from being found * in some cases. Consider the alignment of ABCAWXYZ * with BCAWXYZ: * *

* * * * * * * * * *
cs1ABCAWXYZ
matches201-3456
cs2BCAWXYZ 
* *

Here the first pair of A characters are matched, * leading to three half transposes (the first three matched * characters). A better scoring, though illegal, alignment would be * the following, because it has the same number of matches, but no * transposes: * *

* * * * * * * * * *
cs1ABCAWXYZ
matches-0123456
cs2BCAWXYZ 
* *

The illegal links are highlighted in yellow. Note that neither alignment * matches in the initial character, so the Winkler adjustments do not apply. * *

Implementation Notes

* *

This class's implementation is a literal translation of the C * algorithm used in William E. Winkler's papers and for the 1995 * U.S. Census Deduplication. The algorithm is the work of * multiple authors and available from the folloiwng link: * *

    *
  • * Winkler, Bill, George McLaughlin, Matt Jaro and Marueen Lynch. 1994. * strcmp95.c, * Version 2. United States Census Bureau. *
  • *
* *

Unlike the C version, the {@link * #distance(CharSequence,CharSequence)} and {@link * #proximity(CharSequence,CharSequence)} methods do not require its * inputs to be padded with spaces. In addition, spaces are treated * just like any other characters within the algorithm itself. There * is also no case normalization in this class's version. * Furthermore, the boundary conditions are changed so that two empty * strings return a score of 1.0 rather than zero, as in * the original algorithm. * *

Jaro's origial implementation is described in: * *

    *
  • Jaro, Matthew A. 1989. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association 84(406):414--420. *
* *

Winkler's modified algorithm, along with applications in record * linkage, are described in the following highly readable survey * article: * *

* * This document provides test cases in Table 6, which are the basis * for the unit tests for this class (though note the three 0.0 * results in the table do not agree with the return results of * strcmp95.c or the results of this class, which matches * strcmp95.c). The description of the matching * procedure above is based on the actual strcmp95 code, * the boundary conditions of which are not obvious from the text * descriptions in the literature. An additional difference is that * strcmp95, but not the algorithms in Winkler's papers * nor the algorithm in this class, provides the possibility of * partial matches with similar-sounding characters * (e.g. c and k). * *

Acknowledgements

* *

We'd like to thank Bill Winkler for helping us understand the * versions of the algorithm and providing the strcmp95.c * code as a reference implementation. * * @author Bob Carpenter * @version 3.0 * @since LingPipe2.4 */ public class JaroWinklerDistance implements Distance, Proximity { private final double mWeightThreshold; private final int mNumChars; /** * Construct a basic Jaro string distance without the Winkler * modifications. See the class documentation above for more information * on the exact algorithm and its parameters. */ public JaroWinklerDistance() { this(Double.POSITIVE_INFINITY,0); } /** * Construct a Winkler-modified Jaro string distance with the * specified weight threshold for refinement and an initial number * of characters over which to reweight. See the class * documentation above for more information on the exact algorithm * and its parameters. */ public JaroWinklerDistance(double weightThreshold, int numChars) { mNumChars = numChars; mWeightThreshold = weightThreshold; } /** * Returns the Jaro-Winkler distance between the specified character * sequences. Teh distance is symmetric and will fall in the * range 0 (perfect match) to 1 (no overlap). * See the class definition above for formal definitions. * *

This method is defined to be: * *

     *   distance(cSeq1,cSeq2) = 1 - proximity(cSeq1,cSeq2)
* * @param cSeq1 First character sequence to compare. * @param cSeq2 Second character sequence to compare. * @return The Jaro-Winkler comparison value for the two character * sequences. */ public double distance(CharSequence cSeq1, CharSequence cSeq2) { return 1.0 - proximity(cSeq1,cSeq2); } /** * Return the Jaro-Winkler comparison value between the specified * character sequences. The comparison is symmetric and will fall * in the range 0 (no match) to 1 * (perfect match)inclusive. See the class definition above for * an exact definition of Jaro-Winkler string comparison. * *

The method {@link #distance(CharSequence,CharSequence)} returns * a distance measure that is one minus the comparison value. * * @param cSeq1 First character sequence to compare. * @param cSeq2 Second character sequence to compare. * @return The Jaro-Winkler comparison value for the two character * sequences. */ public double proximity(CharSequence cSeq1, CharSequence cSeq2) { int len1 = cSeq1.length(); int len2 = cSeq2.length(); if (len1 == 0) return len2 == 0 ? 1.0 : 0.0; int searchRange = Math.max(0,Math.max(len1,len2)/2 - 1); boolean[] matched1 = new boolean[len1]; Arrays.fill(matched1,false); boolean[] matched2 = new boolean[len2]; Arrays.fill(matched2,false); int numCommon = 0; for (int i = 0; i < len1; ++i) { int start = Math.max(0,i-searchRange); int end = Math.min(i+searchRange+1,len2); for (int j = start; j < end; ++j) { if (matched2[j]) continue; if (cSeq1.charAt(i) != cSeq2.charAt(j)) continue; matched1[i] = true; matched2[j] = true; ++numCommon; break; } } if (numCommon == 0) return 0.0; int numHalfTransposed = 0; int j = 0; for (int i = 0; i < len1; ++i) { if (!matched1[i]) continue; while (!matched2[j]) ++j; if (cSeq1.charAt(i) != cSeq2.charAt(j)) ++numHalfTransposed; ++j; } // System.out.println("numHalfTransposed=" + numHalfTransposed); int numTransposed = numHalfTransposed/2; // System.out.println("numCommon=" + numCommon // + " numTransposed=" + numTransposed); double numCommonD = numCommon; double weight = (numCommonD/len1 + numCommonD/len2 + (numCommon - numTransposed)/numCommonD)/3.0; if (weight <= mWeightThreshold) return weight; int max = Math.min(mNumChars,Math.min(cSeq1.length(),cSeq2.length())); int pos = 0; while (pos < max && cSeq1.charAt(pos) == cSeq2.charAt(pos)) ++pos; if (pos == 0) return weight; return weight + 0.1 * pos * (1.0 - weight); } /** * A constant for the Jaro distance. The value is the same as * would be returned by the nullary constructor * JaroWinklerDistance(). * *

Instances are thread safe, so this single distance instance * may be used for all comparisons within an application. */ public static final JaroWinklerDistance JARO_DISTANCE = new JaroWinklerDistance(); /** * A constant for the Jaro-Winkler distance with defaults set as * in Winkler's papers. The value is the same as would be * returned by the nullary constructor * JaroWinklerDistance(0.7,4). * *

Instances are thread safe, so this single distance instance * may be used for all comparisons within an application. */ public static final JaroWinklerDistance JARO_WINKLER_DISTANCE = new JaroWinklerDistance(0.70,4); }





© 2015 - 2025 Weber Informatics LLC | Privacy Policy