All Downloads are FREE. Search and download functionalities are using the official Maven repository.

com.aliasi.suffixarray.CharSuffixArray Maven / Gradle / Ivy

Go to download

This is the original Lingpipe: http://alias-i.com/lingpipe/web/download.html There were not made any changes to the source code.

There is a newer version: 4.1.2-JL1.0
Show newest version
/*
 * LingPipe v. 4.1.0
 * Copyright (C) 2003-2011 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.suffixarray;

import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/**
 * A {@code CharSuffixArray} implements a suffix array of characters.
 *
 * 

What is a Suffix Array?

* *

Given a string characters {@code cs}, the corresponding * suffix array is an array of {@code int} values of length equal to * {@code cs.length()}. The suffix array contains each integer between 0 * (inclusive) and the length of {@code cs} (exclusive). The suffix * array is sorted so that an index {@code m} appears before {@code n} * only if the string running from index {@code m} to the end of * {@code cs} (i.e., {@code cs.substring(m,cs.length-1)} is less * than the string running from index {@code n} to the end of {@code cs}, * using ordinary Java {@code String} comparison. * *

Example

* * The standard example is the suffix array for the array * of characters derived from the string {@code "abracadabra"}. * Here's the string itself, with its corresponding indexes: * *
 * abracadabra
 * 012345678901
 * 0         1
 * 
* * The suffixes and their starting indexes are * *
* * * * * * * * * * * * *
Char Array IndexSuffix
0abracadabra
1bracadabra
2racadabra
3acadabra
4cadabra
5adabra
6dabra
7abra
8bra
9ra
10a
* * The suffix array sorts the char array indexes based on the sort * order of the corresponding suffixes as strings. * *
* * * * * * * * * * * * *
Suffix IndexValueSuffix
010a
17abra
20abracadabra
33acadabra
45adabra
58bra
61bracadabra
74cadabra
86dabra
99ra
102racadabra
* * Thus the suffix array itself for {@code "abracadabra"} * is the {@code int[]}-type array * *
 * suffixArray("abracadbra")
 * = { 10, 7, 0, 3, 5, 8, 1, 4, 6, 9, 2 }
 * 
* * *

Constructing a Suffix Array

* * A suffix array is constructed from a character array and optionally * a maximum length at which to compare strings. If the maximum length * is less than the length of the array, strings are truncated to be at * most this length before comparison. This isn't a true suffix array, * but can be faster to create and will suffice for many applications. * The indexes will be sorted relative to the truncated strings, so they * will be in order up to the specified length. * *

Using Suffix Arrays

* * The primary application of suffix arrays is finding duplicate * substrings. The key idea is that two substrings of a string can be * represented as the prefixes of two suffixes of the string. For * instance, the example above, {@code "abracadabra"}, has two * instances of the substring {@code "br"}, corresponding to the * suffixes {@code "bracadabra"} starting at index 1 in the original * string and {@code "bra"} starting at index 8 in the original * string. Note that these two suffixes are adjacent in the suffix * array, occupying indexes 5 and 6 (in reverse order, because suffix * {@code "bra"} sorts before {@code "bracadabra"} as a string. * *

The method {@code prefixMatches(int)} will return all spans in * the suffix array that match up to a specific number of characters. * For instance, to find all substrings that match of length 3 from * suffix array {@code sa}, the method call {@code * sa.prefixMatches(3)} returns a list containing all spans as integer * arrays of type {@code int[]} with spans being represented from * start position (inclusive) to end position (exclusive), which would * here contain elements {1,3}, {5,7} indicating first * that the suffixes at positions 1 and 2, namely {@code "abra"} and * {@code "abracadabra"} start with the same three characters and * second that the suffixes at positions 5 and 6, {@code "bra"} and * {@code "bracadabra"}, start with the same three characters. Thus * we found all substrings of length 3 that occur more than once, * namely {@code "abr"} and {@code "bra"}, along with their positions. * * By using the suffix array itself, the positions in the underlying * string may be retrieved. For instance, the suffixes at positions * 1 and 2 in the suffix array start at positions * *

Thread Safety

* * A suffix array is thread safe after construction. * * @author Bob Carpenter * @version 4.1.0 * @since 4.0.2 */ public class CharSuffixArray { private final String mText; private final int[] mSuffixArray; private final int mMaxSuffixLength; /** * Construct a suffix array from the specified string, with no * bound on suffix length. * * @param text Underlying characters making up suffix array. */ public CharSuffixArray(String text) { this(text,Integer.MAX_VALUE); } /** * Construct a suffix array from the specified string, bounding * comparisons for sorting by the specified maximum suffix length. * *

This constructor is appropriate if no operations will be * subsequently performed on suffixes greater than the maximum * specified length. * * @param text Underlying text for suffix array. * @param maxSuffixLength Maximum suffix length for comparison. */ public CharSuffixArray(String text, int maxSuffixLength) { mText = text; mMaxSuffixLength = maxSuffixLength; Integer[] is = new Integer[text.length()]; for (int i = 0; i < text.length(); ++i) is[i] = i; Arrays.sort(is,new IndexComparator()); int[] suffixArray = new int[is.length]; for (int i = 0; i < is.length; ++i) suffixArray[i] = is[i]; mSuffixArray = suffixArray; } /** * Returns the underlying array of characters for this class. * * @return The text underlying this suffix array. */ public String text() { return mText; } /** * Returns the maximum suffix length for this character suffix * array. * * @return Maximum length of suffixes. */ public int maxSuffixLength() { return mMaxSuffixLength; } /** * Return the value of the suffix array at the specified index. * * @param idx Index into suffix array. * @return Index of suffix start position in the underlying * character array. */ public int suffixArray(int idx) { return mSuffixArray[idx]; } /** * Return the number of entries in this suffix array. * * @return Length of the suffix array. */ public int suffixArrayLength() { return mText.length(); } /** * Returns the string that starts at position {@code i} in * the character index and runs to the end of the character array * or up to the specified maximum length. * * @param csIndex Starting index in underlying array of characters. * @param maxLength Maximum length of returned string. * @return String starting at the specified index to the end of * the character array, truncated at max length. */ public String suffix(int csIndex, int maxLength) { return mText.substring(csIndex,end(csIndex,mText.length(),maxLength)); } // req not to overflow static int end(int csIndex, int textLength, int maxLength) { return (csIndex + (long) maxLength >= textLength) ? textLength : csIndex + maxLength; } /** * Returns a list of maximal spans of suffix array indexes which * refer to suffixes that share a prefix of at least the specified * minimum match length. * * @param minMatchLength Minimum number of characters required to * match. * @return The list of pairs of start (inclusive) and end * (exclsuive) positions in the suffix array that match up * to the specified minimum number of characters. */ public List prefixMatches(int minMatchLength) { List matches = new ArrayList(); for (int i = 0; i < mSuffixArray.length; ) { int j = suffixesMatchTo(i,minMatchLength); if (i + 1 != j) { matches.add(new int[] { i, j }); i = j; } else { ++i; } } return matches; } private int suffixesMatchTo(int i, int minMatchLength) { int index1 = mSuffixArray[i]; int j = i+1; for (; j < mSuffixArray.length; ++j) { int index2 = mSuffixArray[j]; if (!matchChars(index1,index2,minMatchLength)) break; } return j; } private boolean matchChars(int index1, int index2, int minMatches) { if (index1 + minMatches > mSuffixArray.length) return false; // not enough chars if (index2 + minMatches > mSuffixArray.length) return false; // not enough chars for (int k = 0; k < minMatches; ++k) if (mText.charAt(index1 + k) != mText.charAt(index2 + k)) return false; return true; } /** * A special separator character, used to mark the * boundaries of documents within the character array. * Suffixes are considered virtually to only run up * to a separator. * * The value of the separator char is {@code '\uFFFF'}. */ public static char SEPARATOR = '\uFFFF'; class IndexComparator implements Comparator { public int compare(Integer i, Integer j) { String cs = mText; for (int m = i, n = j, k = 0; k < mMaxSuffixLength; ++m, ++n, ++k) { if (m == cs.length() || cs.charAt(m) == SEPARATOR) { if (n == cs.length() || cs.charAt(n) == SEPARATOR) return 0; return -1; } if (n == cs.length() || cs.charAt(n) == SEPARATOR) return 1; if (cs.charAt(m) < cs.charAt(n)) return -1; if (cs.charAt(m) > cs.charAt(n)) return 1; } return 0; } } }





© 2015 - 2025 Weber Informatics LLC | Privacy Policy