com.ibm.icu.text.SpoofChecker

International Components for Unicode for Java (ICU4J) is a mature, widely used Java library providing Unicode and globalization support.

/*
 ***************************************************************************
 * Copyright (C) 2008-2012, International Business Machines Corporation
 * and others. All Rights Reserved.
 ***************************************************************************
 *
 * Unicode Spoof Detection
 */
package com.ibm.icu.text;

import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.LineNumberReader;
import java.io.Reader;
import java.text.ParseException;
import java.util.Collections;
import java.util.Comparator;
import java.util.Hashtable;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.Vector;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

import com.ibm.icu.impl.Trie2;
import com.ibm.icu.impl.Trie2Writable;
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.lang.UCharacterCategory;
import com.ibm.icu.lang.UProperty;
import com.ibm.icu.lang.UScript;
import com.ibm.icu.util.ULocale;

/**
 *
 * Unicode Security and Spoofing Detection.
 *
 * 

This class is intended to check strings, typically identifiers of some type, such as URLs, for the presence of characters that are likely to be visually confusing, for cases where the displayed form of an identifier may not be what it appears to be.

Unicode Technical Report #36, "Unicode Security Considerations" (http://unicode.org/reports/tr36), and Unicode Technical Standard #39, "Unicode Security Mechanisms" (http://unicode.org/reports/tr39), give more background on security and spoofing issues with Unicode identifiers. The tests and checks provided by this module implement the recommendations from these Unicode documents.

The tests available on identifiers fall into two general categories:

  • Single identifier tests. Check whether an identifier is potentially confusable with any other string, or is suspicious for other reasons.
  • Two identifier tests. Check whether two specific identifiers are confusable. This does not consider whether either of the strings is potentially confusable with any string other than the exact one specified.

The steps to perform confusability testing are:

  • Create a SpoofChecker.Builder.
  • Configure the Builder for the desired set of tests. The tests that will be performed are specified by a set of SpoofCheck flags.
  • Build a SpoofChecker from the Builder.
  • Perform the checks using the pre-configured SpoofChecker. The results indicate which (if any) of the selected tests have identified possible problems with the identifier. Results are reported as a set of SpoofCheck flags; this mirrors the form in which the set of tests to perform was originally specified to the SpoofChecker.

A SpoofChecker instance may be used repeatedly to perform checks on any number of identifiers.
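The flag-based configuration and result reporting described in the steps above can be illustrated without the library itself. The following is a minimal sketch, assuming only the integer flag values declared as constants in this class; the class `SpoofFlagsSketch` and its helpers `validate` and `describe` are hypothetical names used for illustration and are not part of ICU4J.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch of SpoofChecker-style bit flags; not part of ICU4J. */
final class SpoofFlagsSketch {
    // Values copied from the constants declared in SpoofChecker.
    static final int SINGLE_SCRIPT_CONFUSABLE = 1;
    static final int MIXED_SCRIPT_CONFUSABLE = 2;
    static final int WHOLE_SCRIPT_CONFUSABLE = 4;
    static final int ANY_CASE = 8;
    static final int SINGLE_SCRIPT = 16;
    static final int INVISIBLE = 32;
    static final int CHAR_LIMIT = 64;
    static final int ALL_CHECKS = 0x7f;

    // Flag names, indexed by bit position.
    private static final String[] NAMES = {
        "SINGLE_SCRIPT_CONFUSABLE", "MIXED_SCRIPT_CONFUSABLE", "WHOLE_SCRIPT_CONFUSABLE",
        "ANY_CASE", "SINGLE_SCRIPT", "INVISIBLE", "CHAR_LIMIT"
    };

    /** Rejects bit masks containing unknown bits, in the spirit of Builder.setChecks(). */
    static int validate(int checks) {
        if ((checks & ~ALL_CHECKS) != 0) {
            throw new IllegalArgumentException("Bad Spoof Checks value.");
        }
        return checks;
    }

    /** Lists the names of the flags that are set in a bit mask. */
    static List<String> describe(int flags) {
        List<String> names = new ArrayList<>();
        for (int bit = 0; bit < NAMES.length; bit++) {
            if ((flags & (1 << bit)) != 0) {
                names.add(NAMES[bit]);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // Select the three confusability tests as a group.
        int checks = validate(SINGLE_SCRIPT_CONFUSABLE | MIXED_SCRIPT_CONFUSABLE | WHOLE_SCRIPT_CONFUSABLE);
        System.out.println(describe(checks));

        // A hypothetical result value: results use the same flags as the selection.
        int result = MIXED_SCRIPT_CONFUSABLE;
        System.out.println("problems found: " + describe(result & checks));
    }
}
```

Because the selected tests and the reported results share one set of flags, a caller can mask a result against the checks it cares about with a single bitwise AND.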

Thread Safety: The methods on SpoofChecker objects are thread safe. The test functions for checking a single identifier, or for testing whether two identifiers are potentially confusable, may be called concurrently from multiple threads using the same SpoofChecker instance.

Descriptions of the available checks.

When testing whether pairs of identifiers are confusable, with areConfusable(), the relevant tests are:

  • SINGLE_SCRIPT_CONFUSABLE: All of the characters from the two identifiers are from a single script, and the two identifiers are visually confusable.
  • MIXED_SCRIPT_CONFUSABLE: At least one of the identifiers contains characters from more than one script, and the two identifiers are visually confusable.
  • WHOLE_SCRIPT_CONFUSABLE: Each of the two identifiers is of a single script, but the two identifiers are from different scripts, and they are visually confusable.

The safest approach is to enable all three of these checks as a group.

ANY_CASE is a modifier for the above tests. If the identifiers being checked can be of mixed case and are used in a case-sensitive manner, this option should be specified.

If the identifiers being checked are used in a case-insensitive manner, and if they are displayed to users in lower-case form only, the ANY_CASE option should not be specified. Confusability issues involving upper case letters will not be reported.

When performing tests on a single identifier, with the check() family of functions, the relevant tests are:

  • MIXED_SCRIPT_CONFUSABLE: the identifier contains characters from multiple scripts, and there exists an identifier of a single script that is visually confusable.
  • WHOLE_SCRIPT_CONFUSABLE: the identifier consists of characters from a single script, and there exists a visually confusable identifier. The visually confusable identifier also consists of characters from a single script, but not the same script as the identifier being checked.
  • ANY_CASE: modifies the mixed script and whole script confusables tests. If specified, the checks will find confusable characters of any case. If this flag is not set, the test is performed assuming case folded identifiers.
  • SINGLE_SCRIPT: check that the identifier contains only characters from a single script. (Characters from the common and inherited scripts are ignored.) This is not a test for confusable identifiers.
  • INVISIBLE: check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences that are likely not to display, such as multiple occurrences of the same non-spacing mark. This check does not test the input string as a whole for conformance to any particular syntax for identifiers.
  • CHAR_LIMIT: check that an identifier contains only characters from a specified set of acceptable characters. See Builder.setAllowedChars() and Builder.setAllowedLocales().
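The "multiple occurrences of the same non-spacing mark" part of the INVISIBLE check can be pictured with a small standalone example. This is a simplified sketch, assuming the JDK's `java.text.Normalizer` and `Character.getType` in place of ICU's own normalization and property data; the class `RepeatedMarkSketch` and the method `hasRepeatedMark` are hypothetical names, and the sketch does not cover zero-width spaces or claim to match the library's implementation.

```java
import java.text.Normalizer;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of detecting a repeated non-spacing mark; not ICU4J code. */
final class RepeatedMarkSketch {

    /**
     * Returns true if, after NFD normalization, any combining sequence
     * contains the same non-spacing mark more than once.
     */
    static boolean hasRepeatedMark(String text) {
        String nfd = Normalizer.normalize(text, Normalizer.Form.NFD);
        Set<Integer> marksInSequence = new HashSet<>();
        for (int i = 0; i < nfd.length(); ) {
            int cp = nfd.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.getType(cp) != Character.NON_SPACING_MARK) {
                // Any other character starts a new combining sequence.
                marksInSequence.clear();
                continue;
            }
            if (!marksInSequence.add(cp)) {
                // This mark was already seen in the current sequence.
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(hasRepeatedMark("a\u0301"));         // one acute accent
        System.out.println(hasRepeatedMark("a\u0301\u0301"));   // the same accent twice
        System.out.println(hasRepeatedMark("a\u0301b\u0301"));  // one accent on each base
    }
}
```

Normalizing first means a precomposed letter followed by the same combining mark is caught as well, since the decomposition exposes both copies of the mark.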

Note on Scripts:

Characters from the Unicode Scripts "Common" and "Inherited" are ignored when considering the script of an identifier. Common characters include digits and symbols that are normally used with text from many different scripts.
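The rule in this note can be sketched with the JDK alone. The example below is an illustration under stated assumptions: it uses `java.lang.Character.UnicodeScript` rather than ICU's `UScript`, so script assignments follow the Unicode version bundled with the running JDK, and the class `ScriptSetSketch` with its methods `scriptsOf` and `isSingleScript` is hypothetical rather than part of this class.

```java
import java.util.EnumSet;
import java.util.Set;

/** Illustrative sketch of script collection that ignores Common and Inherited; not ICU4J code. */
final class ScriptSetSketch {

    /** Collects the scripts used by an identifier, skipping Common and Inherited characters. */
    static Set<Character.UnicodeScript> scriptsOf(String identifier) {
        Set<Character.UnicodeScript> scripts = EnumSet.noneOf(Character.UnicodeScript.class);
        identifier.codePoints().forEach(cp -> {
            Character.UnicodeScript script = Character.UnicodeScript.of(cp);
            if (script != Character.UnicodeScript.COMMON && script != Character.UnicodeScript.INHERITED) {
                scripts.add(script);
            }
        });
        return scripts;
    }

    /** A SINGLE_SCRIPT-style test: at most one script once Common and Inherited are ignored. */
    static boolean isSingleScript(String identifier) {
        return scriptsOf(identifier).size() <= 1;
    }

    public static void main(String[] args) {
        // Digits are Common, so they do not add a second script.
        System.out.println(isSingleScript("abc123"));       // true
        // U+0430 is CYRILLIC SMALL LETTER A, mixed in with Latin letters.
        System.out.println(isSingleScript("p\u0430ypal"));  // false
        // U+0301 (combining acute accent) is Inherited and is ignored.
        System.out.println(isSingleScript("cafe\u0301"));   // true
    }
}
```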
 *
 * @stable ICU 4.6
 */
public class SpoofChecker {

    /**
     * Constants for the kinds of checks that SpoofChecker can perform. These values are used both to select the set of
     * checks that will be performed, and to report results from the check function.
     *
     */

    /**
     * Single script confusable test. When testing whether two identifiers are confusable, report that they are if both
     * are from the same script and they are visually confusable. Note: this test is not applicable to a check of a
     * single identifier.
     *
     * @stable ICU 4.6
     */
    public static final int SINGLE_SCRIPT_CONFUSABLE = 1;

    /**
     * Mixed script confusable test.
     *
     * When checking a single identifier, report a problem if the identifier contains multiple scripts, and is also
     * confusable with some other identifier in a single script.
     *
     * When testing whether two identifiers are confusable, report that they are if the two IDs are visually confusable,
     * and at least one contains characters from more than one script.
     *
     * @stable ICU 4.6
     */
    public static final int MIXED_SCRIPT_CONFUSABLE = 2;

    /**
     * Whole script confusable test.
     *
     * When checking a single identifier, report a problem if the identifier is of a single script, and there exists a
     * confusable identifier in another script.
     *
     * When testing whether two identifiers are confusable, report that they are if each is of a single script, the
     * scripts of the two identifiers are different, and the identifiers are visually confusable.
     *
     * @stable ICU 4.6
     */
    public static final int WHOLE_SCRIPT_CONFUSABLE = 4;

    /**
     * Any Case Modifier for confusable identifier tests.
     *
     * When specified, consider all characters, of any case, when looking for confusables. If ANY_CASE is not specified,
     * identifiers being checked are assumed to have been case folded, and upper case confusable characters will not be
     * checked.
     *
     * @stable ICU 4.6
     */
    public static final int ANY_CASE = 8;

    /**
     * Check that an identifier contains only characters from a single script (plus chars from the common and inherited
     * scripts.) Applies to checks of a single identifier only.
     *
     * @stable ICU 4.6
     */
    public static final int SINGLE_SCRIPT = 16;

    /**
     * Check an identifier for the presence of invisible characters, such as zero-width spaces, or character sequences
     * that are likely not to display, such as multiple occurrences of the same non-spacing mark. This check does not
     * test the input string as a whole for conformance to any particular syntax for identifiers.
     *
     * @stable ICU 4.6
     */
    public static final int INVISIBLE = 32;

    /**
     * Check that an identifier contains only characters from a specified set of acceptable characters. See
     * Builder.setAllowedChars() and Builder.setAllowedLocales().
     *
     * @stable ICU 4.6
     */
    public static final int CHAR_LIMIT = 64;

    /**
     * Enable all spoof checks.
     *
     * @stable ICU 4.6
     */
    public static final int ALL_CHECKS = 0x7f;

    // Magic number for sanity checking spoof binary resource data.
    static final int MAGIC = 0x3845fdef;

    /**
     * private constructor: a SpoofChecker has to be built by the builder
     */
    private SpoofChecker() {
    }

    /**
     * SpoofChecker Builder. To create a SpoofChecker, first instantiate a SpoofChecker.Builder, set the desired
     * checking options on the builder, then call the build() function to create a SpoofChecker instance.
     *
     * @stable ICU 4.6
     */
    public static class Builder {
        int fMagic; // Internal sanity check.
        int fChecks; // Bit vector of checks to perform.
        SpoofData fSpoofData;
        UnicodeSet fAllowedCharsSet; // The UnicodeSet of allowed characters
                                     // for this Spoof Checker. Defaults to all chars.
        Set<ULocale> fAllowedLocales; // The list of allowed locales.

        /**
         * Constructor: Create a default Unicode Spoof Checker Builder, configured to perform all checks except for
         * LOCALE_LIMIT and CHAR_LIMIT.
         * Note that additional checks may be added in the future, resulting in changes
         * to the default checking behavior.
         *
         * @stable ICU 4.6
         */
        public Builder() {
            fMagic = MAGIC;
            fChecks = ALL_CHECKS;
            fSpoofData = null;
            fAllowedCharsSet = new UnicodeSet(0, 0x10ffff);
            fAllowedLocales = new LinkedHashSet<ULocale>();
        }

        /**
         * Constructor: Create a Spoof Checker Builder, and set the configuration from an existing SpoofChecker.
         *
         * @param src
         *            The existing checker.
         * @stable ICU 4.6
         */
        public Builder(SpoofChecker src) {
            fMagic = src.fMagic;
            fChecks = src.fChecks;
            fSpoofData = null;
            fAllowedCharsSet = src.fAllowedCharsSet.cloneAsThawed();
            fAllowedLocales = new LinkedHashSet<ULocale>();
            fAllowedLocales.addAll(src.fAllowedLocales);
        }

        /**
         * Create a SpoofChecker with current configuration.
         *
         * @return SpoofChecker
         * @stable ICU 4.6
         */
        public SpoofChecker build() {
            if (fSpoofData == null) { // read binary file
                try {
                    fSpoofData = SpoofData.getDefault();
                } catch (java.io.IOException e) {
                    return null;
                }
            }
            if (!SpoofData.validateDataVersion(fSpoofData.fRawData)) {
                return null;
            }
            SpoofChecker result = new SpoofChecker();
            result.fMagic = this.fMagic;
            result.fChecks = this.fChecks;
            result.fSpoofData = this.fSpoofData;
            result.fAllowedCharsSet = (UnicodeSet) (this.fAllowedCharsSet.clone());
            result.fAllowedCharsSet.freeze();
            result.fAllowedLocales = this.fAllowedLocales;
            return result;
        }

        /**
         * Specify the source form of the spoof data for this Spoof Checker. The two inputs correspond to the Unicode
         * data files confusables.txt and confusablesWholeScript.txt as described in Unicode UTS 39. The syntax of the
         * source data is as described in UTS 39 for these files, and the content of these files is acceptable input.
         *
         * @param confusables
         *            the Reader of confusable characters definitions, as found in file confusables.txt from
         *            unicode.org.
         * @param confusablesWholeScript
         *            the Reader of whole script confusables definitions, as found in the file
         *            confusablesWholeScript.txt from unicode.org.
         * @throws ParseException
         *             To report syntax errors in the input.
         * @stable ICU 4.6
         */
        public Builder setData(Reader confusables, Reader confusablesWholeScript) throws ParseException,
                java.io.IOException {

            // Set up a shell of a spoof detector, with empty data.
            fSpoofData = new SpoofData();

            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream os = new DataOutputStream(bos);

            // Compile the binary data from the source (text) format.
            ConfusabledataBuilder.buildConfusableData(fSpoofData, confusables);
            WSConfusableDataBuilder.buildWSConfusableData(fSpoofData, os, confusablesWholeScript);
            return this;
        }

        /**
         * Specify the set of checks that will be performed by the check functions of this Spoof Checker.
         *
         * @param checks
         *            The set of checks that this spoof checker will perform. The value is an 'or' of the desired
         *            checks.
         * @return self
         * @stable ICU 4.6
         */
        public Builder setChecks(int checks) {
            // Verify that the requested checks are all ones (bits) that
            // are acceptable, known values.
            if (0 != (checks & ~SpoofChecker.ALL_CHECKS)) {
                throw new IllegalArgumentException("Bad Spoof Checks value.");
            }
            this.fChecks = (checks & SpoofChecker.ALL_CHECKS);
            return this;
        }

        /**
         * Limit characters that are acceptable in identifiers being checked to those normally used with the languages
         * associated with the specified locales. Any previously specified list of locales is replaced by the new
         * settings.
         *
         * A set of languages is determined from the locale(s), and from those a set of acceptable Unicode scripts is
         * determined. Characters from this set of scripts, along with characters from the "common" and "inherited"
         * Unicode Script categories will be permitted.
         *
         * Supplying an empty set removes all restrictions; characters from any script will be allowed.
         *
         * The CHAR_LIMIT test is automatically enabled for this SpoofChecker when calling this function with a
         * non-empty list of locales.
         *
         * The Unicode Set of characters that will be allowed is accessible via the getAllowedChars() function.
         * setAllowedLocales() will replace any previously applied set of allowed characters.
         *
         * Adjustments, such as additions or deletions of certain classes of characters, can be made to the result of
         * setAllowedLocales() by fetching the resulting set with getAllowedChars(), manipulating it with the Unicode
         * Set API, then resetting the spoof detector's limits with setAllowedChars().
         *
         * @param locales
         *            A Set of ULocales, from which the language and associated script are extracted. If the locales Set
         *            is empty, no restrictions will be placed on the allowed characters.
         *
         * @return self
         * @stable ICU 4.6
         */
        public Builder setAllowedLocales(Set<ULocale> locales) {
            fAllowedCharsSet.clear();

            for (ULocale locale : locales) {
                // Add the script chars for this locale to the accumulating set
                // of allowed chars.
                addScriptChars(locale, fAllowedCharsSet);
            }

            // If our caller provided an empty list of locales, we disable the
            // allowed characters checking
            fAllowedLocales = new LinkedHashSet<ULocale>();
            if (locales.size() == 0) {
                fAllowedCharsSet.add(0, 0x10ffff);
                fChecks &= ~CHAR_LIMIT;
                return this;
            }

            // Add all common and inherited characters to the set of allowed
            // chars.
            UnicodeSet tempSet = new UnicodeSet();
            tempSet.applyIntPropertyValue(UProperty.SCRIPT, UScript.COMMON);
            fAllowedCharsSet.addAll(tempSet);
            tempSet.applyIntPropertyValue(UProperty.SCRIPT, UScript.INHERITED);
            fAllowedCharsSet.addAll(tempSet);

            // Store the updated spoof checker state.
            fAllowedLocales.addAll(locales);
            fChecks |= CHAR_LIMIT;
            return this;
        }

        // Add (union) to the UnicodeSet all of the characters for the scripts
        // used for the specified locale. Part of the implementation of
        // setAllowedLocales.
        private void addScriptChars(ULocale locale, UnicodeSet allowedChars) {
            int scripts[] = UScript.getCode(locale);
            UnicodeSet tmpSet = new UnicodeSet();
            int i;
            for (i = 0; i < scripts.length; i++) {
                tmpSet.applyIntPropertyValue(UProperty.SCRIPT, scripts[i]);
                allowedChars.addAll(tmpSet);
            }
        }

        /**
         * Limit the acceptable characters to those specified by a Unicode Set. Any previously specified character limit
         * is replaced by the new settings. This includes limits on characters that were set with the
         * setAllowedLocales() function.
         *
         * The CHAR_LIMIT test is automatically enabled for this SpoofChecker by this function.
         *
         * @param chars
         *            A Unicode Set containing the list of characters that are permitted. The incoming set is cloned by
         *            this function, so there are no restrictions on modifying or deleting the UnicodeSet after calling
         *            this function. Note that this clears the allowedLocales set.
         * @return self
         * @stable ICU 4.6
         */
        public Builder setAllowedChars(UnicodeSet chars) {
            fAllowedCharsSet = chars.cloneAsThawed();
            fAllowedLocales = new LinkedHashSet<ULocale>();
            fChecks |= CHAR_LIMIT;
            return this;
        }

        // Structure for the Whole Script Confusable Data
        // See Unicode UAX-39, Unicode Security Mechanisms, for a description of the
        // Whole Script confusable data
        //
        // The data provides mappings from code points to a set of scripts
        // that contain characters that might be confused with the code point.
        // There are two mappings, one for lower case only, and one for characters
        // of any case.
        //
        // The actual data consists of a utrie2 to map from a code point to an offset,
        // and an array of UScriptSets (essentially bit maps) that is indexed
        // by the offsets obtained from the Trie.
        //
        //

        /*
         * Internal functions for compiling Whole Script confusable source data into its binary (runtime) form.
         * The binary data format is described in uspoof_impl.h
         */
        private static class WSConfusableDataBuilder {

            // Regular expression for parsing a line from the Unicode file
            // confusablesWholeScript.txt
            // Example Lines:
            // 006F ; Latn; Deva; A # (o) LATIN SMALL LETTER O
            // 0048..0049 ; Latn; Grek; A # [2] (H..I) LATIN CAPITAL LETTER H..LATIN CAPITAL LETTER I
            // |            |     |     |
            // |            |     |     |---- Which table, Any Case or Lower Case (A or L)
            // |            |     |---------- Target script. We need this.
            // |            |---------------- Src script. Should match the script of the source
            // |                              code points. Beyond checking that, we don't keep it.
            // |----------------------------- Source code points or range.
            //
            // The expression will match _all_ lines, including erroneous lines.
            // The result of the parse is returned via the contents of the (match)
            // groups.
            static String parseExp =
                    "(?m)" + // Multi-line mode
                    "^([ \\t]*(?:#.*?)?)$" + // A blank or comment line. Matches Group 1.
                    "|^(?:" + // OR
                    "\\s*([0-9A-F]{4,})(?:..([0-9A-F]{4,}))?\\s*;" + // Code point range. Groups 2 and 3.
                    "\\s*([A-Za-z]+)\\s*;" + // The source script. Group 4.
                    "\\s*([A-Za-z]+)\\s*;" + // The target script. Group 5.
                    "\\s*(?:(A)|(L))" + // The table A or L. Group 6 or 7
                    "[ \\t]*(?:#.*?)?" + // Trailing comment
                    ")$|" + // OR
                    "^(.*?)$"; // An error line. Group 8.
                               // Any line not matching the preceding
                               // parts of the expression will match
                               // this, and thus be flagged as an error

            // Extract a regular expression match group into a char * string.
            // The group must contain only invariant characters.
            // Used for script names
            //
            static void readWholeFileToString(Reader reader, StringBuffer buffer) throws java.io.IOException {
                // Convert the user input data from UTF-8 to char (UTF-16)
                LineNumberReader lnr = new LineNumberReader(reader);
                do {
                    String line = lnr.readLine();
                    if (line == null) {
                        break;
                    }
                    buffer.append(line);
                    buffer.append('\n');
                } while (true);
            }

            // Build the Whole Script Confusable data
            //
            static void buildWSConfusableData(SpoofData fSpoofData, DataOutputStream os, Reader confusablesWS)
                    throws ParseException, java.io.IOException {
                Pattern parseRegexp = null;
                StringBuffer input = new StringBuffer();
                int lineNum = 0;

                Vector<BuilderScriptSet> scriptSets = null;
                int rtScriptSetsCount = 2;

                Trie2Writable anyCaseTrie = new Trie2Writable(0, 0);
                Trie2Writable lowerCaseTrie = new Trie2Writable(0, 0);

                // The scriptSets vector provides a mapping from TRIE values to the set
                // of scripts.
                //
                // Reserved TRIE values:
                // 0: Code point has no whole script confusables.
                // 1: Code point is of script Common or Inherited.
                // These code points do not participate in whole script confusable
                // detection.
                // (This is logically equivalent to saying that they contain confusables
                // in all scripts)
                //
                // Because Trie values are indexes into the ScriptSets vector, pre-fill
                // vector positions 0 and 1 to avoid conflicts with the reserved values.
                scriptSets = new Vector<BuilderScriptSet>();
                scriptSets.addElement(null);
                scriptSets.addElement(null);

                readWholeFileToString(confusablesWS, input);

                parseRegexp = Pattern.compile(parseExp);

                // Zap any Byte Order Mark at the start of input. Changing it to a space
                // is benign given the syntax of the input.
                if (input.charAt(0) == 0xfeff) {
                    input.setCharAt(0, (char) 0x20);
                }

                // Parse the input, one line per iteration of this loop.
                Matcher matcher = parseRegexp.matcher(input);
                while (matcher.find()) {
                    lineNum++;
                    if (matcher.start(1) >= 0) {
                        // this was a blank or comment line.
                        continue;
                    }
                    if (matcher.start(8) >= 0) {
                        // input file syntax error.
                        throw new ParseException("ConfusablesWholeScript, line " + lineNum + ": Unrecognized input: "
                                + matcher.group(), matcher.start());
                    }

                    // Pick up the start and optional range end code points from the
                    // parsed line.
                    int startCodePoint = Integer.parseInt(matcher.group(2), 16);
                    if (startCodePoint > 0x10ffff) {
                        throw new ParseException("ConfusablesWholeScript, line " + lineNum
                                + ": out of range code point: " + matcher.group(2), matcher.start(2));
                    }
                    int endCodePoint = startCodePoint;
                    if (matcher.start(3) >= 0) {
                        endCodePoint = Integer.parseInt(matcher.group(3), 16);
                    }
                    if (endCodePoint > 0x10ffff) {
                        throw new ParseException("ConfusablesWholeScript, line " + lineNum
                                + ": out of range code point: " + matcher.group(3), matcher.start(3));
                    }

                    // Extract the two script names from the source line.
                    String srcScriptName = matcher.group(4);
                    String targScriptName = matcher.group(5);
                    int srcScript = UCharacter.getPropertyValueEnum(UProperty.SCRIPT, srcScriptName);
                    int targScript = UCharacter.getPropertyValueEnum(UProperty.SCRIPT, targScriptName);
                    if (srcScript == UScript.INVALID_CODE) {
                        throw new ParseException("ConfusablesWholeScript, line " + lineNum
                                + ": Invalid script code: " + matcher.group(4), matcher.start(4));
                    }
                    if (targScript == UScript.INVALID_CODE) {
                        throw new ParseException("ConfusablesWholeScript, line " + lineNum
                                + ": Invalid script code: " + matcher.group(5), matcher.start(5));
                    }

                    // select the table - (A) any case or (L) lower case only
                    Trie2Writable table = anyCaseTrie;
                    if (matcher.start(7) >= 0) {
                        table = lowerCaseTrie;
                    }

                    // Build the set of scripts containing confusable characters for
                    // the code point(s) specified in this input line.
                    // Sanity check that the script of the source code point is the same
                    // as the source script indicated in the input file. Failure of this
                    // check is an error in the input file.
                    //
                    // Include the source script in the set (needed for Mixed Script
                    // Confusable detection).
                    //
                    int cp;
                    for (cp = startCodePoint; cp <= endCodePoint; cp++) {
                        int setIndex = table.get(cp);
                        BuilderScriptSet bsset = null;
                        if (setIndex > 0) {
                            assert (setIndex < scriptSets.size());
                            bsset = scriptSets.elementAt(setIndex);
                        } else {
                            bsset = new BuilderScriptSet();
                            bsset.codePoint = cp;
                            bsset.trie = table;
                            bsset.sset = new ScriptSet();
                            setIndex = scriptSets.size();
                            bsset.index = setIndex;
                            bsset.rindex = 0;
                            scriptSets.addElement(bsset);
                            table.set(cp, setIndex);
                        }
                        bsset.sset.Union(targScript);
                        bsset.sset.Union(srcScript);

                        int cpScript = UScript.getScript(cp);
                        if (cpScript != srcScript) {
                            // status = U_INVALID_FORMAT_ERROR;
                            throw new ParseException("ConfusablesWholeScript, line " + lineNum
                                    + ": Mismatch between source script and code point " + Integer.toString(cp, 16),
                                    matcher.start(5));
                        }
                    }
                }

                // Eliminate duplicate script sets. At this point we have a separate
                // script set for every code point that had data in the input file.
                //
                // We eliminate underlying ScriptSet objects, not the BuildScriptSets
                // that wrap them
                //
                // printf("Number of scriptSets: %d\n", scriptSets.size());
                {
                    //int duplicateCount = 0;
                    rtScriptSetsCount = 2;
                    for (int outeri = 2; outeri < scriptSets.size(); outeri++) {
                        BuilderScriptSet outerSet = scriptSets.elementAt(outeri);
                        if (outerSet.index != outeri) {
                            // This set was already identified as a duplicate.
                            // It will not be allocated a position in the runtime array
                            // of ScriptSets.
                            continue;
                        }
                        outerSet.rindex = rtScriptSetsCount++;
                        for (int inneri = outeri + 1; inneri < scriptSets.size(); inneri++) {
                            BuilderScriptSet innerSet = scriptSets.elementAt(inneri);
                            if (outerSet.sset.equals(innerSet.sset) && outerSet.sset != innerSet.sset) {
                                innerSet.sset = outerSet.sset;
                                innerSet.index = outeri;
                                innerSet.rindex = outerSet.rindex;
                                //duplicateCount++;
                            }
                            // But this doesn't get all. We need to fix the TRIE.
                        }
                    }
                    // printf("Number of distinct script sets: %d\n",
                    // rtScriptSetsCount);
                }

                // Update the Trie values to reflect the run time script indexes
                // (after duplicate merging).
                // (Trie Values 0 and 1 are reserved, and the corresponding slots in
                // scriptSets are unused, which is why the loop index starts at 2.)
                {
                    for (int i = 2; i < scriptSets.size(); i++) {
                        BuilderScriptSet bSet = scriptSets.elementAt(i);
                        if (bSet.rindex != i) {
                            bSet.trie.set(bSet.codePoint, bSet.rindex);
                        }
                    }
                }

                // For code points with script==Common or script==Inherited,
                // Set the reserved value of 1 into both Tries. These characters do not
                // participate in Whole Script Confusable detection; this reserved value
                // is the means by which they are detected.
                {
                    UnicodeSet ignoreSet = new UnicodeSet();
                    ignoreSet.applyIntPropertyValue(UProperty.SCRIPT, UScript.COMMON);
                    UnicodeSet inheritedSet = new UnicodeSet();
                    inheritedSet.applyIntPropertyValue(UProperty.SCRIPT, UScript.INHERITED);
                    ignoreSet.addAll(inheritedSet);
                    for (int rn = 0; rn < ignoreSet.getRangeCount(); rn++) {
                        int rangeStart = ignoreSet.getRangeStart(rn);
                        int rangeEnd = ignoreSet.getRangeEnd(rn);
                        anyCaseTrie.setRange(rangeStart, rangeEnd, 1, true);
                        lowerCaseTrie.setRange(rangeStart, rangeEnd, 1, true);
                    }
                }

                // Serialize the data to the Spoof Detector
                {
                    anyCaseTrie.toTrie2_16().serialize(os);
                    lowerCaseTrie.toTrie2_16().serialize(os);

                    fSpoofData.fRawData.fScriptSetsLength = rtScriptSetsCount;
                    int rindex = 2;
                    for (int i = 2; i < scriptSets.size(); i++) {
                        BuilderScriptSet bSet = scriptSets.elementAt(i);
                        if (bSet.rindex < rindex) {
                            // We have already copied this script set to the serialized
                            // data.
                            continue;
                        }
                        assert (rindex == bSet.rindex);
                        bSet.sset.output(os);
                        rindex++;
                    }
                }
            }

            // class BuilderScriptSet. Represents the set of scripts (Script Codes)
            // containing characters that are confusable with one specific
            // code point.
            private static class BuilderScriptSet {
                int codePoint; // The source code point.
                Trie2Writable trie; // Any-case or Lower-case Trie.
                                    // These Trie tables are the final result of the
                                    // build. This flag indicates which of the two
                                    // this set of data is for.
                ScriptSet sset; // The set of scripts itself.

                // Vectors of all B
                int index; // Index of this set in the Build Time vector
                           // of script sets.
                int rindex; // Index of this set in the final (runtime)
                            // array of sets.

                // its underlying sset.

                BuilderScriptSet() {
                    codePoint = -1;
                    trie = null;
                    sset = null;
                    index = 0;
                    rindex = 0;
                }
            }
        }

        /*
         * *****************************************************************************
         * Internal classes for compiling confusable data into its binary (runtime) form.
         * *****************************************************************************
         */
        // ---------------------------------------------------------------------
        //
        // buildConfusableData Compile the source confusable data, as defined by
        // the Unicode data file confusables.txt, into the binary
        // structures used by the confusable detector.
        //
        // The binary structures are described in uspoof_impl.h
        //
        // 1. parse the data, building 4 hash tables, one each for the SL, SA, ML and MA
        // tables. Each maps from an int to a String.
        //
        // 2. Sort all of the strings encountered by length, since they will need to
        // be stored in that order in the final string table.
        //
        // 3. Build a list of keys (UChar32s) from the four mapping tables. Sort the
        // list because that will be the ordering of our runtime table.
        //
        // 4. Generate the run time string table. This is generated before the key & value
        // tables because we need the string indexes when building those tables.
        //
        // 5. Build the run-time key and value tables. These are parallel tables, and
        // are built at the same time
        //

        // class ConfusabledataBuilder
        // An instance of this class exists while the confusable data is being built
        // from source.
        // It encapsulates the intermediate data structures that are used for building.
        // It exports one static function, to do a confusable data build.
        private static class ConfusabledataBuilder {

            private SpoofData fSpoofData;
            private ByteArrayOutputStream bos;
            private DataOutputStream os;
            private Hashtable<Integer, SPUString> fSLTable;
            private Hashtable<Integer, SPUString> fSATable;
            private Hashtable<Integer, SPUString> fMLTable;
            private Hashtable<Integer, SPUString> fMATable;
            private UnicodeSet fKeySet; // A set of all keys (UChar32s) that go into the
                                        // four mapping tables.

            // The binary data is first assembled into the following four collections,
            // then output to the DataOutputStream os.
            private StringBuffer fStringTable;
            private Vector<Integer> fKeyVec;
            private Vector<Integer> fValueVec;
            private Vector<Integer> fStringLengthsTable;
            private SPUStringPool stringPool;
            private Pattern fParseLine;
            private Pattern fParseHexNum;
            private int fLineNum;

            ConfusabledataBuilder(SpoofData spData, ByteArrayOutputStream bos) {
                this.bos = bos;
                this.os = new DataOutputStream(bos);
                fSpoofData = spData;
                fSLTable = new Hashtable<Integer, SPUString>();
                fSATable = new Hashtable<Integer, SPUString>();
                fMLTable = new Hashtable<Integer, SPUString>();
                fMATable = new Hashtable<Integer, SPUString>();
                fKeySet = new UnicodeSet();
                fKeyVec = new Vector<Integer>();
                fValueVec = new Vector<Integer>();
                stringPool = new SPUStringPool();
            }

            void build(Reader confusables) throws ParseException, java.io.IOException {
                StringBuffer fInput = new StringBuffer();
                WSConfusableDataBuilder.readWholeFileToString(confusables, fInput);

                // Regular Expression to parse a line from Confusables.txt. The expression will match
                // any line. What was matched is determined by examining which capture groups have a match.
                // Capture Group 1: the source char
                // Capture Group 2: the replacement chars
                // Capture Group 3-6 the table type, SL, SA, ML, or MA
                // Capture Group 7: A blank or comment only line.
                // Capture Group 8: A syntactically invalid line. Anything that didn't match before.
                // Example Line from the confusables.txt source file:
                // "1D702 ; 006E 0329 ; SL # MATHEMATICAL ITALIC SMALL ETA ... "
                fParseLine = Pattern.compile("(?m)^[ \\t]*([0-9A-Fa-f]+)[ \\t]+;" + // Match the source char
                        "[ \\t]*([0-9A-Fa-f]+" + // Match the replacement char(s)
                        "(?:[ \\t]+[0-9A-Fa-f]+)*)[ \\t]*;" + // (continued)
                        "\\s*(?:(SL)|(SA)|(ML)|(MA))" + // Match the table type
                        "[ \\t]*(?:#.*?)?$" + // Match any trailing #comment
                        "|^([ \\t]*(?:#.*?)?)$" + // OR match empty lines or lines with only a #comment
                        "|^(.*?)$"); // OR match any line, which catches illegal lines.

                // Regular expression for parsing a hex number out of a space-separated list of them.
                // Capture group 1 gets the number, with spaces removed.
                fParseHexNum = Pattern.compile("\\s*([0-9A-F]+)");

                // Zap any Byte Order Mark at the start of input. Changing it to a space
                // is benign given the syntax of the input.
                if (fInput.charAt(0) == 0xfeff) {
                    fInput.setCharAt(0, (char) 0x20);
                }

                // Parse the input, one line per iteration of this loop.
                Matcher matcher = fParseLine.matcher(fInput);
                while (matcher.find()) {
                    fLineNum++;
                    if (matcher.start(7) >= 0) {
                        // this was a blank or comment line.
                        continue;
                    }
                    if (matcher.start(8) >= 0) {
                        // input file syntax error.
                        // status = U_PARSE_ERROR;
                        throw new ParseException("Confusables, line " + fLineNum + ": Unrecognized Line: "
                                + matcher.group(8), matcher.start(8));
                    }

                    // We have a good input line. Extract the key character and mapping
                    // string, and put them into the appropriate mapping table.
                    int keyChar = Integer.parseInt(matcher.group(1), 16);
                    if (keyChar > 0x10ffff) {
                        throw new ParseException("Confusables, line " + fLineNum + ": Bad code point: "
                                + matcher.group(1), matcher.start(1));
                    }
                    Matcher m = fParseHexNum.matcher(matcher.group(2));

                    StringBuilder mapString = new StringBuilder();
                    while (m.find()) {
                        int c = Integer.parseInt(m.group(1), 16);
                        // Range-check the replacement code point c itself, not the key character.
                        if (c > 0x10ffff) {
                            throw new ParseException("Confusables, line " + fLineNum + ": Bad code point: "
                                    + Integer.toString(c, 16), matcher.start(2));
                        }
                        mapString.appendCodePoint(c);
                    }
                    assert (mapString.length() >= 1);

                    // Put the map (value) string into the string pool.
                    // This is a little like a Java intern() - any duplicates will be eliminated.
                    SPUString smapString = stringPool.addString(mapString.toString());

                    // Add the char -> string mapping to the appropriate table.
                    Hashtable<Integer, SPUString> table = matcher.start(3) >= 0 ? fSLTable
                            : matcher.start(4) >= 0 ? fSATable
                            : matcher.start(5) >= 0 ? fMLTable
                            : matcher.start(6) >= 0 ? fMATable
                            : null;
                    assert (table != null);
                    table.put(keyChar, smapString);
                    fKeySet.add(keyChar);
                }

                // Input data is now all parsed and collected.
                // Now create the run-time binary form of the data.
                //
                // This is done in two steps. First the data is assembled into vectors and strings,
                // for ease of construction, then the contents of these collections are dumped
                // into the actual raw-bytes data storage.

                // Build up the string array, and record the index of each string therein
                // in the (build time only) string pool.
                // Strings of length one are not entered into the strings array.
                // At the same time, build up the string lengths table, which records the
                // position in the string table of the last string of each length >= 4.
                // (Strings in the table are sorted by length)
                stringPool.sort();
                fStringTable = new StringBuffer();
                fStringLengthsTable = new Vector<Integer>();
                int previousStringLength = 0;
                int previousStringIndex = 0;
                int poolSize = stringPool.size();
                int i;
                for (i = 0; i < poolSize; i++) {
                    SPUString s = stringPool.getByIndex(i);
                    int strLen = s.fStr.length();
                    int strIndex = fStringTable.length();
                    assert (strLen >= previousStringLength);
                    if (strLen == 1) {
                        // Strings of length one do not get an entry in the string table.
                        // Keep the single string character itself here, which is the same
                        // convention that is used in the final run-time string table index.
                        s.fStrTableIndex = s.fStr.charAt(0);
                    } else {
                        if ((strLen > previousStringLength) && (previousStringLength >= 4)) {
                            fStringLengthsTable.addElement(previousStringIndex);
                            fStringLengthsTable.addElement(previousStringLength);
                        }
                        s.fStrTableIndex = strIndex;
                        fStringTable.append(s.fStr);
                    }
                    previousStringLength = strLen;
                    previousStringIndex = strIndex;
                }
                // Make the final entry to the string lengths table.
                // (It holds an entry for the _last_ string of each length, so adding the
                // final one doesn't happen in the main loop because no longer string
                // was encountered.)
                if (previousStringLength >= 4) {
                    fStringLengthsTable.addElement(previousStringIndex);
                    fStringLengthsTable.addElement(previousStringLength);
                }

                // Construct the compile-time Key and Value tables
                //
                // For each key code point, check which mapping tables it applies to,
                // and create the final data for the key & value structures.
                //
                // The four logical mapping tables are conflated into one combined table.
                // If multiple logical tables have the same mapping for some key, they
                // share a single entry in the combined table.
                // If more than one mapping exists for the same key code point, multiple
                // entries will be created in the table.
                for (int range = 0; range < fKeySet.getRangeCount(); range++) {
                    // It is an oddity of the UnicodeSet API that simply enumerating the contained
                    // code points requires a nested loop.
                    for (int keyChar = fKeySet.getRangeStart(range); keyChar <= fKeySet.getRangeEnd(range); keyChar++) {
                        addKeyEntry(keyChar, fSLTable, SpoofChecker.SL_TABLE_FLAG);
                        addKeyEntry(keyChar, fSATable, SpoofChecker.SA_TABLE_FLAG);
                        addKeyEntry(keyChar, fMLTable, SpoofChecker.ML_TABLE_FLAG);
                        addKeyEntry(keyChar, fMATable, SpoofChecker.MA_TABLE_FLAG);
                    }
                }

                // Put the assembled data into the flat runtime array
                outputData();

                // All of the intermediate allocated data belongs to the
                // ConfusabledataBuilder object (this), and is deleted by Java GC.
            }

            // Add an entry to the key and value tables being built.
            //   input:  data from SLTable, MATable, etc.
            //   output: entry added to fKeyVec and fValueVec
            //
            // addKeyEntry: construction of the confusable Key and Mapping Values tables.
            // This is an intermediate point in the building process.
            // We already have the mappings in the hash tables fSLTable, etc.
            // This function builds corresponding run-time style table entries into
            // fKeyVec and fValueVec.
            void addKeyEntry(int keyChar, // The key character
                    Hashtable<Integer, SPUString> table, // The table, one of SATable, MATable, etc.
                    int tableFlag) { // One of SA_TABLE_FLAG, etc.
                SPUString targetMapping = table.get(keyChar);
                if (targetMapping == null) {
                    // No mapping for this key character.
                    // (This function is called for all four tables for each key char that
                    // is seen anywhere, so these no-entry cases are very much expected.)
                    return;
                }

                // Check whether there is already an entry with the correct mapping.
                // If so, simply set the flag in the keyTable saying that the existing entry
                // applies to the table that we're doing now.
                boolean keyHasMultipleValues = false;
                int i;
                for (i = fKeyVec.size() - 1; i >= 0; i--) {
                    int key = fKeyVec.elementAt(i);
                    if ((key & 0x00ffffff) != keyChar) {
                        // We have now checked all existing key entries for this key char (if any)
                        // without finding one with the same mapping.
                        break;
                    }
                    String mapping = getMapping(i);
                    if (mapping.equals(targetMapping.fStr)) {
                        // The run time entry we are currently testing has the correct mapping.
                        // Set the flag in it indicating that it applies to the new table also.
                        key |= tableFlag;
                        fKeyVec.setElementAt(key, i);
                        return;
                    }
                    keyHasMultipleValues = true;
                }

                // Need to add a new entry to the binary data being built for this mapping.
                // Includes adding entries to both the key table and the parallel values table.
                int newKey = keyChar | tableFlag;
                if (keyHasMultipleValues) {
                    newKey |= SpoofChecker.KEY_MULTIPLE_VALUES;
                }
                int adjustedMappingLength = targetMapping.fStr.length() - 1;
                if (adjustedMappingLength > 3) {
                    adjustedMappingLength = 3;
                }
                newKey |= adjustedMappingLength << SpoofChecker.KEY_LENGTH_SHIFT;

                int newData = targetMapping.fStrTableIndex;

                fKeyVec.addElement(newKey);
                fValueVec.addElement(newData);

                // If the preceding key entry is for the same key character (but with a
                // different mapping) set the multiple-values flag on it.
                if (keyHasMultipleValues) {
                    int previousKeyIndex = fKeyVec.size() - 2;
                    int previousKey = fKeyVec.elementAt(previousKeyIndex);
                    previousKey |= SpoofChecker.KEY_MULTIPLE_VALUES;
                    fKeyVec.setElementAt(previousKey, previousKeyIndex);
                }
            }

            // From an index into fKeyVec & fValueVec
            // get a String with the corresponding mapping.
            String getMapping(int index) {
                int key = fKeyVec.elementAt(index);
                int value = fValueVec.elementAt(index);
                int length = SpoofChecker.getKeyLength(key);
                int lastIndexWithLen;
                switch (length) {
                case 0:
                    char[] cs = { (char) value };
                    return new String(cs);
                case 1:
                case 2:
                    // Note: +1 because the key stores (string length - 1).
                    return fStringTable.substring(value, value + length + 1);
                case 3:
                    length = 0;
                    int i;
                    for (i = 0; i < fStringLengthsTable.size(); i += 2) {
                        lastIndexWithLen = fStringLengthsTable.elementAt(i);
                        if (value <= lastIndexWithLen) {
                            length = fStringLengthsTable.elementAt(i + 1);
                            break;
                        }
                    }
                    assert (length >= 3);
                    return fStringTable.substring(value, value + length);
                default:
                    assert (false);
                }
                return "";
            }

            // Populate the final binary output data array with the compiled data.
            // The confusable data has been compiled and stored in intermediate
            // collections and strings. Copy it from there to the final flat
            // binary array.
            void outputData() throws java.io.IOException {
                SpoofDataHeader rawData = fSpoofData.fRawData;
                // The Key Table
                // While copying the keys to the runtime array,
                // also sanity check that they are sorted.
                int numKeys = fKeyVec.size();
                int i;
                int previousKey = 0;
                rawData.output(os);
                rawData.fCFUKeys = os.size();
                assert (rawData.fCFUKeys == 128);
                rawData.fCFUKeysSize = numKeys;
                for (i = 0; i < numKeys; i++) {
                    int key = fKeyVec.elementAt(i);
                    assert ((key & 0x00ffffff) >= (previousKey & 0x00ffffff));
                    assert ((key & 0xff000000) != 0);
                    os.writeInt(key);
                    previousKey = key;
                }

                // The Value Table, parallels the key table
                int numValues = fValueVec.size();
                assert (numKeys == numValues);
                rawData.fCFUStringIndex = os.size();
                rawData.fCFUStringIndexSize = numValues;
                for (i = 0; i < numValues; i++) {
                    int value = fValueVec.elementAt(i);
                    assert (value < 0xffff);
                    os.writeShort((short) value);
                }

                // The Strings Table.
                int stringsLength = fStringTable.length();
                // Reserve an extra space so the string will be nul-terminated.
                // This is only a convenience, for when debugging; it is not needed otherwise.
                String strings = fStringTable.toString();
                rawData.fCFUStringTable = os.size();
                rawData.fCFUStringTableLen = stringsLength;
                for (i = 0; i < stringsLength; i++) {
                    os.writeChar(strings.charAt(i));
                }

                // The String Lengths Table
                // While copying into the runtime array do some sanity checks on the values.
                // Each complete entry contains two fields, an offset and a length.
                // Lengths should increase with each entry.
                // Offsets should be less than the size of the string table.
                int lengthTableLength = fStringLengthsTable.size();
                int previousLength = 0;
                // Note: StringLengthsSize in the raw data is the number of complete entries,
                // each consisting of a pair of 16 bit values, hence the divide by 2.
                rawData.fCFUStringLengthsSize = lengthTableLength / 2;
                rawData.fCFUStringLengths = os.size();
                for (i = 0; i < lengthTableLength; i += 2) {
                    int offset = fStringLengthsTable.elementAt(i);
                    int length = fStringLengthsTable.elementAt(i + 1);
                    assert (offset < stringsLength);
                    assert (length < 40);
                    assert (length > previousLength);
                    os.writeShort((short) offset);
                    os.writeShort((short) length);
                    previousLength = length;
                }

                os.flush();
                DataInputStream is = new DataInputStream(new ByteArrayInputStream(bos.toByteArray()));
                is.mark(Integer.MAX_VALUE);
                fSpoofData.initPtrs(is);
            }

            public static void buildConfusableData(SpoofData spData, Reader confusables)
                    throws java.io.IOException, ParseException {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                ConfusabledataBuilder builder = new ConfusabledataBuilder(spData, bos);
                builder.build(confusables);
            }

            /*
             * *****************************************************************************
             * Internal classes for compiling confusable data into its binary (runtime) form.
             * *****************************************************************************
             */

            // SPUString
            // Holds a string that is the result of one of the mappings defined
            // by the confusable mapping data (confusables.txt from Unicode.org)
            // Instances of SPUString exist during the compilation process only.
            private static class SPUString {
                String fStr; // The actual string.
                int fStrTableIndex; // Index into the final runtime data for this string.
                                    // (or, for length 1, the single string char itself,
                                    // there being no string table entry for it.)

                SPUString(String s) {
                    fStr = s;
                    fStrTableIndex = 0;
                }
            }

            // Comparison function for ordering strings in the string pool.
            // Compare by length first, then, within a group of the same length,
            // by code point order.
            // Conforms to the type signature for a USortComparator in uvector.h
            private static class SPUStringComparator implements Comparator<SPUString> {
                public int compare(SPUString sL, SPUString sR) {
                    int lenL = sL.fStr.length();
                    int lenR = sR.fStr.length();
                    if (lenL < lenR) {
                        return -1;
                    } else if (lenL > lenR) {
                        return 1;
                    } else {
                        return sL.fStr.compareTo(sR.fStr);
                    }
                }
            }

            // String Pool: a utility class for holding the strings that are the result of
            // the spoof mappings. These strings will ultimately end up in the
            // run-time String Table.
            // This is sort of like a sorted set of strings, except that ICU's anemic
            // built-in collections don't support those, so it is implemented with a
            // combination of a Hashtable and a Vector.
            private static class SPUStringPool {
                public SPUStringPool() {
                    fVec = new Vector<SPUString>();
                    fHash = new Hashtable<String, SPUString>();
                }

                public int size() {
                    return fVec.size();
                }

                // Get the n-th string in the collection.
                public SPUString getByIndex(int index) {
                    SPUString retString = fVec.elementAt(index);
                    return retString;
                }

                // Add a string. Return the string from the table.
                // If the input parameter string is already in the table, delete the
                // input parameter and return the existing string.
                public SPUString addString(String src) {
                    SPUString hashedString = fHash.get(src);
                    if (hashedString == null) {
                        hashedString = new SPUString(src);
                        fHash.put(src, hashedString);
                        fVec.addElement(hashedString);
                    }
                    return hashedString;
                }

                // Sort the contents; affects the ordering of getByIndex().
                public void sort() {
                    Collections.sort(fVec, new SPUStringComparator());
                }

                private Vector<SPUString> fVec; // Elements are SPUString
                private Hashtable<String, SPUString> fHash; // Key: String, Value: SPUString
            }
        }
    }

    /**
     * Get the set of checks that this Spoof Checker has been configured to perform.
     *
     * @return The set of checks that this spoof checker will perform.
     * @stable ICU 4.6
     */
    public int getChecks() {
        return fChecks;
    }

    /**
     * Get a list of locales for the scripts that are acceptable in strings to be checked. If no limitations on scripts
     * have been specified, an empty set will be returned.
     *
     * setAllowedChars() will reset the list of allowed locales to be empty.
     *
     * The returned set may not be identical to the originally specified set that is supplied to setAllowedLocales();
     * the information other than languages from the originally specified locales may be omitted.
     *
     * @return A set of locales corresponding to the acceptable scripts.
     *
     * @stable ICU 4.6
     */
    public Set<ULocale> getAllowedLocales() {
        return fAllowedLocales;
    }

    /**
     * Get a UnicodeSet for the characters permitted in an identifier. This corresponds to the limits imposed by the Set
     * Allowed Characters functions. Limitations imposed by other checks will not be reflected in the set returned by
     * this function.
     *
     * The returned set will be frozen, meaning that it cannot be modified by the caller.
     *
     * @return A UnicodeSet containing the characters that are permitted by the CHAR_LIMIT test.
     * @stable ICU 4.6
     */
    public UnicodeSet getAllowedChars() {
        return fAllowedCharsSet;
    }

    /**
     * A struct-like class to hold the results of a Spoof Check operation.
     * Tells which check(s) have failed
     * and the position within the string where the failure was found.
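     *
     * A minimal sketch of how a caller might decode the checks bit mask held by this
     * class. The names CheckResultSketch, Result and failedChecks are hypothetical,
     * and the flag values are illustrative assumptions for this example only; they
     * are not part of ICU. With the real API the mask would be filled in by
     * failsChecks(text, checkResult) and tested against the SpoofChecker constants.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for reading a CheckResult-style bit mask; not ICU code.
class CheckResultSketch {
    // Illustrative flag values, assumed for this sketch only.
    // Real callers should use the constants defined on SpoofChecker.
    static final int SINGLE_SCRIPT_CONFUSABLE = 1;
    static final int MIXED_SCRIPT_CONFUSABLE = 2;
    static final int WHOLE_SCRIPT_CONFUSABLE = 4;

    // Mirrors the two public fields of CheckResult.
    static class Result {
        int checks;
        int position;
    }

    // Lists the names of the checks whose bits are set in the mask.
    static List<String> failedChecks(int checks) {
        List<String> names = new ArrayList<String>();
        if ((checks & SINGLE_SCRIPT_CONFUSABLE) != 0) {
            names.add("SINGLE_SCRIPT_CONFUSABLE");
        }
        if ((checks & MIXED_SCRIPT_CONFUSABLE) != 0) {
            names.add("MIXED_SCRIPT_CONFUSABLE");
        }
        if ((checks & WHOLE_SCRIPT_CONFUSABLE) != 0) {
            names.add("WHOLE_SCRIPT_CONFUSABLE");
        }
        return names;
    }

    public static void main(String[] args) {
        Result r = new Result();
        r.checks = MIXED_SCRIPT_CONFUSABLE | WHOLE_SCRIPT_CONFUSABLE;
        r.position = 3;
        System.out.println(failedChecks(r.checks) + " at position " + r.position);
    }
}
```

     * Because several bits can be set at once, each flag is tested separately
     * rather than comparing the whole mask against a single constant.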
     *
     * @stable ICU 4.6
     */
    public static class CheckResult {
        /**
         * Indicate which of the spoof check(s) has failed. The value is a bitwise OR
         * of the constants for the tests in question, SINGLE_SCRIPT_CONFUSABLE,
         * MIXED_SCRIPT_CONFUSABLE, WHOLE_SCRIPT_CONFUSABLE, and so on.
         *
         * @stable ICU 4.6
         */
        public int checks;

        /**
         * The index of the first string position that failed a check.
         *
         * @stable ICU 4.6
         */
        public int position;

        /**
         * Default constructor
         * @stable ICU 4.6
         */
        public CheckResult() {
            checks = 0;
            position = 0;
        }
    }

    /**
     * Check the specified string for possible security issues. The text to be checked will typically be an identifier
     * of some sort. The set of checks to be performed was specified when building the SpoofChecker.
     *
     * @param text
     *            A String to be checked for possible security issues.
     * @param checkResult
     *            Output parameter, indicates which specific tests failed.
     *            May be null if the information is not wanted.
     * @return True if any issue is found with the input string.
     * @stable ICU 4.8
     */
    public boolean failsChecks(String text, CheckResult checkResult) {
        int length = text.length();

        int result = 0;
        int failPos = Integer.MAX_VALUE;

        // A count of the number of non-Common or inherited scripts.
        // Needed for both the SINGLE_SCRIPT and the
        // WHOLE/MIXED_SCRIPT_CONFUSABLE tests.
        // Share the computation when possible. scriptCount == -1 means that we
        // haven't done it yet.
        int scriptCount = -1;

        if (0 != ((this.fChecks) & SINGLE_SCRIPT)) {
            scriptCount = this.scriptScan(text, checkResult);
            // no need to set failPos, it will be set to checkResult.position inside this.scriptScan
            // printf("scriptCount (clipped to 2) = %d\n", scriptCount);
            if (scriptCount >= 2) {
                // Note: scriptCount == 2 covers all cases of the number of scripts >= 2
                result |= SINGLE_SCRIPT;
            }
        }

        if (0 != (this.fChecks & CHAR_LIMIT)) {
            int i;
            int c;
            for (i = 0; i < length;) {
                // U16_NEXT(text, i, length, c);
                c = Character.codePointAt(text, i);
                i = Character.offsetByCodePoints(text, i, 1);
                if (!this.fAllowedCharsSet.contains(c)) {
                    result |= CHAR_LIMIT;
                    if (i < failPos) {
                        failPos = i;
                    }
                    break;
                }
            }
        }

        if (0 != (this.fChecks & (WHOLE_SCRIPT_CONFUSABLE | MIXED_SCRIPT_CONFUSABLE | INVISIBLE))) {
            // These are the checks that need to be done on NFD input
            String nfdText = Normalizer.normalize(text, Normalizer.NFD, 0);
            // The NFD form may be longer than the original text, so scan it using its own length.
            int nfdLength = nfdText.length();

            if (0 != (this.fChecks & INVISIBLE)) {

                // scan for more than one occurrence of the same non-spacing mark
                // in a sequence of non-spacing marks.
                int i;
                int c;
                int firstNonspacingMark = 0;
                boolean haveMultipleMarks = false;
                UnicodeSet marksSeenSoFar = new UnicodeSet(); // Set of combining marks in a
                                                              // single combining sequence.
                for (i = 0; i < nfdLength;) {
                    // U16_NEXT(nfdText, i, nfdLength, c);
                    c = Character.codePointAt(nfdText, i);
                    i = Character.offsetByCodePoints(nfdText, i, 1);
                    if (Character.getType(c) != UCharacterCategory.NON_SPACING_MARK) {
                        firstNonspacingMark = 0;
                        if (haveMultipleMarks) {
                            marksSeenSoFar.clear();
                            haveMultipleMarks = false;
                        }
                        continue;
                    }
                    if (firstNonspacingMark == 0) {
                        firstNonspacingMark = c;
                        continue;
                    }
                    if (!haveMultipleMarks) {
                        marksSeenSoFar.add(firstNonspacingMark);
                        haveMultipleMarks = true;
                    }
                    if (marksSeenSoFar.contains(c)) {
                        // report the error, and stop scanning.
                        // No need to find more than the first failure.
                        result |= INVISIBLE;
                        failPos = i;
                        break;
                    }
                    marksSeenSoFar.add(c);
                }
            }

            if (0 != (this.fChecks & (WHOLE_SCRIPT_CONFUSABLE | MIXED_SCRIPT_CONFUSABLE))) {
                // The basic test is the same for both whole and mixed script confusables.
                // Compute the set of scripts that every input character has a confusable in.
                // For this computation an input character is always considered to be
                // confusable with itself in its own script.
                // If the number of such scripts is two or more, and the input consisted of
                // characters all from a single script, we have a whole script confusable.
                // (The two scripts will be the original script and the one that is confusable)
                // If the number of such scripts >= one, and the original input contained
                // characters from more than one script, we have a mixed script confusable.
                // (We can transform some of the characters, and end up with a visually similar
                // string all in one script.)

                if (scriptCount == -1) {
                    scriptCount = this.scriptScan(text, null);
                }

                ScriptSet scripts = new ScriptSet();
                this.wholeScriptCheck(nfdText, scripts);
                int confusableScriptCount = scripts.countMembers();
                // printf("confusableScriptCount = %d\n", confusableScriptCount);

                if ((0 != (this.fChecks & WHOLE_SCRIPT_CONFUSABLE)) && confusableScriptCount >= 2 && scriptCount == 1) {
                    result |= WHOLE_SCRIPT_CONFUSABLE;
                }

                if ((0 != (this.fChecks & MIXED_SCRIPT_CONFUSABLE)) && confusableScriptCount >= 1 && scriptCount > 1) {
                    result |= MIXED_SCRIPT_CONFUSABLE;
                }
            }
        }
        if (checkResult != null) {
            checkResult.checks = result;
            if (failPos != Integer.MAX_VALUE) {
                checkResult.position = failPos;
            }
        }
        return (0 != result);
    }

    /**
     * Check the specified string for possible security issues. The text to be checked will typically be an identifier
     * of some sort. The set of checks to be performed was specified when building the SpoofChecker.
     *
     * @param text
     *            A String to be checked for possible security issues.
     * @return True if any issue is found with the input string.
     * @stable ICU 4.8
     */
    public boolean failsChecks(String text) {
        return failsChecks(text, null);
    }

    /**
     * Check whether two specified strings are visually confusable. The types of confusability to be tested - single
     * script, mixed script, or whole script - are determined by the check options set for the SpoofChecker.
     *
     * The tests to be performed are controlled by the flags SINGLE_SCRIPT_CONFUSABLE, MIXED_SCRIPT_CONFUSABLE and
     * WHOLE_SCRIPT_CONFUSABLE. At least one of these tests must be selected.
     *
     * ANY_CASE is a modifier for the tests. Select it if the identifiers may be of mixed case. If identifiers are case
     * folded for comparison and display to the user, do not select the ANY_CASE option.
     *
     * @param s1
     *            The first of the two strings to be compared for confusability.
     * @param s2
     *            The second of the two strings to be compared for confusability.
     * @return Non-zero if s1 and s2 are confusable. If not 0, the value will indicate the type(s) of confusability
     *         found, as defined by spoof check test constants.
     * @stable ICU 4.6
     */
    public int areConfusable(String s1, String s2) {
        //
        // See section 4 of UTS #39 for the algorithm for checking whether two strings are confusable,
        // and for definitions of the types (single, whole, mixed-script) of confusables.

        // We only care about a few of the check flags. Ignore the others.
        // If no tests relevant to this function have been specified, signal an error.
        // TODO: is this really the right thing to do? It's probably an error on
        // the caller's part, but logically we would just return 0 (no error).
        if ((this.fChecks & (SINGLE_SCRIPT_CONFUSABLE | MIXED_SCRIPT_CONFUSABLE | WHOLE_SCRIPT_CONFUSABLE)) == 0) {
            throw new IllegalArgumentException("No confusable checks are enabled.");
        }
        int flagsForSkeleton = this.fChecks & ANY_CASE;
        String s1Skeleton;
        String s2Skeleton;

        int result = 0;
        int s1ScriptCount = this.scriptScan(s1, null);
        int s2ScriptCount = this.scriptScan(s2, null);

        if (0 != (this.fChecks & SINGLE_SCRIPT_CONFUSABLE)) {
            // Do the Single Script compare.
            if (s1ScriptCount <= 1 && s2ScriptCount <= 1) {
                flagsForSkeleton |= SINGLE_SCRIPT_CONFUSABLE;
                s1Skeleton = getSkeleton(flagsForSkeleton, s1);
                s2Skeleton = getSkeleton(flagsForSkeleton, s2);
                if (s1Skeleton.length() == s2Skeleton.length() && s1Skeleton.equals(s2Skeleton)) {
                    result |= SINGLE_SCRIPT_CONFUSABLE;
                }
            }
        }

        if (0 != (result & SINGLE_SCRIPT_CONFUSABLE)) {
            // If the two inputs are single script confusable they cannot also be
            // mixed or whole script confusable, according to the UTS #39 definitions.
            // So we can skip those tests.
            return result;
        }

        // Optimization for whole script confusables test: two identifiers are whole script confusable
        // if each is of a single script and they are mixed script confusable.
        boolean possiblyWholeScriptConfusables = s1ScriptCount <= 1 && s2ScriptCount <= 1
                && (0 != (this.fChecks & WHOLE_SCRIPT_CONFUSABLE));

        // Mixed Script Check
        if ((0 != (this.fChecks & MIXED_SCRIPT_CONFUSABLE)) || possiblyWholeScriptConfusables) {
            // For getSkeleton(), resetting the SINGLE_SCRIPT_CONFUSABLE flag will get us
            // the mixed script table skeleton, which is what we want.
            // The Any Case / Lower Case bit in the skeleton flags was set at the
            // top of the function.
            flagsForSkeleton &= ~SINGLE_SCRIPT_CONFUSABLE;
            s1Skeleton = getSkeleton(flagsForSkeleton, s1);
            s2Skeleton = getSkeleton(flagsForSkeleton, s2);
            if (s1Skeleton.length() == s2Skeleton.length() && s1Skeleton.equals(s2Skeleton)) {
                result |= MIXED_SCRIPT_CONFUSABLE;
                if (possiblyWholeScriptConfusables) {
                    result |= WHOLE_SCRIPT_CONFUSABLE;
                }
            }
        }
        return result;
    }

    /**
     * Get the "skeleton" for an identifier string. Skeletons are a transformation of the input string; two strings are
     * confusable if their skeletons are identical. See Unicode UTS #39 for additional information.
     *
     * Using skeletons directly makes it possible to quickly check whether an identifier is confusable with any of some
     * large set of existing identifiers, by creating an efficiently searchable collection of the skeletons.
     *
     * @param type
     *            The type of skeleton, corresponding to which of the Unicode confusable data tables to use. The default
     *            is Mixed-Script, Lowercase. Allowed options are SINGLE_SCRIPT_CONFUSABLE and ANY_CASE. The
     *            two flags may be ORed.
     * @param s
     *            The input string whose skeleton will be generated.
     * @return The output skeleton string.
     *
     * @stable ICU 4.6
     */
    public String getSkeleton(int type, String s) {
        // TODO: this function could be sped up a bit
        // Skip the input normalization when not needed, work from callers data.
        // It probably won't need normalization.
        if ((type & ~(SINGLE_SCRIPT_CONFUSABLE | ANY_CASE)) != 0) {
            // *status = U_ILLEGAL_ARGUMENT_ERROR;
            return null;
        }
        int tableMask = 0;
        switch (type) {
        case 0:
            tableMask = ML_TABLE_FLAG;
            break;
        case SINGLE_SCRIPT_CONFUSABLE:
            tableMask = SL_TABLE_FLAG;
            break;
        case ANY_CASE:
            tableMask = MA_TABLE_FLAG;
            break;
        case SINGLE_SCRIPT_CONFUSABLE | ANY_CASE:
            tableMask = SA_TABLE_FLAG;
            break;
        default:
            // *status = U_ILLEGAL_ARGUMENT_ERROR;
            return null;
        }

        // NFD transform of the user supplied input
        String nfdInput = Normalizer.normalize(s, Normalizer.NFD, 0);
        int normalizedLen = nfdInput.length();

        // Apply the skeleton mapping to the NFD normalized input string
        // Accumulate the skeleton, possibly unnormalized, in a String.
        int inputIndex = 0;
        StringBuilder skelStr = new StringBuilder();
        while (inputIndex < normalizedLen) {
            int c;
            c = Character.codePointAt(nfdInput, inputIndex);
            inputIndex = Character.offsetByCodePoints(nfdInput, inputIndex, 1);
            this.confusableLookup(c, tableMask, skelStr);
        }

        String result = skelStr.toString();
        String normedResult;

        // Check the skeleton for NFD, normalize it if needed.
        // Unnormalized results should be very rare.
        if (!Normalizer.isNormalized(result, Normalizer.NFD, 0)) {
            normedResult = Normalizer.normalize(result, Normalizer.NFD, 0);
            result = normedResult;
        }
        return result;
    }

    /*
     * Append the confusable skeleton transform for a single code point to a StringBuilder. The string to be appended
     * will be between 1 and 18 characters.
     *
     * This is the heart of the confusable skeleton generation implementation.
     *
     * @param tableMask bit flag specifying which confusable table to use. One of SL_TABLE_FLAG, MA_TABLE_FLAG, etc.
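     *
     * As a rough illustration of the lookup this method starts with, the sketch below
     * binary-searches a sorted array of packed keys, comparing only the code point
     * bits and ignoring the flag bits. KeyTableSketch, findKey and appliesTo are
     * hypothetical names for this example, the sample keys are made up, and the
     * 24-bit mask is an assumption taken from the key layout documented in this file.

```java
// Hypothetical illustration of a binary search over packed keys; not ICU code.
class KeyTableSketch {
    static final int SL_TABLE_FLAG = 1 << 24;
    static final int ML_TABLE_FLAG = 1 << 26;
    static final int CODE_POINT_MASK = 0x00ffffff; // bits 0-23 hold the code point

    // Returns the index of an entry for codePoint, or -1 if the table has none.
    // keys must be sorted by code point; the flag bits do not take part in the ordering.
    // A while loop is used here so that an empty table is handled as "not found".
    static int findKey(int[] keys, int codePoint) {
        int low = 0;             // inclusive
        int limit = keys.length; // exclusive
        while (low < limit) {
            int mid = low + (limit - low) / 2;
            int midc = keys[mid] & CODE_POINT_MASK;
            if (codePoint == midc) {
                return mid;
            } else if (codePoint < midc) {
                limit = mid;
            } else {
                low = mid + 1;
            }
        }
        return -1;
    }

    // True if the entry at index applies to the table selected by tableMask.
    static boolean appliesTo(int[] keys, int index, int tableMask) {
        return (keys[index] & tableMask) != 0;
    }

    public static void main(String[] args) {
        int[] keys = { 0x41 | SL_TABLE_FLAG, 0x430 | ML_TABLE_FLAG, 0x1D702 | SL_TABLE_FLAG };
        int index = findKey(keys, 0x430);
        System.out.println("index=" + index + " ML=" + appliesTo(keys, index, ML_TABLE_FLAG));
    }
}
```

     * A code point that is not found, or whose entry does not apply to the requested
     * table, is treated by the real method as mapping to itself.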
     */
    private void confusableLookup(int inChar, int tableMask, StringBuilder dest) {
        // Binary search the spoof data key table for the inChar
        int low = 0;
        int mid = 0;
        int limit = fSpoofData.fRawData.fCFUKeysSize;
        int midc;
        boolean foundChar = false;
        // [low, limit), i.e. low is inclusive, limit is exclusive
        do {
            int delta = (limit - low) / 2;
            mid = low + delta;
            midc = fSpoofData.fCFUKeys[mid] & 0x1fffff;
            if (inChar == midc) {
                foundChar = true;
                break;
            } else if (inChar < midc) {
                limit = mid; // limit is exclusive
            } else {
                // We have checked that mid is not the char we are looking for;
                // the next char we want to check is (mid + 1).
                low = mid + 1; // low is inclusive
            }
        } while (low < limit);
        if (!foundChar) { // Char not found. It maps to itself.
            dest.appendCodePoint(inChar);
            return;
        }

        boolean foundKey = false;
        int keyFlags = fSpoofData.fCFUKeys[mid] & 0xff000000;
        if ((keyFlags & tableMask) == 0) {
            // We found the right key char, but the entry doesn't pertain to the
            // table we need. See if there is an adjacent key that does.
            if (0 != (keyFlags & SpoofChecker.KEY_MULTIPLE_VALUES)) {
                int altMid;
                for (altMid = mid - 1; (fSpoofData.fCFUKeys[altMid] & 0x00ffffff) == inChar; altMid--) {
                    keyFlags = fSpoofData.fCFUKeys[altMid] & 0xff000000;
                    if (0 != (keyFlags & tableMask)) {
                        mid = altMid;
                        foundKey = true;
                        break;
                    }
                }
                if (!foundKey) {
                    for (altMid = mid + 1; (fSpoofData.fCFUKeys[altMid] & 0x00ffffff) == inChar; altMid++) {
                        keyFlags = fSpoofData.fCFUKeys[altMid] & 0xff000000;
                        if (0 != (keyFlags & tableMask)) {
                            mid = altMid;
                            foundKey = true;
                            break;
                        }
                    }
                }
            }
            if (!foundKey) {
                // No key entry for this char & table.
                // The input char maps to itself.
                dest.appendCodePoint(inChar);
                return;
            }
        }

        int stringLen = getKeyLength(keyFlags) + 1;
        int keyTableIndex = mid;

        // Value is either a char (for strings of length 1) or
        // an index into the string table (for longer strings)
        short value = fSpoofData.fCFUValues[keyTableIndex];
        if (stringLen == 1) {
            dest.append((char) value);
            return;
        }

        // String length of 4 from the above lookup is used for all strings of length >= 4.
        // For these, get the real length from the string lengths table,
        // which maps string table indexes to lengths.
        // All strings of the same length are stored contiguously in the string table.
        // 'value' from the lookup above is the starting index for the desired string.
        int ix;
        if (stringLen == 4) {
            int stringLengthsLimit = fSpoofData.fRawData.fCFUStringLengthsSize;
            for (ix = 0; ix < stringLengthsLimit; ix++) {
                if (fSpoofData.fCFUStringLengths[ix].fLastString >= value) {
                    stringLen = fSpoofData.fCFUStringLengths[ix].fStrLength;
                    break;
                }
            }
            assert (ix < stringLengthsLimit);
        }

        assert (value + stringLen <= fSpoofData.fRawData.fCFUStringTableLen);
        dest.append(fSpoofData.fCFUStrings, value, stringLen);
        return;
    }

    // WholeScript and MixedScript check implementation.
    //
    // Implementation for Whole Script tests.
    // Return the test bit flag to be ORed into the eventual user return value
    // if a Spoof opportunity is detected.
    // Input text is already normalized to NFD
    // Return the set of scripts, each of which can represent something that is
    // confusable with the input text. The script of the input text
    // is included; input consisting of characters from a single script will
    // always produce a result consisting of a set containing that script.
    void wholeScriptCheck(CharSequence text, ScriptSet result) {
        int inputIdx = 0;
        int c;

        Trie2 table = (0 != (fChecks & ANY_CASE)) ?
                fSpoofData.fAnyCaseTrie : fSpoofData.fLowerCaseTrie;
        result.setAll();
        while (inputIdx < text.length()) {
            c = Character.codePointAt(text, inputIdx);
            inputIdx = Character.offsetByCodePoints(text, inputIdx, 1);
            int index = table.get(c);
            if (index == 0) {
                // No confusables in another script for this char.
                // TODO: we should change the data to have sets with just the single script
                // bit for the script of this char. Gets rid of this special case.
                // Until then, grab the script from the char and intersect it with the set.
                int cpScript = UScript.getScript(c);
                assert (cpScript > UScript.INHERITED);
                result.intersect(cpScript);
            } else if (index == 1) {
                // Script == Common or Inherited. Nothing to do.
            } else {
                result.intersect(fSpoofData.fScriptSets[index]);
            }
        }
    }

    /**
     * Scan a string to determine how many scripts it includes. Ignore characters with script=Common and
     * script=Inherited.
     *
     * @param text
     *            The char text to be scanned
     * @param checkResult
     *            Optional caller provided fill-in parameter. If not null, on return its position field will be
     *            set to the first input position at which a second script was encountered, ignoring Common and
     *            Inherited.
     * @return the number of (non-common,inherited) scripts encountered, clipped to a max of two.
     * @internal
     */
    int scriptScan(CharSequence text, CheckResult checkResult) {
        int inputIdx = 0;
        int c;
        int scriptCount = 0;
        int lastScript = UScript.INVALID_CODE;
        int sc = UScript.INVALID_CODE;
        while ((inputIdx < text.length()) && scriptCount < 2) {
            c = Character.codePointAt(text, inputIdx);
            inputIdx = Character.offsetByCodePoints(text, inputIdx, 1);
            sc = UScript.getScript(c);
            if (sc == UScript.COMMON || sc == UScript.INHERITED || sc == UScript.UNKNOWN) {
                continue;
            }

            // Temporary fix: fold Japanese and Korean into Han.
            // Names are allowed to mix these scripts.
            // A more general solution will follow later for characters that are
            // used with multiple scripts.
            if (sc == UScript.KATAKANA || sc == UScript.HIRAGANA || sc == UScript.HANGUL) {
                sc = UScript.HAN;
            }

            if (sc != lastScript) {
                scriptCount++;
                lastScript = sc;
            }
        }
        if (scriptCount == 2 && checkResult != null) {
            checkResult.position = inputIdx;
        }
        return scriptCount;
    }

    // Data Members
    private int fMagic; // Internal sanity check.
    private int fChecks; // Bit vector of checks to perform.

    private SpoofData fSpoofData;

    private Set<ULocale> fAllowedLocales; // The Set of allowed locales.
    private UnicodeSet fAllowedCharsSet; // The UnicodeSet of allowed characters
                                         // for this Spoof Checker. Defaults to all chars.

    //
    // Confusable Mappings Data Structures
    //
    // For the confusable data, we are essentially implementing a map,
    //   key:   a code point
    //   value: a string. Most commonly one char in length, but can be more.
    //
    // The keys are stored as a sorted array of 32 bit ints.
    //   bits 0-23   a code point value
    //   bits 24-31  flags
    //     24:    1 if entry applies to SL table
    //     25:    1 if entry applies to SA table
    //     26:    1 if entry applies to ML table
    //     27:    1 if entry applies to MA table
    //     28:    1 if there are multiple entries for this code point.
    //     29-30: length of value string, in UChars.
    //            values are (1, 2, 3, other)
    // The key table is sorted in ascending code point order. (not on the
    // 32 bit int value, the flag bits do not participate in the sorting.)
    //
    // Lookup is done by means of a binary search in the key table.
    //
    // The corresponding values are kept in a parallel array of 16 bit ints.
    // If the value string is of length 1, it is literally in the value array.
    // For longer strings, the value array contains an index into the strings table.
    //
    // String Table:
    // The strings table contains all of the value strings (those of length two or greater)
    // concatenated together into one long char (UTF-16) array.
    //
    // The array is arranged by length of the strings - all strings of the same length
    // are stored together.
    // The sections are ordered by length of the strings -
    // all two char strings first, followed by all of the three char strings,
    // etc.
    //
    // There is no nul character or other mark between adjacent strings.
    //
    // String Lengths table
    // The length of strings from 1 to 3 is flagged in the key table.
    // For strings of length 4 or longer, the string length table provides a
    // mapping between an index into the string table and the corresponding
    // length.
    // Strings of these lengths are rare, so lookup time is not an issue.
    // Each entry consists of
    //     short index of the _last_ string with this length
    //     short the length

    // Flag bits in the Key entries
    static final int SL_TABLE_FLAG = (1 << 24);
    static final int SA_TABLE_FLAG = (1 << 25);
    static final int ML_TABLE_FLAG = (1 << 26);
    static final int MA_TABLE_FLAG = (1 << 27);
    static final int KEY_MULTIPLE_VALUES = (1 << 28);
    static final int KEY_LENGTH_SHIFT = 29;

    static final int getKeyLength(int x) {
        return (((x) >> 29) & 3);
    }

    // ---------------------------------------------------------------------------------------
    //
    // Raw Binary Data Formats, as loaded from the ICU data file,
    // or as built by the builder.
    //
    // ---------------------------------------------------------------------------------------
    private static class SpoofDataHeader {
        int fMagic; // (0x8345fdef)
        byte[] fFormatVersion = new byte[4]; // Data Format. Same as the value in
                                             // class UDataInfo
                                             // if there is one associated with this data.
        int fLength; // Total length in bytes of this spoof data,
                     // including all sections, not just the header.

        // The following four sections refer to data representing the confusable
        // data from the Unicode.org data from "confusables.txt"

        int fCFUKeys; // byte offset to Keys table (from SpoofDataHeader *)
        int fCFUKeysSize; // number of entries in keys table (32 bits each)

        // TODO: change name to fCFUValues, for consistency.
        int fCFUStringIndex; // byte offset to String Indexes table
        int fCFUStringIndexSize; // number of entries in String Indexes table (16 bits each)
                                 // (number of entries must be same as in Keys table)

        int fCFUStringTable; // byte offset of String table
        int fCFUStringTableLen; // length of string table (in 16 bit UChars)

        int fCFUStringLengths; // byte offset to String Lengths table
        int fCFUStringLengthsSize; // number of entries in lengths table. (2 x 16 bits each)

        // The following sections are for data from confusablesWholeScript.txt
        int fAnyCaseTrie; // byte offset to the serialized Any Case Trie
        int fAnyCaseTrieLength; // Length (bytes) of the serialized Any Case Trie

        int fLowerCaseTrie; // byte offset to the serialized Lower Case Trie
        int fLowerCaseTrieLength; // Length (bytes) of the serialized Lower Case Trie

        int fScriptSets; // byte offset to array of ScriptSets
        int fScriptSetsLength; // Number of ScriptSets (24 bytes each)

        // The following sections are for data from xidmodifications.txt
        int[] unused = new int[15]; // Padding, Room for Expansion

        public SpoofDataHeader() {
        }

        public SpoofDataHeader(DataInputStream dis) throws IOException {
            int i;
            fMagic = dis.readInt();
            for (i = 0; i < fFormatVersion.length; i++) {
                fFormatVersion[i] = dis.readByte();
            }
            fLength = dis.readInt();
            fCFUKeys = dis.readInt();
            fCFUKeysSize = dis.readInt();
            fCFUStringIndex = dis.readInt();
            fCFUStringIndexSize = dis.readInt();
            fCFUStringTable = dis.readInt();
            fCFUStringTableLen = dis.readInt();
            fCFUStringLengths = dis.readInt();
            fCFUStringLengthsSize = dis.readInt();
            fAnyCaseTrie = dis.readInt();
            fAnyCaseTrieLength = dis.readInt();
            fLowerCaseTrie = dis.readInt();
            fLowerCaseTrieLength = dis.readInt();
            fScriptSets = dis.readInt();
            fScriptSetsLength = dis.readInt();
            for (i = 0; i < unused.length; i++) {
                unused[i] = dis.readInt();
            }
        }

        public void output(DataOutputStream os) throws java.io.IOException {
            int i;
            os.writeInt(fMagic);
            for (i = 0; i < fFormatVersion.length; i++) {
                os.writeByte(fFormatVersion[i]);
            }
            os.writeInt(fLength);
            os.writeInt(fCFUKeys);
            os.writeInt(fCFUKeysSize);
            os.writeInt(fCFUStringIndex);
            os.writeInt(fCFUStringIndexSize);
            os.writeInt(fCFUStringTable);
            os.writeInt(fCFUStringTableLen);
            os.writeInt(fCFUStringLengths);
            os.writeInt(fCFUStringLengthsSize);
            os.writeInt(fAnyCaseTrie);
            os.writeInt(fAnyCaseTrieLength);
            os.writeInt(fLowerCaseTrie);
            os.writeInt(fLowerCaseTrieLength);
            os.writeInt(fScriptSets);
            os.writeInt(fScriptSetsLength);
            for (i = 0; i < unused.length; i++) {
                os.writeInt(unused[i]);
            }
        }
    }

    // -------------------------------------------------------------------------------------
    // SpoofData
    //
    // A small class that wraps the raw (was memory mapped in the C world) spoof data.
    // Nothing in this class includes state that is specific to any particular
    // SpoofDetector object.
    // ---------------------------------------------------------------------------------------
    private static class SpoofData {

        // getDefault() - return a wrapper around the spoof data that is
        // baked into the default ICU data.
        // Load standard ICU spoof data.
        public static SpoofData getDefault() throws java.io.IOException {
            // TODO: Cache it. Lazy create, keep until cleanup.
            InputStream is = com.ibm.icu.impl.ICUData.getRequiredStream(
                    com.ibm.icu.impl.ICUResourceBundle.ICU_BUNDLE + "/confusables.cfu");
            SpoofData This = new SpoofData(is);
            return This;
        }

        // SpoofChecker Data constructor for use from data builder.
        // Initializes a new, empty data area that will be populated later.
        public SpoofData() {
            // The spoof header should already be sized to be a multiple of 16
            // bytes.
            // Just in case it's not, round it up.

            fRawData = new SpoofDataHeader();

            fRawData.fMagic = SpoofChecker.MAGIC;
            fRawData.fFormatVersion[0] = 1;
            fRawData.fFormatVersion[1] = 0;
            fRawData.fFormatVersion[2] = 0;
            fRawData.fFormatVersion[3] = 0;
        }

        // Constructor for use when creating from prebuilt default data.
        // An InputStream is what the ICU internal data loading functions provide.
        public SpoofData(InputStream is) throws java.io.IOException {
            // Seek past the ICU data header.
            // TODO: verify that the header looks good.
            DataInputStream dis = new DataInputStream(new BufferedInputStream(is));
            dis.skip(0x80);
            assert (dis.markSupported());
            dis.mark(Integer.MAX_VALUE);

            fRawData = new SpoofDataHeader(dis);
            initPtrs(dis);
        }

        // Check raw SpoofChecker Data Version compatibility.
        // Return true if it looks good.
        static boolean validateDataVersion(SpoofDataHeader rawData) {
            if (rawData == null || rawData.fMagic != SpoofChecker.MAGIC || rawData.fFormatVersion[0] > 1
                    || rawData.fFormatVersion[1] > 0) {
                return false;
            }
            return true;
        }

        // Build SpoofChecker data from a DataInputStream:
        // read from the binary data input stream and
        // initialize the pointers from this object to the various sections of the raw data.
        //
        // This function is used both during the Trie building process (multiple
        // times, as the individual data sections are added), and
        // during the opening of a SpoofChecker from prebuilt data.
        //
        // The pointers for non-existent data sections (identified by an offset of
        // 0) are set to null.
        void initPtrs(DataInputStream dis) throws java.io.IOException {
            int i;
            fCFUKeys = null;
            fCFUValues = null;
            fCFUStringLengths = null;
            fCFUStrings = null;

            // The binary file from the C world is memory-mapped, and each section of data
            // is aligned to a 16-byte boundary. To make the code more robust we call
            // reset()/skip(), which essentially seek() to the correct offset.
            dis.reset();
            dis.skip(fRawData.fCFUKeys);
            if (fRawData.fCFUKeys != 0) {
                fCFUKeys = new int[fRawData.fCFUKeysSize];
                for (i = 0; i < fRawData.fCFUKeysSize; i++) {
                    fCFUKeys[i] = dis.readInt();
                }
            }

            dis.reset();
            dis.skip(fRawData.fCFUStringIndex);
            if (fRawData.fCFUStringIndex != 0) {
                fCFUValues = new short[fRawData.fCFUStringIndexSize];
                for (i = 0; i < fRawData.fCFUStringIndexSize; i++) {
                    fCFUValues[i] = dis.readShort();
                }
            }

            dis.reset();
            dis.skip(fRawData.fCFUStringTable);
            if (fRawData.fCFUStringTable != 0) {
                fCFUStrings = new char[fRawData.fCFUStringTableLen];
                for (i = 0; i < fRawData.fCFUStringTableLen; i++) {
                    fCFUStrings[i] = dis.readChar();
                }
            }

            dis.reset();
            dis.skip(fRawData.fCFUStringLengths);
            if (fRawData.fCFUStringLengths != 0) {
                fCFUStringLengths = new SpoofStringLengthsElement[fRawData.fCFUStringLengthsSize];
                for (i = 0; i < fRawData.fCFUStringLengthsSize; i++) {
                    fCFUStringLengths[i] = new SpoofStringLengthsElement();
                    fCFUStringLengths[i].fLastString = dis.readShort();
                    fCFUStringLengths[i].fStrLength = dis.readShort();
                }
            }

            dis.reset();
            dis.skip(fRawData.fAnyCaseTrie);
            if (fAnyCaseTrie == null && fRawData.fAnyCaseTrie != 0) {
                fAnyCaseTrie = Trie2.createFromSerialized(dis);
            }

            dis.reset();
            dis.skip(fRawData.fLowerCaseTrie);
            if (fLowerCaseTrie == null && fRawData.fLowerCaseTrie != 0) {
                fLowerCaseTrie = Trie2.createFromSerialized(dis);
            }

            dis.reset();
            dis.skip(fRawData.fScriptSets);
            if (fRawData.fScriptSets != 0) {
                fScriptSets = new ScriptSet[fRawData.fScriptSetsLength];
                for (i = 0; i < fRawData.fScriptSetsLength; i++) {
                    fScriptSets[i] = new ScriptSet(dis);
                }
            }
        }

        SpoofDataHeader fRawData;

        // Confusable data
        int[] fCFUKeys;
        short[] fCFUValues;
        SpoofStringLengthsElement[] fCFUStringLengths;
        char[] fCFUStrings;

        // Whole Script Confusable Data
        Trie2 fAnyCaseTrie;
        Trie2 fLowerCaseTrie;
        ScriptSet[] fScriptSets;

        private static class SpoofStringLengthsElement {
            short fLastString; // index in string table of last string with this length
            short fStrLength; // Length of strings
        }
    }

    // -------------------------------------------------------------------------------
    //
    // ScriptSet - Script code bit sets. Used with the whole script confusable data.
    // Used both at data build and at run time.
    // Could almost be a Java BitSet, except that the input and output would
    // be awkward.
    //
    // -------------------------------------------------------------------------------
    private static class ScriptSet {
        public ScriptSet() {
        }

        public ScriptSet(DataInputStream dis) throws java.io.IOException {
            for (int j = 0; j < bits.length; j++) {
                bits[j] = dis.readInt();
            }
        }

        public void output(DataOutputStream os) throws java.io.IOException {
            for (int i = 0; i < bits.length; i++) {
                os.writeInt(bits[i]);
            }
        }

        public boolean equals(ScriptSet other) {
            for (int i = 0; i < bits.length; i++) {
                if (bits[i] != other.bits[i]) {
                    return false;
                }
            }
            return true;
        }

        public void Union(int script) {
            int index = script / 32;
            int bit = 1 << (script & 31);
            assert (index < bits.length * 4 * 4);
            bits[index] |= bit;
        }

        @SuppressWarnings("unused")
        public void Union(ScriptSet other) {
            for (int i = 0; i < bits.length; i++) {
                bits[i] |= other.bits[i];
            }
        }

        public void intersect(ScriptSet other) {
            for (int i = 0; i < bits.length; i++) {
                bits[i] &= other.bits[i];
            }
        }

        public void intersect(int script) {
            int index = script / 32;
            int bit = 1 << (script & 31);
            assert (index < bits.length * 4 * 4);
            int i;
            for (i = 0; i < index; i++) {
                bits[i] = 0;
            }
            bits[index] &= bit;
            for (i = index + 1; i < bits.length; i++) {
                bits[i] = 0;
            }
        }

        public void setAll() {
            for (int i = 0; i < bits.length; i++) {
                bits[i] = 0xffffffff;
            }
        }

        @SuppressWarnings("unused")
        public void resetAll() {
            for (int i = 0; i < bits.length; i++) {
                bits[i] = 0;
            }
        }

        public int countMembers() {
            // This bit counter is good for sparse numbers of '1's, which is
            // very much the case that we will usually have.
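            int count = 0;
            for (int i = 0; i < bits.length; i++) {
                int x = bits[i];
                // Test for non-zero rather than positive: Java ints are signed, so a word
                // with bit 31 set is negative and would otherwise be skipped entirely.
                while (x != 0) {
                    count++;
                    x &= (x - 1); // and off the least significant one bit.
                }
            }
            return count;
        }

        private int[] bits = new int[6];
    }
}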
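
The ScriptSet operations in the listing (setting a script bit, intersecting two sets, counting members) can be tried out in isolation. The following is a minimal stand-alone sketch, not the ICU class itself: the class name ScriptBitsSketch and its lower-case method names are hypothetical, and the six-word size is taken from the "24 bytes each" comment in the header. The counting loop tests for a non-zero word so that words with bit 31 set (negative in Java) are counted too.

```java
import java.util.Arrays;

/** Hypothetical stand-alone sketch of a fixed-size script bit set; not part of ICU4J. */
final class ScriptBitsSketch {
    // 6 words x 32 bits: room for script codes 0..191 (24 bytes, as in the data header comment).
    private final int[] bits = new int[6];

    /** Turn on every bit, the starting point before intersecting per-character sets. */
    void setAll() {
        Arrays.fill(bits, 0xffffffff);
    }

    /** Add a single script code to the set. */
    void union(int script) {
        bits[script / 32] |= 1 << (script & 31);
    }

    /** Keep only the scripts that are present in both sets. */
    void intersect(ScriptBitsSketch other) {
        for (int i = 0; i < bits.length; i++) {
            bits[i] &= other.bits[i];
        }
    }

    /** Count set bits by repeatedly clearing the lowest one; cheap when few bits are set. */
    int countMembers() {
        int count = 0;
        for (int word : bits) {
            int x = word;
            while (x != 0) { // non-zero test, so a word with bit 31 set is still counted
                count++;
                x &= (x - 1); // clear the least significant one bit
            }
        }
        return count;
    }
}
```

A caller would typically start from setAll(), intersect the set for each character of the input, and then check whether countMembers() is still greater than zero.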

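
The key-table layout described in the "Confusable Mappings Data Structures" comment (code point in bits 0-23, table flags in bits 24-27, a two-bit length field in bits 29-30, binary search on the code point only) can be illustrated with a small sketch. The flag constants below are copied from the listing; the class ConfusableKeySketch and the helpers makeKey, codePoint, lengthField, appliesTo and find are hypothetical names introduced only for this illustration, and the sketch returns the raw two-bit length field without interpreting it.

```java
/** Hypothetical sketch of packing and searching confusable key words; not part of ICU4J. */
final class ConfusableKeySketch {
    static final int SL_TABLE_FLAG = 1 << 24;
    static final int SA_TABLE_FLAG = 1 << 25;
    static final int ML_TABLE_FLAG = 1 << 26;
    static final int MA_TABLE_FLAG = 1 << 27;
    static final int KEY_LENGTH_SHIFT = 29;

    /** Pack a code point, table flags and the raw 2-bit length field into one key word. */
    static int makeKey(int codePoint, int tableFlags, int lengthField) {
        return (codePoint & 0x00ffffff) | tableFlags | ((lengthField & 3) << KEY_LENGTH_SHIFT);
    }

    /** Bits 0-23: the code point. */
    static int codePoint(int key) {
        return key & 0x00ffffff;
    }

    /** Bits 29-30: the raw length field, same extraction as getKeyLength in the listing. */
    static int lengthField(int key) {
        return (key >> KEY_LENGTH_SHIFT) & 3;
    }

    /** True if the entry applies to the table selected by tableFlag. */
    static boolean appliesTo(int key, int tableFlag) {
        return (key & tableFlag) != 0;
    }

    /**
     * Binary search on the code point only; the flag bits do not take part in the ordering.
     * Returns the index of some entry for cp, or -1 if there is none.
     */
    static int find(int[] keys, int cp) {
        int lo = 0;
        int hi = keys.length - 1;
        while (lo <= hi) {
            int mid = (lo + hi) >>> 1;
            int midCp = codePoint(keys[mid]);
            if (midCp < cp) {
                lo = mid + 1;
            } else if (midCp > cp) {
                hi = mid - 1;
            } else {
                return mid;
            }
        }
        return -1;
    }
}
```

When a code point has several entries (the KEY_MULTIPLE_VALUES case), a real lookup would also need to scan the neighbouring entries for the one whose table flag matches; that step is left out of this sketch.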


