All Downloads are FREE. Search and download functionalities are using the official Maven repository.

com.ibm.icu.text.Normalizer Maven / Gradle / Ivy

Go to download

International Component for Unicode for Java (ICU4J) is a mature, widely used Java library providing Unicode and Globalization support

There is a newer version: 76.1
Show newest version
/*
 *******************************************************************************
 * Copyright (C) 2000-2014, International Business Machines Corporation and
 * others. All Rights Reserved.
 *******************************************************************************
 */
package com.ibm.icu.text;
import java.nio.CharBuffer;
import java.text.CharacterIterator;

import com.ibm.icu.impl.Norm2AllModes;
import com.ibm.icu.impl.Normalizer2Impl;
import com.ibm.icu.impl.UCaseProps;
import com.ibm.icu.lang.UCharacter;
import com.ibm.icu.util.ICUCloneNotSupportedException;

/**
 * Unicode Normalization 
 *
 * 

Unicode normalization API

* * normalize transforms Unicode text into an equivalent composed or * decomposed form, allowing for easier sorting and searching of text. * normalize supports the standard normalization forms described in * * Unicode Standard Annex #15 — Unicode Normalization Forms. * * Characters with accents or other adornments can be encoded in * several different ways in Unicode. For example, take the character A-acute. * In Unicode, this can be encoded as a single character (the * "composed" form): * *
 *      00C1    LATIN CAPITAL LETTER A WITH ACUTE
 * 
* * or as two separate characters (the "decomposed" form): * *
 *      0041    LATIN CAPITAL LETTER A
 *      0301    COMBINING ACUTE ACCENT
 * 
* * To a user of your program, however, both of these sequences should be * treated as the same "user-level" character "A with acute accent". When you * are searching or comparing text, you must ensure that these two sequences are * treated equivalently. In addition, you must handle characters with more than * one accent. Sometimes the order of a character's combining accents is * significant, while in other cases accent sequences in different orders are * really equivalent. * * Similarly, the string "ffi" can be encoded as three separate letters: * *
 *      0066    LATIN SMALL LETTER F
 *      0066    LATIN SMALL LETTER F
 *      0069    LATIN SMALL LETTER I
 * 
* * or as the single character * *
 *      FB03    LATIN SMALL LIGATURE FFI
 * 
* * The ffi ligature is not a distinct semantic character, and strictly speaking * it shouldn't be in Unicode at all, but it was included for compatibility * with existing character sets that already provided it. The Unicode standard * identifies such characters by giving them "compatibility" decompositions * into the corresponding semantic characters. When sorting and searching, you * will often want to use these mappings. * * normalize helps solve these problems by transforming text into * the canonical composed and decomposed forms as shown in the first example * above. In addition, you can have it perform compatibility decompositions so * that you can treat compatibility characters the same as their equivalents. * Finally, normalize rearranges accents into the proper canonical * order, so that you do not have to worry about accent rearrangement on your * own. * * Form FCD, "Fast C or D", is also designed for collation. * It allows to work on strings that are not necessarily normalized * with an algorithm (like in collation) that works under "canonical closure", * i.e., it treats precomposed characters and their decomposed equivalents the * same. * * It is not a normalization form because it does not provide for uniqueness of * representation. Multiple strings may be canonically equivalent (their NFDs * are identical) and may all conform to FCD without being identical themselves. * * The form is defined such that the "raw decomposition", the recursive * canonical decomposition of each character, results in a string that is * canonically ordered. This means that precomposed characters are allowed for * as long as their decompositions do not need canonical reordering. * * Its advantage for a process like collation is that all NFD and most NFC texts * - and many unnormalized texts - already conform to FCD and do not need to be * normalized (NFD) for such a process. The FCD quick check will return YES for * most strings in practice. * * normalize(FCD) may be implemented with NFD. * * For more details on FCD see Unicode Technical Note #5 (Canonical Equivalence in Applications): * http://www.unicode.org/notes/tn5/#FCD * * ICU collation performs either NFD or FCD normalization automatically if * normalization is turned on for the collator object. Beyond collation and * string search, normalized strings may be useful for string equivalence * comparisons, transliteration/transcription, unique representations, etc. * * The W3C generally recommends to exchange texts in NFC. * Note also that most legacy character encodings use only precomposed forms and * often do not encode any combining marks by themselves. For conversion to such * character encodings the Unicode text needs to be normalized to NFC. * For more usage examples, see the Unicode Standard Annex. * * Note: The Normalizer class also provides API for iterative normalization. * While the setIndex() and getIndex() refer to indices in the * underlying Unicode input text, the next() and previous() methods * iterate through characters in the normalized output. * This means that there is not necessarily a one-to-one correspondence * between characters returned by next() and previous() and the indices * passed to and returned from setIndex() and getIndex(). * It is for this reason that Normalizer does not implement the CharacterIterator interface. * * @stable ICU 2.8 */ public final class Normalizer implements Cloneable { // The input text and our position in it private UCharacterIterator text; private Normalizer2 norm2; private Mode mode; private int options; // The normalization buffer is the result of normalization // of the source in [currentIndex..nextIndex[ . private int currentIndex; private int nextIndex; // A buffer for holding intermediate results private StringBuilder buffer; private int bufferPos; // Helper classes to defer loading of normalization data. private static final class ModeImpl { private ModeImpl(Normalizer2 n2) { normalizer2 = n2; } private final Normalizer2 normalizer2; } private static final class NFDModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(Norm2AllModes.getNFCInstance().decomp); } private static final class NFKDModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(Norm2AllModes.getNFKCInstance().decomp); } private static final class NFCModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(Norm2AllModes.getNFCInstance().comp); } private static final class NFKCModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(Norm2AllModes.getNFKCInstance().comp); } private static final class FCDModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(Norm2AllModes.getFCDNormalizer2()); } private static final class Unicode32 { private static final UnicodeSet INSTANCE = new UnicodeSet("[:age=3.2:]").freeze(); } private static final class NFD32ModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(new FilteredNormalizer2(Norm2AllModes.getNFCInstance().decomp, Unicode32.INSTANCE)); } private static final class NFKD32ModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(new FilteredNormalizer2(Norm2AllModes.getNFKCInstance().decomp, Unicode32.INSTANCE)); } private static final class NFC32ModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(new FilteredNormalizer2(Norm2AllModes.getNFCInstance().comp, Unicode32.INSTANCE)); } private static final class NFKC32ModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(new FilteredNormalizer2(Norm2AllModes.getNFKCInstance().comp, Unicode32.INSTANCE)); } private static final class FCD32ModeImpl { private static final ModeImpl INSTANCE = new ModeImpl(new FilteredNormalizer2(Norm2AllModes.getFCDNormalizer2(), Unicode32.INSTANCE)); } /** * Options bit set value to select Unicode 3.2 normalization * (except NormalizationCorrections). * At most one Unicode version can be selected at a time. * @stable ICU 2.6 */ public static final int UNICODE_3_2=0x20; /** * Constant indicating that the end of the iteration has been reached. * This is guaranteed to have the same value as {@link UCharacterIterator#DONE}. * @stable ICU 2.8 */ public static final int DONE = UCharacterIterator.DONE; /** * Constants for normalization modes. *

* The Mode class is not intended for public subclassing. * Only the Mode constants provided by the Normalizer class should be used, * and any fields or methods should not be called or overridden by users. * @stable ICU 2.8 */ public static abstract class Mode { /** * @internal * @deprecated This API is ICU internal only. */ @Deprecated protected abstract Normalizer2 getNormalizer2(int options); } private static final class NONEMode extends Mode { protected Normalizer2 getNormalizer2(int options) { return Norm2AllModes.NOOP_NORMALIZER2; } } private static final class NFDMode extends Mode { protected Normalizer2 getNormalizer2(int options) { return (options&UNICODE_3_2) != 0 ? NFD32ModeImpl.INSTANCE.normalizer2 : NFDModeImpl.INSTANCE.normalizer2; } } private static final class NFKDMode extends Mode { protected Normalizer2 getNormalizer2(int options) { return (options&UNICODE_3_2) != 0 ? NFKD32ModeImpl.INSTANCE.normalizer2 : NFKDModeImpl.INSTANCE.normalizer2; } } private static final class NFCMode extends Mode { protected Normalizer2 getNormalizer2(int options) { return (options&UNICODE_3_2) != 0 ? NFC32ModeImpl.INSTANCE.normalizer2 : NFCModeImpl.INSTANCE.normalizer2; } } private static final class NFKCMode extends Mode { protected Normalizer2 getNormalizer2(int options) { return (options&UNICODE_3_2) != 0 ? NFKC32ModeImpl.INSTANCE.normalizer2 : NFKCModeImpl.INSTANCE.normalizer2; } } private static final class FCDMode extends Mode { protected Normalizer2 getNormalizer2(int options) { return (options&UNICODE_3_2) != 0 ? FCD32ModeImpl.INSTANCE.normalizer2 : FCDModeImpl.INSTANCE.normalizer2; } } /** * No decomposition/composition. * @stable ICU 2.8 */ public static final Mode NONE = new NONEMode(); /** * Canonical decomposition. * @stable ICU 2.8 */ public static final Mode NFD = new NFDMode(); /** * Compatibility decomposition. * @stable ICU 2.8 */ public static final Mode NFKD = new NFKDMode(); /** * Canonical decomposition followed by canonical composition. * @stable ICU 2.8 */ public static final Mode NFC = new NFCMode(); /** * Default normalization. * @stable ICU 2.8 */ public static final Mode DEFAULT = NFC; /** * Compatibility decomposition followed by canonical composition. * @stable ICU 2.8 */ public static final Mode NFKC =new NFKCMode(); /** * "Fast C or D" form. * @stable ICU 2.8 */ public static final Mode FCD = new FCDMode(); /** * Null operation for use with the {@link com.ibm.icu.text.Normalizer constructors} * and the static {@link #normalize normalize} method. This value tells * the Normalizer to do nothing but return unprocessed characters * from the underlying String or CharacterIterator. If you have code which * requires raw text at some times and normalized text at others, you can * use NO_OP for the cases where you want raw text, rather * than having a separate code path that bypasses Normalizer * altogether. *

* @see #setMode * @deprecated ICU 2.8. Use Nomalizer.NONE * @see #NONE */ @Deprecated public static final Mode NO_OP = NONE; /** * Canonical decomposition followed by canonical composition. Used with the * {@link com.ibm.icu.text.Normalizer constructors} and the static * {@link #normalize normalize} method to determine the operation to be * performed. *

* If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned * off, this operation produces output that is in * Unicode Canonical * Form * C. *

* @see #setMode * @deprecated ICU 2.8. Use Normalier.NFC * @see #NFC */ @Deprecated public static final Mode COMPOSE = NFC; /** * Compatibility decomposition followed by canonical composition. * Used with the {@link com.ibm.icu.text.Normalizer constructors} and the static * {@link #normalize normalize} method to determine the operation to be * performed. *

* If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned * off, this operation produces output that is in * Unicode Canonical * Form * KC. *

* @see #setMode * @deprecated ICU 2.8. Use Normalizer.NFKC * @see #NFKC */ @Deprecated public static final Mode COMPOSE_COMPAT = NFKC; /** * Canonical decomposition. This value is passed to the * {@link com.ibm.icu.text.Normalizer constructors} and the static * {@link #normalize normalize} * method to determine the operation to be performed. *

* If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned * off, this operation produces output that is in * Unicode Canonical * Form * D. *

* @see #setMode * @deprecated ICU 2.8. Use Normalizer.NFD * @see #NFD */ @Deprecated public static final Mode DECOMP = NFD; /** * Compatibility decomposition. This value is passed to the * {@link com.ibm.icu.text.Normalizer constructors} and the static * {@link #normalize normalize} * method to determine the operation to be performed. *

* If all optional features (e.g. {@link #IGNORE_HANGUL}) are turned * off, this operation produces output that is in * Unicode Canonical * Form * KD. *

* @see #setMode * @deprecated ICU 2.8. Use Normalizer.NFKD * @see #NFKD */ @Deprecated public static final Mode DECOMP_COMPAT = NFKD; /** * Option to disable Hangul/Jamo composition and decomposition. * This option applies to Korean text, * which can be represented either in the Jamo alphabet or in Hangul * characters, which are really just two or three Jamo combined * into one visual glyph. Since Jamo takes up more storage space than * Hangul, applications that process only Hangul text may wish to turn * this option on when decomposing text. *

* The Unicode standard treates Hangul to Jamo conversion as a * canonical decomposition, so this option must be turned off if you * wish to transform strings into one of the standard * * Unicode Normalization Forms. *

* @see #setOption * @deprecated ICU 2.8. This option is no longer supported. */ @Deprecated public static final int IGNORE_HANGUL = 0x0001; /** * Result values for quickCheck(). * For details see Unicode Technical Report 15. * @stable ICU 2.8 */ public static final class QuickCheckResult{ //private int resultValue; private QuickCheckResult(int value) { //resultValue=value; } } /** * Indicates that string is not in the normalized format * @stable ICU 2.8 */ public static final QuickCheckResult NO = new QuickCheckResult(0); /** * Indicates that string is in the normalized format * @stable ICU 2.8 */ public static final QuickCheckResult YES = new QuickCheckResult(1); /** * Indicates it cannot be determined if string is in the normalized * format without further thorough checks. * @stable ICU 2.8 */ public static final QuickCheckResult MAYBE = new QuickCheckResult(2); /** * Option bit for compare: * Case sensitively compare the strings * @stable ICU 2.8 */ public static final int FOLD_CASE_DEFAULT = UCharacter.FOLD_CASE_DEFAULT; /** * Option bit for compare: * Both input strings are assumed to fulfill FCD conditions. * @stable ICU 2.8 */ public static final int INPUT_IS_FCD = 0x20000; /** * Option bit for compare: * Perform case-insensitive comparison. * @stable ICU 2.8 */ public static final int COMPARE_IGNORE_CASE = 0x10000; /** * Option bit for compare: * Compare strings in code point order instead of code unit order. * @stable ICU 2.8 */ public static final int COMPARE_CODE_POINT_ORDER = 0x8000; /** * Option value for case folding: * Use the modified set of mappings provided in CaseFolding.txt to handle dotted I * and dotless i appropriately for Turkic languages (tr, az). * @see UCharacter#FOLD_CASE_EXCLUDE_SPECIAL_I * @stable ICU 2.8 */ public static final int FOLD_CASE_EXCLUDE_SPECIAL_I = UCharacter.FOLD_CASE_EXCLUDE_SPECIAL_I; /** * Lowest-order bit number of compare() options bits corresponding to * normalization options bits. * * The options parameter for compare() uses most bits for * itself and for various comparison and folding flags. * The most significant bits, however, are shifted down and passed on * to the normalization implementation. * (That is, from compare(..., options, ...), * options>>COMPARE_NORM_OPTIONS_SHIFT will be passed on to the * internal normalization functions.) * * @see #compare * @stable ICU 2.6 */ public static final int COMPARE_NORM_OPTIONS_SHIFT = 20; //------------------------------------------------------------------------- // Iterator constructors //------------------------------------------------------------------------- /** * Creates a new Normalizer object for iterating over the * normalized form of a given string. *

* The options parameter specifies which optional * Normalizer features are to be enabled for this object. *

* @param str The string to be normalized. The normalization * will start at the beginning of the string. * * @param mode The normalization mode. * * @param opt Any optional features to be enabled. * Currently the only available option is {@link #UNICODE_3_2}. * If you want the default behavior corresponding to one of the * standard Unicode Normalization Forms, use 0 for this argument. * @stable ICU 2.6 */ public Normalizer(String str, Mode mode, int opt) { this.text = UCharacterIterator.getInstance(str); this.mode = mode; this.options=opt; norm2 = mode.getNormalizer2(opt); buffer = new StringBuilder(); } /** * Creates a new Normalizer object for iterating over the * normalized form of the given text. *

* @param iter The input text to be normalized. The normalization * will start at the beginning of the string. * * @param mode The normalization mode. * * @param opt Any optional features to be enabled. * Currently the only available option is {@link #UNICODE_3_2}. * If you want the default behavior corresponding to one of the * standard Unicode Normalization Forms, use 0 for this argument. * @stable ICU 2.6 */ public Normalizer(CharacterIterator iter, Mode mode, int opt) { this.text = UCharacterIterator.getInstance((CharacterIterator)iter.clone()); this.mode = mode; this.options = opt; norm2 = mode.getNormalizer2(opt); buffer = new StringBuilder(); } /** * Creates a new Normalizer object for iterating over the * normalized form of the given text. *

* @param iter The input text to be normalized. The normalization * will start at the beginning of the string. * * @param mode The normalization mode. * @param options The normalization options, ORed together (0 for no options). * @stable ICU 2.6 */ public Normalizer(UCharacterIterator iter, Mode mode, int options) { try { this.text = (UCharacterIterator)iter.clone(); this.mode = mode; this.options = options; norm2 = mode.getNormalizer2(options); buffer = new StringBuilder(); } catch (CloneNotSupportedException e) { throw new ICUCloneNotSupportedException(e); } } /** * Clones this Normalizer object. All properties of this * object are duplicated in the new object, including the cloning of any * {@link CharacterIterator} that was passed in to the constructor * or to {@link #setText(CharacterIterator) setText}. * However, the text storage underlying * the CharacterIterator is not duplicated unless the * iterator's clone method does so. * @stable ICU 2.8 */ public Object clone() { try { Normalizer copy = (Normalizer) super.clone(); copy.text = (UCharacterIterator) text.clone(); copy.mode = mode; copy.options = options; copy.norm2 = norm2; copy.buffer = new StringBuilder(buffer); copy.bufferPos = bufferPos; copy.currentIndex = currentIndex; copy.nextIndex = nextIndex; return copy; } catch (CloneNotSupportedException e) { throw new ICUCloneNotSupportedException(e); } } //-------------------------------------------------------------------------- // Static Utility methods //-------------------------------------------------------------------------- private static final Normalizer2 getComposeNormalizer2(boolean compat, int options) { return (compat ? NFKC : NFC).getNormalizer2(options); } private static final Normalizer2 getDecomposeNormalizer2(boolean compat, int options) { return (compat ? NFKD : NFD).getNormalizer2(options); } /** * Compose a string. * The string will be composed to according to the specified mode. * @param str The string to compose. * @param compat If true the string will be composed according to * NFKC rules and if false will be composed according to * NFC rules. * @return String The composed string * @stable ICU 2.8 */ public static String compose(String str, boolean compat) { return compose(str,compat,0); } /** * Compose a string. * The string will be composed to according to the specified mode. * @param str The string to compose. * @param compat If true the string will be composed according to * NFKC rules and if false will be composed according to * NFC rules. * @param options The only recognized option is UNICODE_3_2 * @return String The composed string * @stable ICU 2.6 */ public static String compose(String str, boolean compat, int options) { return getComposeNormalizer2(compat, options).normalize(str); } /** * Compose a string. * The string will be composed to according to the specified mode. * @param source The char array to compose. * @param target A char buffer to receive the normalized text. * @param compat If true the char array will be composed according to * NFKC rules and if false will be composed according to * NFC rules. * @param options The normalization options, ORed together (0 for no options). * @return int The total buffer size needed;if greater than length of * result, the output was truncated. * @exception IndexOutOfBoundsException if target.length is less than the * required length * @stable ICU 2.6 */ public static int compose(char[] source,char[] target, boolean compat, int options) { return compose(source, 0, source.length, target, 0, target.length, compat, options); } /** * Compose a string. * The string will be composed to according to the specified mode. * @param src The char array to compose. * @param srcStart Start index of the source * @param srcLimit Limit index of the source * @param dest The char buffer to fill in * @param destStart Start index of the destination buffer * @param destLimit End index of the destination buffer * @param compat If true the char array will be composed according to * NFKC rules and if false will be composed according to * NFC rules. * @param options The normalization options, ORed together (0 for no options). * @return int The total buffer size needed;if greater than length of * result, the output was truncated. * @exception IndexOutOfBoundsException if target.length is less than the * required length * @stable ICU 2.6 */ public static int compose(char[] src,int srcStart, int srcLimit, char[] dest,int destStart, int destLimit, boolean compat, int options) { CharBuffer srcBuffer = CharBuffer.wrap(src, srcStart, srcLimit - srcStart); CharsAppendable app = new CharsAppendable(dest, destStart, destLimit); getComposeNormalizer2(compat, options).normalize(srcBuffer, app); return app.length(); } /** * Decompose a string. * The string will be decomposed to according to the specified mode. * @param str The string to decompose. * @param compat If true the string will be decomposed according to NFKD * rules and if false will be decomposed according to NFD * rules. * @return String The decomposed string * @stable ICU 2.8 */ public static String decompose(String str, boolean compat) { return decompose(str,compat,0); } /** * Decompose a string. * The string will be decomposed to according to the specified mode. * @param str The string to decompose. * @param compat If true the string will be decomposed according to NFKD * rules and if false will be decomposed according to NFD * rules. * @param options The normalization options, ORed together (0 for no options). * @return String The decomposed string * @stable ICU 2.6 */ public static String decompose(String str, boolean compat, int options) { return getDecomposeNormalizer2(compat, options).normalize(str); } /** * Decompose a string. * The string will be decomposed to according to the specified mode. * @param source The char array to decompose. * @param target A char buffer to receive the normalized text. * @param compat If true the char array will be decomposed according to NFKD * rules and if false will be decomposed according to * NFD rules. * @return int The total buffer size needed;if greater than length of * result,the output was truncated. * @param options The normalization options, ORed together (0 for no options). * @exception IndexOutOfBoundsException if the target capacity is less than * the required length * @stable ICU 2.6 */ public static int decompose(char[] source,char[] target, boolean compat, int options) { return decompose(source, 0, source.length, target, 0, target.length, compat, options); } /** * Decompose a string. * The string will be decomposed to according to the specified mode. * @param src The char array to compose. * @param srcStart Start index of the source * @param srcLimit Limit index of the source * @param dest The char buffer to fill in * @param destStart Start index of the destination buffer * @param destLimit End index of the destination buffer * @param compat If true the char array will be decomposed according to NFKD * rules and if false will be decomposed according to * NFD rules. * @param options The normalization options, ORed together (0 for no options). * @return int The total buffer size needed;if greater than length of * result,the output was truncated. * @exception IndexOutOfBoundsException if the target capacity is less than * the required length * @stable ICU 2.6 */ public static int decompose(char[] src,int srcStart, int srcLimit, char[] dest,int destStart, int destLimit, boolean compat, int options) { CharBuffer srcBuffer = CharBuffer.wrap(src, srcStart, srcLimit - srcStart); CharsAppendable app = new CharsAppendable(dest, destStart, destLimit); getDecomposeNormalizer2(compat, options).normalize(srcBuffer, app); return app.length(); } /** * Normalizes a String using the given normalization operation. *

* The options parameter specifies which optional * Normalizer features are to be enabled for this operation. * Currently the only available option is {@link #UNICODE_3_2}. * If you want the default behavior corresponding to one of the standard * Unicode Normalization Forms, use 0 for this argument. *

* @param str the input string to be normalized. * @param mode the normalization mode * @param options the optional features to be enabled. * @return String the normalized string * @stable ICU 2.6 */ public static String normalize(String str, Mode mode, int options) { return mode.getNormalizer2(options).normalize(str); } /** * Normalize a string. * The string will be normalized according to the specified normalization * mode and options. * @param src The string to normalize. * @param mode The normalization mode; one of Normalizer.NONE, * Normalizer.NFD, Normalizer.NFC, Normalizer.NFKC, * Normalizer.NFKD, Normalizer.DEFAULT * @return the normalized string * @stable ICU 2.8 * */ public static String normalize(String src,Mode mode) { return normalize(src, mode, 0); } /** * Normalize a string. * The string will be normalized according to the specified normalization * mode and options. * @param source The char array to normalize. * @param target A char buffer to receive the normalized text. * @param mode The normalization mode; one of Normalizer.NONE, * Normalizer.NFD, Normalizer.NFC, Normalizer.NFKC, * Normalizer.NFKD, Normalizer.DEFAULT * @param options The normalization options, ORed together (0 for no options). * @return int The total buffer size needed;if greater than length of * result, the output was truncated. * @exception IndexOutOfBoundsException if the target capacity is less * than the required length * @stable ICU 2.6 */ public static int normalize(char[] source,char[] target, Mode mode, int options) { return normalize(source,0,source.length,target,0,target.length,mode, options); } /** * Normalize a string. * The string will be normalized according to the specified normalization * mode and options. * @param src The char array to compose. * @param srcStart Start index of the source * @param srcLimit Limit index of the source * @param dest The char buffer to fill in * @param destStart Start index of the destination buffer * @param destLimit End index of the destination buffer * @param mode The normalization mode; one of Normalizer.NONE, * Normalizer.NFD, Normalizer.NFC, Normalizer.NFKC, * Normalizer.NFKD, Normalizer.DEFAULT * @param options The normalization options, ORed together (0 for no options). * @return int The total buffer size needed;if greater than length of * result, the output was truncated. * @exception IndexOutOfBoundsException if the target capacity is * less than the required length * @stable ICU 2.6 */ public static int normalize(char[] src,int srcStart, int srcLimit, char[] dest,int destStart, int destLimit, Mode mode, int options) { CharBuffer srcBuffer = CharBuffer.wrap(src, srcStart, srcLimit - srcStart); CharsAppendable app = new CharsAppendable(dest, destStart, destLimit); mode.getNormalizer2(options).normalize(srcBuffer, app); return app.length(); } /** * Normalize a codepoint according to the given mode * @param char32 The input string to be normalized. * @param mode The normalization mode * @param options Options for use with exclusion set and tailored Normalization * The only option that is currently recognized is UNICODE_3_2 * @return String The normalized string * @stable ICU 2.6 * @see #UNICODE_3_2 */ public static String normalize(int char32, Mode mode, int options) { if(mode == NFD && options == 0) { String decomposition = Norm2AllModes.getNFCInstance().impl.getDecomposition(char32); if(decomposition == null) { decomposition = UTF16.valueOf(char32); } return decomposition; } return normalize(UTF16.valueOf(char32), mode, options); } /** * Convenience method to normalize a codepoint according to the given mode * @param char32 The input string to be normalized. * @param mode The normalization mode * @return String The normalized string * @stable ICU 2.6 */ public static String normalize(int char32, Mode mode) { return normalize(char32, mode, 0); } /** * Convenience method. * * @param source string for determining if it is in a normalized format * @param mode normalization format (Normalizer.NFC,Normalizer.NFD, * Normalizer.NFKC,Normalizer.NFKD) * @return Return code to specify if the text is normalized or not * (Normalizer.YES, Normalizer.NO or Normalizer.MAYBE) * @stable ICU 2.8 */ public static QuickCheckResult quickCheck(String source, Mode mode) { return quickCheck(source, mode, 0); } /** * Performing quick check on a string, to quickly determine if the string is * in a particular normalization format. * Three types of result can be returned Normalizer.YES, Normalizer.NO or * Normalizer.MAYBE. Result Normalizer.YES indicates that the argument * string is in the desired normalized format, Normalizer.NO determines that * argument string is not in the desired normalized format. A * Normalizer.MAYBE result indicates that a more thorough check is required, * the user may have to put the string in its normalized form and compare * the results. * * @param source string for determining if it is in a normalized format * @param mode normalization format (Normalizer.NFC,Normalizer.NFD, * Normalizer.NFKC,Normalizer.NFKD) * @param options Options for use with exclusion set and tailored Normalization * The only option that is currently recognized is UNICODE_3_2 * @return Return code to specify if the text is normalized or not * (Normalizer.YES, Normalizer.NO or Normalizer.MAYBE) * @stable ICU 2.6 */ public static QuickCheckResult quickCheck(String source, Mode mode, int options) { return mode.getNormalizer2(options).quickCheck(source); } /** * Convenience method. * * @param source Array of characters for determining if it is in a * normalized format * @param mode normalization format (Normalizer.NFC,Normalizer.NFD, * Normalizer.NFKC,Normalizer.NFKD) * @param options Options for use with exclusion set and tailored Normalization * The only option that is currently recognized is UNICODE_3_2 * @return Return code to specify if the text is normalized or not * (Normalizer.YES, Normalizer.NO or Normalizer.MAYBE) * @stable ICU 2.6 */ public static QuickCheckResult quickCheck(char[] source, Mode mode, int options) { return quickCheck(source, 0, source.length, mode, options); } /** * Performing quick check on a string, to quickly determine if the string is * in a particular normalization format. * Three types of result can be returned Normalizer.YES, Normalizer.NO or * Normalizer.MAYBE. Result Normalizer.YES indicates that the argument * string is in the desired normalized format, Normalizer.NO determines that * argument string is not in the desired normalized format. A * Normalizer.MAYBE result indicates that a more thorough check is required, * the user may have to put the string in its normalized form and compare * the results. * * @param source string for determining if it is in a normalized format * @param start the start index of the source * @param limit the limit index of the source it is equal to the length * @param mode normalization format (Normalizer.NFC,Normalizer.NFD, * Normalizer.NFKC,Normalizer.NFKD) * @param options Options for use with exclusion set and tailored Normalization * The only option that is currently recognized is UNICODE_3_2 * @return Return code to specify if the text is normalized or not * (Normalizer.YES, Normalizer.NO or * Normalizer.MAYBE) * @stable ICU 2.6 */ public static QuickCheckResult quickCheck(char[] source,int start, int limit, Mode mode,int options) { CharBuffer srcBuffer = CharBuffer.wrap(source, start, limit - start); return mode.getNormalizer2(options).quickCheck(srcBuffer); } /** * Test if a string is in a given normalization form. * This is semantically equivalent to source.equals(normalize(source, mode)). * * Unlike quickCheck(), this function returns a definitive result, * never a "maybe". * For NFD, NFKD, and FCD, both functions work exactly the same. * For NFC and NFKC where quickCheck may return "maybe", this function will * perform further tests to arrive at a true/false result. * @param src The input array of characters to be checked to see if * it is normalized * @param start The strart index in the source * @param limit The limit index in the source * @param mode the normalization mode * @param options Options for use with exclusion set and tailored Normalization * The only option that is currently recognized is UNICODE_3_2 * @return Boolean value indicating whether the source string is in the * "mode" normalization form * @stable ICU 2.6 */ public static boolean isNormalized(char[] src,int start, int limit, Mode mode, int options) { CharBuffer srcBuffer = CharBuffer.wrap(src, start, limit - start); return mode.getNormalizer2(options).isNormalized(srcBuffer); } /** * Test if a string is in a given normalization form. * This is semantically equivalent to source.equals(normalize(source, mode)). * * Unlike quickCheck(), this function returns a definitive result, * never a "maybe". * For NFD, NFKD, and FCD, both functions work exactly the same. * For NFC and NFKC where quickCheck may return "maybe", this function will * perform further tests to arrive at a true/false result. * @param str the input string to be checked to see if it is * normalized * @param mode the normalization mode * @param options Options for use with exclusion set and tailored Normalization * The only option that is currently recognized is UNICODE_3_2 * @see #isNormalized * @stable ICU 2.6 */ public static boolean isNormalized(String str, Mode mode, int options) { return mode.getNormalizer2(options).isNormalized(str); } /** * Convenience Method * @param char32 the input code point to be checked to see if it is * normalized * @param mode the normalization mode * @param options Options for use with exclusion set and tailored Normalization * The only option that is currently recognized is UNICODE_3_2 * * @see #isNormalized * @stable ICU 2.6 */ public static boolean isNormalized(int char32, Mode mode,int options) { return isNormalized(UTF16.valueOf(char32), mode, options); } /** * Compare two strings for canonical equivalence. * Further options include case-insensitive comparison and * code point order (as opposed to code unit order). * * Canonical equivalence between two strings is defined as their normalized * forms (NFD or NFC) being identical. * This function compares strings incrementally instead of normalizing * (and optionally case-folding) both strings entirely, * improving performance significantly. * * Bulk normalization is only necessary if the strings do not fulfill the * FCD conditions. Only in this case, and only if the strings are relatively * long, is memory allocated temporarily. * For FCD strings and short non-FCD strings there is no memory allocation. * * Semantically, this is equivalent to * strcmp[CodePointOrder](foldCase(NFD(s1)), foldCase(NFD(s2))) * where code point order and foldCase are all optional. * * @param s1 First source character array. * @param s1Start start index of source * @param s1Limit limit of the source * * @param s2 Second source character array. * @param s2Start start index of the source * @param s2Limit limit of the source * * @param options A bit set of options: * - FOLD_CASE_DEFAULT or 0 is used for default options: * Case-sensitive comparison in code unit order, and the input strings * are quick-checked for FCD. * * - INPUT_IS_FCD * Set if the caller knows that both s1 and s2 fulfill the FCD * conditions.If not set, the function will quickCheck for FCD * and normalize if necessary. * * - COMPARE_CODE_POINT_ORDER * Set to choose code point order instead of code unit order * * - COMPARE_IGNORE_CASE * Set to compare strings case-insensitively using case folding, * instead of case-sensitively. * If set, then the following case folding options are used. * * * @return <0 or 0 or >0 as usual for string comparisons * * @see #normalize * @see #FCD * @stable ICU 2.8 */ public static int compare(char[] s1, int s1Start, int s1Limit, char[] s2, int s2Start, int s2Limit, int options) { if( s1==null || s1Start<0 || s1Limit<0 || s2==null || s2Start<0 || s2Limit<0 || s1Limit0 as usual for string comparisons * * @see #normalize * @see #FCD * @stable ICU 2.8 */ public static int compare(String s1, String s2, int options) { return internalCompare(s1, s2, options); } /** * Compare two strings for canonical equivalence. * Further options include case-insensitive comparison and * code point order (as opposed to code unit order). * Convenience method. * * @param s1 First source string. * @param s2 Second source string. * * @param options A bit set of options: * - FOLD_CASE_DEFAULT or 0 is used for default options: * Case-sensitive comparison in code unit order, and the input strings * are quick-checked for FCD. * * - INPUT_IS_FCD * Set if the caller knows that both s1 and s2 fulfill the FCD * conditions. If not set, the function will quickCheck for FCD * and normalize if necessary. * * - COMPARE_CODE_POINT_ORDER * Set to choose code point order instead of code unit order * * - COMPARE_IGNORE_CASE * Set to compare strings case-insensitively using case folding, * instead of case-sensitively. * If set, then the following case folding options are used. * * @return <0 or 0 or >0 as usual for string comparisons * * @see #normalize * @see #FCD * @stable ICU 2.8 */ public static int compare(char[] s1, char[] s2, int options) { return internalCompare(CharBuffer.wrap(s1), CharBuffer.wrap(s2), options); } /** * Convenience method that can have faster implementation * by not allocating buffers. * @param char32a the first code point to be checked against the * @param char32b the second code point * @param options A bit set of options * @stable ICU 2.8 */ public static int compare(int char32a, int char32b, int options) { return internalCompare(UTF16.valueOf(char32a), UTF16.valueOf(char32b), options|INPUT_IS_FCD); } /** * Convenience method that can have faster implementation * by not allocating buffers. * @param char32a the first code point to be checked against * @param str2 the second string * @param options A bit set of options * @stable ICU 2.8 */ public static int compare(int char32a, String str2, int options) { return internalCompare(UTF16.valueOf(char32a), str2, options); } /* Concatenation of normalized strings --------------------------------- */ /** * Concatenate normalized strings, making sure that the result is normalized * as well. * * If both the left and the right strings are in * the normalization form according to "mode", * then the result will be * * * dest=normalize(left+right, mode) * * * With the input strings already being normalized, * this function will use next() and previous() * to find the adjacent end pieces of the input strings. * Only the concatenation of these end pieces will be normalized and * then concatenated with the remaining parts of the input strings. * * It is allowed to have dest==left to avoid copying the entire left string. * * @param left Left source array, may be same as dest. * @param leftStart start in the left array. * @param leftLimit limit in the left array (==length) * @param right Right source array. * @param rightStart start in the right array. * @param rightLimit limit in the right array (==length) * @param dest The output buffer; can be null if destStart==destLimit==0 * for pure preflighting. * @param destStart start in the destination array * @param destLimit limit in the destination array (==length) * @param mode The normalization mode. * @param options The normalization options, ORed together (0 for no options). * @return Length of output (number of chars) when successful or * IndexOutOfBoundsException * @exception IndexOutOfBoundsException whose message has the string * representation of destination capacity required. * @see #normalize * @see #next * @see #previous * @exception IndexOutOfBoundsException if target capacity is less than the * required length * @stable ICU 2.8 */ public static int concatenate(char[] left, int leftStart, int leftLimit, char[] right, int rightStart, int rightLimit, char[] dest, int destStart, int destLimit, Normalizer.Mode mode, int options) { if(dest == null) { throw new IllegalArgumentException(); } /* check for overlapping right and destination */ if (right == dest && rightStart < destLimit && destStart < rightLimit) { throw new IllegalArgumentException("overlapping right and dst ranges"); } /* allow left==dest */ StringBuilder destBuilder=new StringBuilder(leftLimit-leftStart+rightLimit-rightStart+16); destBuilder.append(left, leftStart, leftLimit-leftStart); CharBuffer rightBuffer=CharBuffer.wrap(right, rightStart, rightLimit-rightStart); mode.getNormalizer2(options).append(destBuilder, rightBuffer); int destLength=destBuilder.length(); if(destLength<=(destLimit-destStart)) { destBuilder.getChars(0, destLength, dest, destStart); return destLength; } else { throw new IndexOutOfBoundsException(Integer.toString(destLength)); } } /** * Concatenate normalized strings, making sure that the result is normalized * as well. * * If both the left and the right strings are in * the normalization form according to "mode", * then the result will be * * * dest=normalize(left+right, mode) * * * For details see concatenate * * @param left Left source string. * @param right Right source string. * @param mode The normalization mode. * @param options The normalization options, ORed together (0 for no options). * @return result * * @see #concatenate * @see #normalize * @see #next * @see #previous * @see #concatenate * @stable ICU 2.8 */ public static String concatenate(char[] left, char[] right,Mode mode, int options) { StringBuilder dest=new StringBuilder(left.length+right.length+16).append(left); return mode.getNormalizer2(options).append(dest, CharBuffer.wrap(right)).toString(); } /** * Concatenate normalized strings, making sure that the result is normalized * as well. * * If both the left and the right strings are in * the normalization form according to "mode", * then the result will be * * * dest=normalize(left+right, mode) * * * With the input strings already being normalized, * this function will use next() and previous() * to find the adjacent end pieces of the input strings. * Only the concatenation of these end pieces will be normalized and * then concatenated with the remaining parts of the input strings. * * @param left Left source string. * @param right Right source string. * @param mode The normalization mode. * @param options The normalization options, ORed together (0 for no options). * @return result * * @see #concatenate * @see #normalize * @see #next * @see #previous * @see #concatenate * @stable ICU 2.8 */ public static String concatenate(String left, String right, Mode mode, int options) { StringBuilder dest=new StringBuilder(left.length()+right.length()+16).append(left); return mode.getNormalizer2(options).append(dest, right).toString(); } /** * Gets the FC_NFKC closure value. * @param c The code point whose closure value is to be retrieved * @param dest The char array to receive the closure value * @return the length of the closure value; 0 if there is none * @stable ICU 3.8 */ public static int getFC_NFKC_Closure(int c,char[] dest) { String closure=getFC_NFKC_Closure(c); int length=closure.length(); if(length!=0 && dest!=null && length<=dest.length) { closure.getChars(0, length, dest, 0); } return length; } /** * Gets the FC_NFKC closure value. * @param c The code point whose closure value is to be retrieved * @return String representation of the closure value; "" if there is none * @stable ICU 3.8 */ public static String getFC_NFKC_Closure(int c) { // Compute the FC_NFKC_Closure on the fly: // We have the API for complete coverage of Unicode properties, although // this value by itself is not useful via API. // (What could be useful is a custom normalization table that combines // case folding and NFKC.) // For the derivation, see Unicode's DerivedNormalizationProps.txt. Normalizer2 nfkc=NFKCModeImpl.INSTANCE.normalizer2; UCaseProps csp=UCaseProps.INSTANCE; // first: b = NFKC(Fold(a)) StringBuilder folded=new StringBuilder(); int folded1Length=csp.toFullFolding(c, folded, 0); if(folded1Length<0) { Normalizer2Impl nfkcImpl=((Norm2AllModes.Normalizer2WithImpl)nfkc).impl; if(nfkcImpl.getCompQuickCheck(nfkcImpl.getNorm16(c))!=0) { return ""; // c does not change at all under CaseFolding+NFKC } folded.appendCodePoint(c); } else { if(folded1Length>UCaseProps.MAX_STRING_LENGTH) { folded.appendCodePoint(folded1Length); } } String kc1=nfkc.normalize(folded); // second: c = NFKC(Fold(b)) String kc2=nfkc.normalize(UCharacter.foldCase(kc1, 0)); // if (c != b) add the mapping from a to c if(kc1.equals(kc2)) { return ""; } else { return kc2; } } //------------------------------------------------------------------------- // Iteration API //------------------------------------------------------------------------- /** * Return the current character in the normalized text. * @return The codepoint as an int * @stable ICU 2.8 */ public int current() { if(bufferPos0 || previousNormalize()) { int c=buffer.codePointBefore(bufferPos); bufferPos-=Character.charCount(c); return c; } else { return DONE; } } /** * Reset the index to the beginning of the text. * This is equivalent to setIndexOnly(startIndex)). * @stable ICU 2.8 */ public void reset() { text.setToStart(); currentIndex=nextIndex=0; clearBuffer(); } /** * Set the iteration position in the input text that is being normalized, * without any immediate normalization. * After setIndexOnly(), getIndex() will return the same index that is * specified here. * * @param index the desired index in the input text. * @stable ICU 2.8 */ public void setIndexOnly(int index) { text.setIndex(index); // validates index currentIndex=nextIndex=index; clearBuffer(); } /** * Set the iteration position in the input text that is being normalized * and return the first normalized character at that position. *

* Note: This method sets the position in the input text, * while {@link #next} and {@link #previous} iterate through characters * in the normalized output. This means that there is not * necessarily a one-to-one correspondence between characters returned * by next and previous and the indices passed to and * returned from setIndex and {@link #getIndex}. *

* @param index the desired index in the input text. * * @return the first normalized character that is the result of iterating * forward starting at the given index. * * @throws IllegalArgumentException if the given index is less than * {@link #getBeginIndex} or greater than {@link #getEndIndex}. * @deprecated ICU 3.2 * @obsolete ICU 3.2 */ @Deprecated ///CLOVER:OFF public int setIndex(int index) { setIndexOnly(index); return current(); } ///CLOVER:ON /** * Retrieve the index of the start of the input text. This is the begin * index of the CharacterIterator or the start (i.e. 0) of the * String over which this Normalizer is iterating * @deprecated ICU 2.2. Use startIndex() instead. * @return The codepoint as an int * @see #startIndex */ @Deprecated public int getBeginIndex() { return 0; } /** * Retrieve the index of the end of the input text. This is the end index * of the CharacterIterator or the length of the String * over which this Normalizer is iterating * @deprecated ICU 2.2. Use endIndex() instead. * @return The codepoint as an int * @see #endIndex */ @Deprecated public int getEndIndex() { return endIndex(); } /** * Return the first character in the normalized text. This resets * the Normalizer's position to the beginning of the text. * @return The codepoint as an int * @stable ICU 2.8 */ public int first() { reset(); return next(); } /** * Return the last character in the normalized text. This resets * the Normalizer's position to be just before the * the input text corresponding to that normalized character. * @return The codepoint as an int * @stable ICU 2.8 */ public int last() { text.setToLimit(); currentIndex=nextIndex=text.getIndex(); clearBuffer(); return previous(); } /** * Retrieve the current iteration position in the input text that is * being normalized. This method is useful in applications such as * searching, where you need to be able to determine the position in * the input text that corresponds to a given normalized output character. *

* Note: This method sets the position in the input, while * {@link #next} and {@link #previous} iterate through characters in the * output. This means that there is not necessarily a one-to-one * correspondence between characters returned by next and * previous and the indices passed to and returned from * setIndex and {@link #getIndex}. * @return The current iteration position * @stable ICU 2.8 */ public int getIndex() { if(bufferPosCharacterIterator or the start (i.e. 0) of the * String over which this Normalizer is iterating * @return The current iteration position * @stable ICU 2.8 */ public int startIndex() { return 0; } /** * Retrieve the index of the end of the input text. This is the end index * of the CharacterIterator or the length of the String * over which this Normalizer is iterating * @return The current iteration position * @stable ICU 2.8 */ public int endIndex() { return text.getLength(); } //------------------------------------------------------------------------- // Iterator attributes //------------------------------------------------------------------------- /** * Set the normalization mode for this object. *

* Note:If the normalization mode is changed while iterating * over a string, calls to {@link #next} and {@link #previous} may * return previously buffers characters in the old normalization mode * until the iteration is able to re-sync at the next base character. * It is safest to call {@link #setText setText()}, {@link #first}, * {@link #last}, etc. after calling setMode. *

* @param newMode the new mode for this Normalizer. * The supported modes are: *

    *
  • {@link #NFC} - Unicode canonical decompositiion * followed by canonical composition. *
  • {@link #NFKC} - Unicode compatibility decompositiion * follwed by canonical composition. *
  • {@link #NFD} - Unicode canonical decomposition *
  • {@link #NFKD} - Unicode compatibility decomposition. *
  • {@link #NONE} - Do nothing but return characters * from the underlying input text. *
* * @see #getMode * @stable ICU 2.8 */ public void setMode(Mode newMode) { mode = newMode; norm2 = mode.getNormalizer2(options); } /** * Return the basic operation performed by this Normalizer * * @see #setMode * @stable ICU 2.8 */ public Mode getMode() { return mode; } /** * Set options that affect this Normalizer's operation. * Options do not change the basic composition or decomposition operation * that is being performed , but they control whether * certain optional portions of the operation are done. * Currently the only available option is: *

*

    *
  • {@link #UNICODE_3_2} - Use Normalization conforming to Unicode version 3.2. *
*

* @param option the option whose value is to be set. * @param value the new setting for the option. Use true to * turn the option on and false to turn it off. * * @see #getOption * @stable ICU 2.6 */ public void setOption(int option,boolean value) { if (value) { options |= option; } else { options &= (~option); } norm2 = mode.getNormalizer2(options); } /** * Determine whether an option is turned on or off. *

* @see #setOption * @stable ICU 2.6 */ public int getOption(int option) { if((options & option)!=0) { return 1 ; } else { return 0; } } /** * Gets the underlying text storage * @param fillIn the char buffer to fill the UTF-16 units. * The length of the buffer should be equal to the length of the * underlying text storage * @throws IndexOutOfBoundsException If the index passed for the array is invalid. * @see #getLength * @stable ICU 2.8 */ public int getText(char[] fillIn) { return text.getText(fillIn); } /** * Gets the length of underlying text storage * @return the length * @stable ICU 2.8 */ public int getLength() { return text.getLength(); } /** * Returns the text under iteration as a string * @return a copy of the text under iteration. * @stable ICU 2.8 */ public String getText() { return text.getText(); } /** * Set the input text over which this Normalizer will iterate. * The iteration position is set to the beginning of the input text. * @param newText The new string to be normalized. * @stable ICU 2.8 */ public void setText(StringBuffer newText) { UCharacterIterator newIter = UCharacterIterator.getInstance(newText); if (newIter == null) { throw new IllegalStateException("Could not create a new UCharacterIterator"); } text = newIter; reset(); } /** * Set the input text over which this Normalizer will iterate. * The iteration position is set to the beginning of the input text. * @param newText The new string to be normalized. * @stable ICU 2.8 */ public void setText(char[] newText) { UCharacterIterator newIter = UCharacterIterator.getInstance(newText); if (newIter == null) { throw new IllegalStateException("Could not create a new UCharacterIterator"); } text = newIter; reset(); } /** * Set the input text over which this Normalizer will iterate. * The iteration position is set to the beginning of the input text. * @param newText The new string to be normalized. * @stable ICU 2.8 */ public void setText(String newText) { UCharacterIterator newIter = UCharacterIterator.getInstance(newText); if (newIter == null) { throw new IllegalStateException("Could not create a new UCharacterIterator"); } text = newIter; reset(); } /** * Set the input text over which this Normalizer will iterate. * The iteration position is set to the beginning of the input text. * @param newText The new string to be normalized. * @stable ICU 2.8 */ public void setText(CharacterIterator newText) { UCharacterIterator newIter = UCharacterIterator.getInstance(newText); if (newIter == null) { throw new IllegalStateException("Could not create a new UCharacterIterator"); } text = newIter; reset(); } /** * Set the input text over which this Normalizer will iterate. * The iteration position is set to the beginning of the string. * @param newText The new string to be normalized. * @stable ICU 2.8 */ public void setText(UCharacterIterator newText) { try{ UCharacterIterator newIter = (UCharacterIterator)newText.clone(); if (newIter == null) { throw new IllegalStateException("Could not create a new UCharacterIterator"); } text = newIter; reset(); }catch(CloneNotSupportedException e) { throw new ICUCloneNotSupportedException("Could not clone the UCharacterIterator", e); } } private void clearBuffer() { buffer.setLength(0); bufferPos=0; } private boolean nextNormalize() { clearBuffer(); currentIndex=nextIndex; text.setIndex(nextIndex); // Skip at least one character so we make progress. int c=text.nextCodePoint(); if(c<0) { return false; } StringBuilder segment=new StringBuilder().appendCodePoint(c); while((c=text.nextCodePoint())>=0) { if(norm2.hasBoundaryBefore(c)) { text.moveCodePointIndex(-1); break; } segment.appendCodePoint(c); } nextIndex=text.getIndex(); norm2.normalize(segment, buffer); return buffer.length()!=0; } private boolean previousNormalize() { clearBuffer(); nextIndex=currentIndex; text.setIndex(currentIndex); StringBuilder segment=new StringBuilder(); int c; while((c=text.previousCodePoint())>=0) { if(c<=0xffff) { segment.insert(0, (char)c); } else { segment.insert(0, Character.toChars(c)); } if(norm2.hasBoundaryBefore(c)) { break; } } currentIndex=text.getIndex(); norm2.normalize(segment, buffer); bufferPos=buffer.length(); return buffer.length()!=0; } /* compare canonically equivalent ------------------------------------------- */ // TODO: Broaden the public compare(String, String, options) API like this. Ticket #7407 private static int internalCompare(CharSequence s1, CharSequence s2, int options) { int normOptions=options>>>COMPARE_NORM_OPTIONS_SHIFT; options|= COMPARE_EQUIV; /* * UAX #21 Case Mappings, as fixed for Unicode version 4 * (see Jitterbug 2021), defines a canonical caseless match as * * A string X is a canonical caseless match * for a string Y if and only if * NFD(toCasefold(NFD(X))) = NFD(toCasefold(NFD(Y))) * * For better performance, we check for FCD (or let the caller tell us that * both strings are in FCD) for the inner normalization. * BasicNormalizerTest::FindFoldFCDExceptions() makes sure that * case-folding preserves the FCD-ness of a string. * The outer normalization is then only performed by NormalizerImpl.cmpEquivFold() * when there is a difference. * * Exception: When using the Turkic case-folding option, we do perform * full NFD first. This is because in the Turkic case precomposed characters * with 0049 capital I or 0069 small i fold differently whether they * are first decomposed or not, so an FCD check - a check only for * canonical order - is not sufficient. */ if((options&INPUT_IS_FCD)==0 || (options&FOLD_CASE_EXCLUDE_SPECIAL_I)!=0) { Normalizer2 n2; if((options&FOLD_CASE_EXCLUDE_SPECIAL_I)!=0) { n2=NFD.getNormalizer2(normOptions); } else { n2=FCD.getNormalizer2(normOptions); } // check if s1 and/or s2 fulfill the FCD conditions int spanQCYes1=n2.spanQuickCheckYes(s1); int spanQCYes2=n2.spanQuickCheckYes(s2); /* * ICU 2.4 had a further optimization: * If both strings were not in FCD, then they were both NFD'ed, * and the COMPARE_EQUIV option was turned off. * It is not entirely clear that this is valid with the current * definition of the canonical caseless match. * Therefore, ICU 2.6 removes that optimization. */ if(spanQCYes1=0 && c2>=0 */ /* get complete code points for c1, c2 for lookups if either is a surrogate */ cp1=c1; if(UTF16.isSurrogate((char)c1)) { char c; if(Normalizer2Impl.UTF16Plus.isSurrogateLead(c1)) { if(s1!=limit1 && Character.isLowSurrogate(c=cs1.charAt(s1))) { /* advance ++s1; only below if cp1 decomposes/case-folds */ cp1=Character.toCodePoint((char)c1, c); } } else /* isTrail(c1) */ { if(0<=(s1-2) && Character.isHighSurrogate(c=cs1.charAt(s1-2))) { cp1=Character.toCodePoint(c, (char)c1); } } } cp2=c2; if(UTF16.isSurrogate((char)c2)) { char c; if(Normalizer2Impl.UTF16Plus.isSurrogateLead(c2)) { if(s2!=limit2 && Character.isLowSurrogate(c=cs2.charAt(s2))) { /* advance ++s2; only below if cp2 decomposes/case-folds */ cp2=Character.toCodePoint((char)c2, c); } } else /* isTrail(c2) */ { if(0<=(s2-2) && Character.isHighSurrogate(c=cs2.charAt(s2-2))) { cp2=Character.toCodePoint(c, (char)c2); } } } /* * go down one level for each string * continue with the main loop as soon as there is a real change */ if( level1==0 && (options&COMPARE_IGNORE_CASE)!=0 && (length=csp.toFullFolding(cp1, fold1, options))>=0 ) { /* cp1 case-folds to the code point "length" or to p[length] */ if(UTF16.isSurrogate((char)c1)) { if(Normalizer2Impl.UTF16Plus.isSurrogateLead(c1)) { /* advance beyond source surrogate pair if it case-folds */ ++s1; } else /* isTrail(c1) */ { /* * we got a supplementary code point when hitting its trail surrogate, * therefore the lead surrogate must have been the same as in the other string; * compare this decomposition with the lead surrogate in the other string * remember that this simulates bulk text replacement: * the decomposition would replace the entire code point */ --s2; c2=cs2.charAt(s2-1); } } /* push current level pointers */ if(stack1==null) { stack1=createCmpEquivLevelStack(); } stack1[0].cs=cs1; stack1[0].s=s1; ++level1; /* copy the folding result to fold1[] */ /* Java: the buffer was probably not empty, remove the old contents */ if(length<=UCaseProps.MAX_STRING_LENGTH) { fold1.delete(0, fold1.length()-length); } else { fold1.setLength(0); fold1.appendCodePoint(length); } /* set next level pointers to case folding */ cs1=fold1; s1=0; limit1=fold1.length(); /* get ready to read from decomposition, continue with loop */ c1=-1; continue; } if( level2==0 && (options&COMPARE_IGNORE_CASE)!=0 && (length=csp.toFullFolding(cp2, fold2, options))>=0 ) { /* cp2 case-folds to the code point "length" or to p[length] */ if(UTF16.isSurrogate((char)c2)) { if(Normalizer2Impl.UTF16Plus.isSurrogateLead(c2)) { /* advance beyond source surrogate pair if it case-folds */ ++s2; } else /* isTrail(c2) */ { /* * we got a supplementary code point when hitting its trail surrogate, * therefore the lead surrogate must have been the same as in the other string; * compare this decomposition with the lead surrogate in the other string * remember that this simulates bulk text replacement: * the decomposition would replace the entire code point */ --s1; c1=cs1.charAt(s1-1); } } /* push current level pointers */ if(stack2==null) { stack2=createCmpEquivLevelStack(); } stack2[0].cs=cs2; stack2[0].s=s2; ++level2; /* copy the folding result to fold2[] */ /* Java: the buffer was probably not empty, remove the old contents */ if(length<=UCaseProps.MAX_STRING_LENGTH) { fold2.delete(0, fold2.length()-length); } else { fold2.setLength(0); fold2.appendCodePoint(length); } /* set next level pointers to case folding */ cs2=fold2; s2=0; limit2=fold2.length(); /* get ready to read from decomposition, continue with loop */ c2=-1; continue; } if( level1<2 && (options&COMPARE_EQUIV)!=0 && (decomp1=nfcImpl.getDecomposition(cp1))!=null ) { /* cp1 decomposes into p[length] */ if(UTF16.isSurrogate((char)c1)) { if(Normalizer2Impl.UTF16Plus.isSurrogateLead(c1)) { /* advance beyond source surrogate pair if it decomposes */ ++s1; } else /* isTrail(c1) */ { /* * we got a supplementary code point when hitting its trail surrogate, * therefore the lead surrogate must have been the same as in the other string; * compare this decomposition with the lead surrogate in the other string * remember that this simulates bulk text replacement: * the decomposition would replace the entire code point */ --s2; c2=cs2.charAt(s2-1); } } /* push current level pointers */ if(stack1==null) { stack1=createCmpEquivLevelStack(); } stack1[level1].cs=cs1; stack1[level1].s=s1; ++level1; /* set empty intermediate level if skipped */ if(level1<2) { stack1[level1++].cs=null; } /* set next level pointers to decomposition */ cs1=decomp1; s1=0; limit1=decomp1.length(); /* get ready to read from decomposition, continue with loop */ c1=-1; continue; } if( level2<2 && (options&COMPARE_EQUIV)!=0 && (decomp2=nfcImpl.getDecomposition(cp2))!=null ) { /* cp2 decomposes into p[length] */ if(UTF16.isSurrogate((char)c2)) { if(Normalizer2Impl.UTF16Plus.isSurrogateLead(c2)) { /* advance beyond source surrogate pair if it decomposes */ ++s2; } else /* isTrail(c2) */ { /* * we got a supplementary code point when hitting its trail surrogate, * therefore the lead surrogate must have been the same as in the other string; * compare this decomposition with the lead surrogate in the other string * remember that this simulates bulk text replacement: * the decomposition would replace the entire code point */ --s1; c1=cs1.charAt(s1-1); } } /* push current level pointers */ if(stack2==null) { stack2=createCmpEquivLevelStack(); } stack2[level2].cs=cs2; stack2[level2].s=s2; ++level2; /* set empty intermediate level if skipped */ if(level2<2) { stack2[level2++].cs=null; } /* set next level pointers to decomposition */ cs2=decomp2; s2=0; limit2=decomp2.length(); /* get ready to read from decomposition, continue with loop */ c2=-1; continue; } /* * no decomposition/case folding, max level for both sides: * return difference result * * code point order comparison must not just return cp1-cp2 * because when single surrogates are present then the surrogate pairs * that formed cp1 and cp2 may be from different string indexes * * example: { d800 d800 dc01 } vs. { d800 dc00 }, compare at second code units * c1=d800 cp1=10001 c2=dc00 cp2=10000 * cp1-cp2>0 but c1-c2<0 and in fact in UTF-32 it is { d800 10001 } < { 10000 } * * therefore, use same fix-up as in ustring.c/uprv_strCompare() * except: uprv_strCompare() fetches c=*s while this functions fetches c=*s++ * so we have slightly different pointer/start/limit comparisons here */ if(c1>=0xd800 && c2>=0xd800 && (options&COMPARE_CODE_POINT_ORDER)!=0) { /* subtract 0x2800 from BMP code points to make them smaller than supplementary ones */ if( (c1<=0xdbff && s1!=limit1 && Character.isLowSurrogate(cs1.charAt(s1))) || (Character.isLowSurrogate((char)c1) && 0!=(s1-1) && Character.isHighSurrogate(cs1.charAt(s1-2))) ) { /* part of a surrogate pair, leave >=d800 */ } else { /* BMP code point - may be surrogate code point - make =d800 */ } else { /* BMP code point - may be surrogate code point - make * An overflow is only reported at the end, for the old Normalizer API functions that write * to char arrays. */ private static final class CharsAppendable implements Appendable { public CharsAppendable(char[] dest, int destStart, int destLimit) { chars=dest; start=offset=destStart; limit=destLimit; } public int length() { int len=offset-start; if(offset<=limit) { return len; } else { throw new IndexOutOfBoundsException(Integer.toString(len)); } } public Appendable append(char c) { if(offset





© 2015 - 2024 Weber Informatics LLC | Privacy Policy