package edu.stanford.nlp.process;

// Stanford English Tokenizer -- a deterministic, fast high-quality tokenizer
// Copyright (c) 2002-2009 The Board of Trustees of
// The Leland Stanford Junior University. All Rights Reserved.
//
// This program is free software; you can redistribute it and/or
// modify it under the terms of the GNU General Public License
// as published by the Free Software Foundation; either version 2
// of the License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA  02111-1307, USA.
//
// For more information, bug reports, fixes, contact:
//    Christopher Manning
//    Dept of Computer Science, Gates 1A
//    Stanford CA 94305-9010
//    USA
//    [email protected]
//    http://nlp.stanford.edu/software/


import java.io.Reader;
import java.util.Map;
import java.util.Properties;
import java.util.Set;
import java.util.regex.Pattern;

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.util.StringUtils;
import edu.stanford.nlp.util.logging.Redwood;


/** Provides a tokenizer or lexer that does a pretty good job at
 *  deterministically tokenizing English according to Penn Treebank conventions.
 *  The class is a scanner generated by
 *  JFlex from the specification file
 *  PTBLexer.flex.  As well as copying what is in the Treebank,
 *  it now contains some extensions to deal with modern text and encoding
 *  issues, such as recognizing URLs and common Unicode characters, and a
 *  variety of options for doing or suppressing certain normalizations.
 *  Although they shouldn't really be there, it also interprets certain of the
 *  characters between U+0080 and U+009F as Windows CP1252 characters.
 *  

 *  <p>
 *  Fine points: Output normalized tokens should not contain spaces,
 *  provided the normalizeSpace option is true.  Any such space is turned
 *  into a non-breaking space (U+00A0).  Otherwise, spaces can appear in
 *  a couple of token classes (phone numbers, fractions).  The original
 *  PTB tokenization (messy) standard also escapes certain other characters,
 *  such as * and /, and normalizes things like " to `` or ''.  By default,
 *  this tokenizer does all of those things.  However, you can turn them all
 *  off by using the ptb3Escaping=false option, turn individual normalizations
 *  on or off, or turn on Unicode character alternatives with other options.
 *  You can also build an invertible tokenizer, with which you can still
 *  access the original character sequence and the non-token whitespace
 *  around it in a CoreLabel.  And you can ask for newlines to be tokenized.
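 *  <p>
 *  For example, here is an illustrative sketch (not code from this file; it
 *  assumes the PTBTokenizer.factory(LexedTokenFactory, String) API documented
 *  in {@link PTBTokenizer}, the CoreLabelTokenFactory in this package, and
 *  java.io.StringReader) of building an invertible tokenizer with PTB3
 *  escaping turned off:
 *  <pre>{@code
 *  TokenizerFactory<CoreLabel> tf =
 *      PTBTokenizer.factory(new CoreLabelTokenFactory(),
 *                           "invertible=true,ptb3Escaping=false");
 *  Tokenizer<CoreLabel> tok = tf.getTokenizer(new StringReader("\"I can't,\" she said."));
 *  while (tok.hasNext()) {
 *    CoreLabel w = tok.next();
 *    // word() is the (possibly normalized) token text; OriginalTextAnnotation
 *    // holds the exact input characters it came from.
 *    System.out.println(w.word() + "\t" + w.get(CoreAnnotations.OriginalTextAnnotation.class));
 *  }
 *  }</pre>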

 *  <p>
 *  Character entities: For legacy reasons, this file will parse and interpret
 *  some simple SGML/XML/HTML tags and character entities.  For modern formats
 *  like XML, you are better off doing XML parsing, and then running the
 *  tokenizer on CDATA elements.  But we and others frequently work with simple
 *  SGML text corpora that are not XML (like LDC text collections).  In practice,
 *  they only include very simple markup and a few simple entities, and the
 *  combination of the -parseInside option and the minimal character entity
 *  support in this file is enough to handle them.  So we leave this functionality
 *  in, even though it could conceivably mess with a correct XML file if the
 *  output of decoding had things that look like character entities.  In general,
 *  handled symbols are changed to ASCII/Unicode forms, but handled accented
 *  letters are just left as character entities in words.
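 *  <p>
 *  As a rough illustration (a hedged sketch, not a rule from this file), with
 *  the default options one would expect a simple SGML fragment like the one
 *  below to come out with each tag kept as a single token, the ampersand
 *  entity decoded, and the apostrophe entity folded into the clitic token:
 *  <pre>{@code
 *  input:   <DOC id="1"> AT&amp;T isn&apos;t here </DOC>
 *  tokens:  <DOC id="1"> | AT&T | is | n't | here | </DOC>
 *  }</pre>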

 *  <p>
 *  Character support: PTBLexer works for a large subset of
 *  Unicode Basic Multilingual Plane characters (only).  It recognizes all
 *  characters that match the JFlex/Java [:letter:] and [:digit:] character
 *  classes (but, unfortunately, JFlex does not support most
 *  other Unicode character classes available in Java regular expressions).
 *  It also matches all defined characters in the Unicode range U+0000-U+07FF,
 *  excluding control characters other than the ones very standardly found in
 *  plain text documents.  Finally, select other characters commonly found in
 *  English Unicode text are included.

 *  <p>
 *  Implementation note: The scanner is caseless, but note, if adding
 *  or changing regexps, that caseless does not expand inside character
 *  classes.  From the manual: "The %caseless option does not change the
 *  matched text and does not effect character classes.  So [a] still only
 *  matches the character a and not A, too."  Note that some character
 *  classes still deliberately don't have both cases, so the scanner's
 *  operation isn't completely case-independent, though it mostly is.
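 *  <p>
 *  A minimal sketch of that caveat (a hypothetical fragment, not rules from
 *  this file; yytext() is the built-in JFlex matched-text method):
 *  <pre>{@code
 *  %caseless
 *  %%
 *  http     { return yytext(); }   // matches "http", "HTTP", "Http", ...
 *  [xyz]    { return yytext(); }   // still matches only lowercase x, y, or z
 *  }</pre>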

* Implementation note: This Java class is automatically generated * from PTBLexer.flex using jflex. DO NOT EDIT THE JAVA SOURCE. This file * has now been updated for JFlex 1.4.2+. (This required code changes: this * version only works right with JFlex 1.4.2+; the previous version only works * right with JFlex 1.4.1.) * * @author Tim Grow * @author Christopher Manning * @author Jenny Finkel */ %% %class PTBLexer %unicode %function next %type Object %char %caseless %state YyTokenizePerLine YyNotTokenizePerLine %{ /** * Constructs a new PTBLexer. You specify the type of result tokens with a * LexedTokenFactory, and can specify the treatment of tokens by boolean * options given in a comma separated String * (e.g., "invertible,normalizeParentheses=true"). * If the String is null or empty, you get the traditional * PTB3 normalization behaviour (i.e., you get ptb3Escaping=false). If you * want no normalization, then you should pass in the String * "ptb3Escaping=false". See the documentation in the {@link PTBTokenizer} * class for full discussion of all the available options. * * @param r The Reader to tokenize text from * @param tf The LexedTokenFactory that will be invoked to convert * each substring extracted by the lexer into some kind of Object * (such as a Word or CoreLabel). * @param options Options to the tokenizer (see {@link PTBTokenizer}) */ public PTBLexer(Reader r, LexedTokenFactory tf, String options) { this(r); this.tokenFactory = tf; if (options == null) { options = ""; } Properties prop = StringUtils.stringToProperties(options); Set> props = prop.entrySet(); for (Map.Entry item : props) { String key = (String) item.getKey(); String value = (String) item.getValue(); boolean val = Boolean.valueOf(value); if ("".equals(key)) { // allow an empty item } else if ("invertible".equals(key)) { invertible = val; } else if ("tokenizeNLs".equals(key)) { tokenizeNLs = val; } else if ("tokenizePerLine".equals(key)) { tokenizePerLine = val; } else if ("ptb3Escaping".equals(key)) { normalizeSpace = val; normalizeAmpersandEntity = val; normalizeCurrency = val; normalizeFractions = val; normalizeParentheses = val; normalizeOtherBrackets = val; latexQuotes = val; unicodeQuotes = val; asciiQuotes = val; ptb3Ellipsis = val; unicodeEllipsis = val; ptb3Dashes = val; } else if ("americanize".equals(key)) { americanize = val; } else if ("normalizeSpace".equals(key)) { normalizeSpace = val; } else if ("normalizeAmpersandEntity".equals(key)) { normalizeAmpersandEntity = val; } else if ("normalizeCurrency".equals(key)) { normalizeCurrency = val; } else if ("normalizeFractions".equals(key)) { normalizeFractions = val; } else if ("normalizeParentheses".equals(key)) { normalizeParentheses = val; } else if ("normalizeOtherBrackets".equals(key)) { normalizeOtherBrackets = val; } else if ("latexQuotes".equals(key)) { latexQuotes = val; } else if ("unicodeQuotes".equals(key)) { unicodeQuotes = val; if (val) { latexQuotes = false; // need to override default } } else if ("asciiQuotes".equals(key)) { asciiQuotes = val; if (val) { latexQuotes = false; // need to override default unicodeQuotes = false; } } else if ("splitAssimilations".equals(key)) { splitAssimilations = val; } else if ("splitHyphenated".equals(key)) { splitHyphenated = val; } else if ("ptb3Ellipsis".equals(key)) { ptb3Ellipsis = val; } else if ("unicodeEllipsis".equals(key)) { unicodeEllipsis = val; } else if ("ptb3Dashes".equals(key)) { ptb3Dashes = val; } else if ("escapeForwardSlashAsterisk".equals(key)) { escapeForwardSlashAsterisk = val; } else 
if ("untokenizable".equals(key)) { switch (value) { case "noneDelete": untokenizable = UntokenizableOptions.NONE_DELETE; break; case "firstDelete": untokenizable = UntokenizableOptions.FIRST_DELETE; break; case "allDelete": untokenizable = UntokenizableOptions.ALL_DELETE; break; case "noneKeep": untokenizable = UntokenizableOptions.NONE_KEEP; break; case "firstKeep": untokenizable = UntokenizableOptions.FIRST_KEEP; break; case "allKeep": untokenizable = UntokenizableOptions.ALL_KEEP; break; default: throw new IllegalArgumentException("PTBLexer: Invalid option value in constructor: " + key + ": " + value); } } else if ("strictTreebank3".equals(key)) { strictTreebank3 = val; } else { throw new IllegalArgumentException("PTBLexer: Invalid options key in constructor: " + key); } } // this.seenUntokenizableCharacter = false; // unnecessary, it's default initialized if (invertible) { if ( ! (tf instanceof CoreLabelTokenFactory)) { throw new IllegalArgumentException("PTBLexer: the invertible option requires a CoreLabelTokenFactory"); } prevWord = (CoreLabel) tf.makeToken("", 0, 0); prevWordAfter = new StringBuilder(); } if (tokenizePerLine) { yybegin(YyTokenizePerLine); } else { yybegin(YyNotTokenizePerLine); } } /** A logger for this class */ private static final Redwood.RedwoodChannels logger = Redwood.channels(PTBLexer.class); private LexedTokenFactory tokenFactory; private CoreLabel prevWord; private StringBuilder prevWordAfter; private boolean seenUntokenizableCharacter; private enum UntokenizableOptions { NONE_DELETE, FIRST_DELETE, ALL_DELETE, NONE_KEEP, FIRST_KEEP, ALL_KEEP } private UntokenizableOptions untokenizable = UntokenizableOptions.FIRST_DELETE; /* Flags begin with historical ptb3Escaping behavior */ private boolean invertible; private boolean tokenizeNLs; private boolean tokenizePerLine; private boolean americanize = false; private boolean normalizeSpace = true; private boolean normalizeAmpersandEntity = true; private boolean normalizeCurrency = true; private boolean normalizeFractions = true; private boolean normalizeParentheses = true; private boolean normalizeOtherBrackets = true; private boolean latexQuotes = true; private boolean unicodeQuotes; private boolean asciiQuotes; private boolean ptb3Ellipsis = true; private boolean unicodeEllipsis; private boolean ptb3Dashes = true; private boolean escapeForwardSlashAsterisk = false; private boolean strictTreebank3 = false; private boolean splitAssimilations = true; private boolean splitHyphenated = false; /* * This has now been extended to cover the main Windows CP1252 characters, * at either their correct Unicode codepoints, or in their invalid * positions as 8 bit chars inside the iso-8859 control region. * * ellipsis 85 0133 2026 8230 * single quote curly starting 91 0145 2018 8216 * single quote curly ending 92 0146 2019 8217 * double quote curly starting 93 0147 201C 8220 * double quote curly ending 94 0148 201D 8221 * bullet 95 * en dash 96 0150 2013 8211 * em dash 97 0151 2014 8212 */ /* Bracket characters and forward slash and asterisk: * * Original Treebank 3 WSJ * Uses -LRB- -RRB- as the representation for ( ) and -LCB- -RCB- as the representation for { }. * There are no occurrences of [ ], though there is some mention of -LSB- -RSB- in early documents. * There are no occurrences of < >. * All brackets are tagged -LRB- -RRB- [This stays constant.] 
* Forward slash and asterisk are escaped by a preceding \ (as \/ and \*) * * Treebank 3 Brown corpus * Has -LRB- -RRB- * Has a few instances of unescaped [ ] in compounds (the token "A[fj]" * Neither forward slash or asterisk appears. * * Ontonotes (r4) * Uses -LRB- -RRB- -LCB- -RCB- -LSB- -RSB-. * Has a very few uses of < and > in longer tokens, which are not escaped. * Slash is not escaped. Asterisk is not escaped. * * LDC2012T13-eng_web_tbk (Google web treebank) * Has -LRB- -RRB- * Has { and } used unescaped, treated as brackets. * Has < and > used unescaped, sometimes treated as brackets. Sometimes << and >> are treated as brackets! * Has [ and ] used unescaped, treated as brackets. * Slash is not escaped. Asterisk is not escaped. * * Reasonable conclusions for now: * - Never escape < > * - Still by default escape [ ] { } but it can be turned off. Use -LSB- -RSB- -LCB- -RCB-. * Move to not escaping slash and asterisk, and delete escaping in PennTreeReader. */ public static final String openparen = "-LRB-"; public static final String closeparen = "-RRB-"; public static final String openbrace = "-LCB-"; public static final String closebrace = "-RCB-"; public static final String ptbmdash = "--"; public static final String ptb3EllipsisStr = "..."; public static final String unicodeEllipsisStr = "\u2026"; /** For tokenizing carriage returns. (JS) */ public static final String NEWLINE_TOKEN = "*NL*"; /* This pattern now also include newlines, since we sometimes allow them in SGML tokens....*/ private static final Pattern SINGLE_SPACE_PATTERN = Pattern.compile("[ \r\n]"); private static final Pattern LEFT_PAREN_PATTERN = Pattern.compile("\\("); private static final Pattern RIGHT_PAREN_PATTERN = Pattern.compile("\\)"); private static final Pattern AMP_PATTERN = Pattern.compile("(?i:&)"); private static final Pattern ONE_FOURTH_PATTERN = Pattern.compile("\u00BC"); private static final Pattern ONE_HALF_PATTERN = Pattern.compile("\u00BD"); private static final Pattern THREE_FOURTHS_PATTERN = Pattern.compile("\u00BE"); private static final Pattern ONE_THIRD_PATTERN = Pattern.compile("\u2153"); private static final Pattern TWO_THIRDS_PATTERN = Pattern.compile("\u2154"); private Object normalizeFractions(final String in) { String out = in; if (normalizeFractions) { if (escapeForwardSlashAsterisk) { out = ONE_FOURTH_PATTERN.matcher(out).replaceAll("1\\\\/4"); out = ONE_HALF_PATTERN.matcher(out).replaceAll("1\\\\/2"); out = THREE_FOURTHS_PATTERN.matcher(out).replaceAll("3\\\\/4"); out = ONE_THIRD_PATTERN.matcher(out).replaceAll("1\\\\/3"); out = TWO_THIRDS_PATTERN.matcher(out).replaceAll("2\\\\/3"); } else { out = ONE_FOURTH_PATTERN.matcher(out).replaceAll("1/4"); out = ONE_HALF_PATTERN.matcher(out).replaceAll("1/2"); out = THREE_FOURTHS_PATTERN.matcher(out).replaceAll("3/4"); out = ONE_THIRD_PATTERN.matcher(out).replaceAll("1/3"); out = TWO_THIRDS_PATTERN.matcher(out).replaceAll("2/3"); } } // System.err.println("normalizeFractions="+normalizeFractions+", escapeForwardSlashAsterisk="+escapeForwardSlashAsterisk); // System.err.println("Mapped |"+in+"| to |" + out + "|."); return getNext(out, in); } private void breakByHyphens(String in) { if (splitHyphenated) { int firstHyphen = in.indexOf('-'); yypushback(in.length() - firstHyphen); } } private static String removeSoftHyphens(String in) { // \u00AD is the soft hyphen character, which we remove, regarding it as inserted only for line-breaking if (in.indexOf('\u00AD') < 0) { // shortcut doing work return in; } int length = in.length(); StringBuilder 
out = new StringBuilder(length - 1); for (int i = 0; i < length; i++) { char ch = in.charAt(i); if (ch != '\u00AD') { out.append(ch); } } if (out.length() == 0) { out.append('-'); // don't create an empty token } return out.toString(); } private static final Pattern CENTS_PATTERN = Pattern.compile("\u00A2"); private static final Pattern POUND_PATTERN = Pattern.compile("\u00A3"); private static final Pattern GENERIC_CURRENCY_PATTERN = Pattern.compile("[\u0080\u00A4\u20A0\u20AC]"); private static String normalizeCurrency(String in) { String s1 = in; s1 = CENTS_PATTERN.matcher(s1).replaceAll("cents"); s1 = POUND_PATTERN.matcher(s1).replaceAll("#"); // historically used for pound in PTB3 s1 = GENERIC_CURRENCY_PATTERN.matcher(s1).replaceAll("\\$"); // Euro (ECU, generic currency) -- no good translation! return s1; } private static final Pattern singleQuote = Pattern.compile("'|'"); private static final Pattern doubleQuote = Pattern.compile("\"|""); // 91,92,93,94 aren't valid unicode points, but sometimes they show // up from cp1252 and need to be translated private static final Pattern leftSingleQuote = Pattern.compile("[\u0091\u2018\u201B\u2039]"); private static final Pattern rightSingleQuote = Pattern.compile("[\u0092\u2019\u203A]"); private static final Pattern leftDoubleQuote = Pattern.compile("[\u0093\u201C\u00AB]"); private static final Pattern rightDoubleQuote = Pattern.compile("[\u0094\u201D\u00BB]"); private static String latexQuotes(String in, boolean probablyLeft) { String s1 = in; if (probablyLeft) { s1 = singleQuote.matcher(s1).replaceAll("`"); s1 = doubleQuote.matcher(s1).replaceAll("``"); } else { s1 = singleQuote.matcher(s1).replaceAll("'"); s1 = doubleQuote.matcher(s1).replaceAll("''"); } s1 = leftSingleQuote.matcher(s1).replaceAll("`"); s1 = rightSingleQuote.matcher(s1).replaceAll("'"); s1 = leftDoubleQuote.matcher(s1).replaceAll("``"); s1 = rightDoubleQuote.matcher(s1).replaceAll("''"); return s1; } private static final Pattern asciiSingleQuote = Pattern.compile("'|[\u0091\u2018\u0092\u2019\u201A\u201B\u2039\u203A']"); private static final Pattern asciiDoubleQuote = Pattern.compile(""|[\u0093\u201C\u0094\u201D\u201E\u00AB\u00BB\"]"); private static String asciiQuotes(String in) { String s1 = in; s1 = asciiSingleQuote.matcher(s1).replaceAll("'"); s1 = asciiDoubleQuote.matcher(s1).replaceAll("\""); return s1; } private static final Pattern unicodeLeftSingleQuote = Pattern.compile("\u0091"); private static final Pattern unicodeRightSingleQuote = Pattern.compile("\u0092"); private static final Pattern unicodeLeftDoubleQuote = Pattern.compile("\u0093"); private static final Pattern unicodeRightDoubleQuote = Pattern.compile("\u0094"); private static String unicodeQuotes(String in, boolean probablyLeft) { String s1 = in; if (probablyLeft) { s1 = singleQuote.matcher(s1).replaceAll("\u2018"); s1 = doubleQuote.matcher(s1).replaceAll("\u201c"); } else { s1 = singleQuote.matcher(s1).replaceAll("\u2019"); s1 = doubleQuote.matcher(s1).replaceAll("\u201d"); } s1 = unicodeLeftSingleQuote.matcher(s1).replaceAll("\u2018"); s1 = unicodeRightSingleQuote.matcher(s1).replaceAll("\u2019"); s1 = unicodeLeftDoubleQuote.matcher(s1).replaceAll("\u201c"); s1 = unicodeRightDoubleQuote.matcher(s1).replaceAll("\u201d"); return s1; } private Object handleQuotes(String tok, boolean probablyLeft) { String normTok; if (latexQuotes) { normTok = latexQuotes(tok, probablyLeft); } else if (unicodeQuotes) { normTok = unicodeQuotes(tok, probablyLeft); } else if (asciiQuotes) { normTok = asciiQuotes(tok); } else { 
normTok = tok; } // System.err.printf("handleQuotes changed %s to %s.%n", tok, normTok); return getNext(normTok, tok); } private Object handleEllipsis(final String tok) { if (ptb3Ellipsis) { return getNext(ptb3EllipsisStr, tok); } else if (unicodeEllipsis) { return getNext(unicodeEllipsisStr, tok); } else { return getNext(tok, tok); } } /** This quotes a character with a backslash, but doesn't do it * if the character is already preceded by a backslash. */ private static String delimit(String s, char c) { int i = s.indexOf(c); while (i != -1) { if (i == 0 || s.charAt(i - 1) != '\\') { s = s.substring(0, i) + '\\' + s.substring(i); i = s.indexOf(c, i + 2); } else { i = s.indexOf(c, i + 1); } } return s; } private static String normalizeAmp(final String in) { return AMP_PATTERN.matcher(in).replaceAll("&"); } private int indexOfSpace(String txt) { for (int i = 0, len = txt.length(); i < len; i++) { char ch = txt.charAt(i); if (ch == ' ' || ch == '\u00A0') { return i; } } return -1; } private Object getNext() { final String txt = yytext(); return getNext(txt, txt); } /** Make the next token. * @param txt What the token should be * @param originalText The original String that got transformed into txt */ private Object getNext(String txt, String originalText) { if (invertible) { String str = prevWordAfter.toString(); prevWordAfter.setLength(0); CoreLabel word = (CoreLabel) tokenFactory.makeToken(txt, yychar, yylength()); word.set(CoreAnnotations.OriginalTextAnnotation.class, originalText); word.set(CoreAnnotations.BeforeAnnotation.class, str); prevWord.set(CoreAnnotations.AfterAnnotation.class, str); prevWord = word; return word; } else { return tokenFactory.makeToken(txt, yychar, yylength()); } } private Object getNormalizedAmpNext() { final String txt = yytext(); if (normalizeAmpersandEntity) { return getNext(normalizeAmp(txt), txt); } else { return getNext(); } } private void fixJFlex4SpaceAfterTokenBug() { // try to work around an apparent jflex bug where it // gets a space at the token end by getting // wrong the length of the trailing context. while (yylength() > 0) { char last = yycharat(yylength()-1); if (last == ' ' || last == '\t' || (last >= '\n' && last <= '\r' || last == '\u0085')) { yypushback(1); } else { break; } } } private Object processAcronym() { fixJFlex4SpaceAfterTokenBug(); String s; if (yylength() == 2) { // "I.", etc. yypushback(1); // return a period next time; s = yytext(); // return the word without the final period } else if (strictTreebank3 && ! "U.S.".equals(yytext())) { yypushback(1); // return a period for next time s = yytext(); // return the word without the final period } else { s = yytext(); // return the word WITH the final period yypushback(1); // (reduplication:) also return a period for next time } return getNext(s, yytext()); } private Object processAbbrev3() { fixJFlex4SpaceAfterTokenBug(); return getNext(); } private Object processAbbrev1() { String s; if (strictTreebank3 && ! "U.S.".equals(yytext())) { yypushback(1); // return a period for next time s = yytext(); } else { s = yytext(); yypushback(1); // return a period for next time } return getNext(s, yytext()); } %} /* Todo: Really SGML shouldn't be here at all, it's kind of legacy. But we continue to tokenize some simple standard forms of concrete SGML syntax, since it tends to give robustness. 
*/ /* --- ( +([A-Za-z][A-Za-z0-9:.-]*( *= *['\"][^\r\n'\"]*['\"])?|['\"][^\r\n'\"]*['\"]| *\/))* SGML = <([!?][A-Za-z-][^>\r\n]*|\/?[A-Za-z][A-Za-z0-9:.-]*([ ]+([A-Za-z][A-Za-z0-9:.-]*([ ]*=[ ]*['\"][^\r\n'\"]*['\"])?|['\"][^\r\n'\"]*['\"]|[ ]*\/))*[ ]*)> ( +[A-Za-z][A-Za-z0-9:.-]*)* FOO = ([ ]+[A-Za-z][A-Za-z0-9:.-]*)* SGML = <([!?][A-Za-z-][^>\r\n]*|\/?[A-Za-z][A-Za-z0-9:.-]* *)> --- */ /* SGML = \<([!\?][A-Za-z\-][^>\r\n]*|\/?[A-Za-z][A-Za-z0-9:\.\-]*([ ]+([A-Za-z][A-Za-z0-9_:\.\-]*|[A-Za-z][A-Za-z0-9_:\.\-]*[ ]*=[ ]*['\"][^\r\n'\"]*['\"]|['\"][^\r\n'\"]*['\"]|[ ]*\/))*[ ]*)\> */ // // SGML1 allows attribute value match over newline; SGML2 does not. SGML1 = \<([!\?][A-Za-z\-][^>\r\n]*|[A-Za-z][A-Za-z0-9_:\.\-]*([ ]+([A-Za-z][A-Za-z0-9_:\.\-]*|[A-Za-z][A-Za-z0-9_:\.\-]*[ ]*=[ ]*('[^']*'|\"[^\"]*\"|[A-Za-z][A-Za-z0-9_:\.\-]*)))*[ ]*\/?|\/[A-Za-z][A-Za-z0-9_:\.\-]*)[ ]*\> SGML2 = \<([!\?][A-Za-z\-][^>\r\n]*|[A-Za-z][A-Za-z0-9_:\.\-]*([ ]+([A-Za-z][A-Za-z0-9_:\.\-]*|[A-Za-z][A-Za-z0-9_:\.\-]*[ ]*=[ ]*('[^'\r\n]*'|\"[^\"\r\n]*\"|[A-Za-z][A-Za-z0-9_:\.\-]*)))*[ ]*\/?|\/[A-Za-z][A-Za-z0-9_:\.\-]*)[ ]*\> SPMDASH = &(MD|mdash|ndash);|[\u0096\u0097\u2013\u2014\u2015] SPAMP = & SPPUNC = &(HT|TL|UR|LR|QC|QL|QR|odq|cdq|#[0-9]+); SPLET = &[aeiouAEIOU](acute|grave|uml); /* \u3000 is ideographic space */ SPACE = [ \t\u00A0\u2000-\u200A\u3000] SPACES = {SPACE}+ NEWLINE = \r|\r?\n|\u2028|\u2029|\u000B|\u000C|\u0085 SPACENL = ({SPACE}|{NEWLINE}) SPACENLS = {SPACENL}+ SENTEND1 = {SPACENL}({SPACENL}|[:uppercase:]|{SGML1}) SENTEND2 = {SPACE}({SPACE}|[:uppercase:]|{SGML2}) DIGIT = [:digit:]|[\u07C0-\u07C9] DATE = {DIGIT}{1,2}[\-\/]{DIGIT}{1,2}[\-\/]{DIGIT}{2,4} NUM = {DIGIT}+|{DIGIT}*([.:,\u00AD\u066B\u066C]{DIGIT}+)+ /* Now don't allow bracketed negative numbers! They have too many uses (e.g., years or times in parentheses), and having them in tokens messes up treebank parsing. NUMBER = [\-+]?{NUM}|\({NUM}\) */ NUMBER = [\-+]?{NUM} SUBSUPNUM = [\u207A\u207B\u208A\u208B]?([\u2070\u00B9\u00B2\u00B3\u2074-\u2079]+|[\u2080-\u2089]+) /* Constrain fraction to only match likely fractions. Full one allows hyphen, space, or non-breaking space between integer and fraction part, but strictTreebank3 allows only hyphen. */ FRAC = ({DIGIT}{1,4}[- \u00A0])?{DIGIT}{1,4}(\\?\/|\u2044){DIGIT}{1,4} FRAC2 = [\u00BC\u00BD\u00BE\u2153-\u215E] DOLSIGN = ([A-Z]*\$|#) /* These are cent and pound sign, euro and euro, and Yen, Lira */ DOLSIGN2 = [\u00A2\u00A3\u00A4\u00A5\u0080\u20A0\u20AC\u060B\u0E3F\u20A4\uFFE0\uFFE1\uFFE5\uFFE6] /* not used DOLLAR {DOLSIGN}[ \t]*{NUMBER} */ /* |\( ?{NUMBER} ?\)) # is for pound signs */ /* For some reason U+0237-U+024F (dotless j) isn't in [:letter:]. Recent additions? 
*/ LETTER = ([:letter:]|{SPLET}|[\u00AD\u0237-\u024F\u02C2-\u02C5\u02D2-\u02DF\u02E5-\u02FF\u0300-\u036F\u0370-\u037D\u0384\u0385\u03CF\u03F6\u03FC-\u03FF\u0483-\u0487\u04CF\u04F6-\u04FF\u0510-\u0525\u055A-\u055F\u0591-\u05BD\u05BF\u05C1\u05C2\u05C4\u05C5\u05C7\u0615-\u061A\u063B-\u063F\u064B-\u065E\u0670\u06D6-\u06EF\u06FA-\u06FF\u070F\u0711\u0730-\u074F\u0750-\u077F\u07A6-\u07B1\u07CA-\u07F5\u07FA\u0900-\u0903\u093C\u093E-\u094E\u0951-\u0955\u0962-\u0963\u0981-\u0983\u09BC-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E2\u09E3\u0A01-\u0A03\u0A3C\u0A3E-\u0A4F\u0A81-\u0A83\u0ABC-\u0ACF\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0C01-\u0C03\u0C3E-\u0C56\u0D3E-\u0D44\u0D46-\u0D48\u0E30-\u0E3A\u0E47-\u0E4E\u0EB1-\u0EBC\u0EC8-\u0ECD]) WORD = {LETTER}({LETTER}|{DIGIT})*([.!?]{LETTER}({LETTER}|{DIGIT})*)* FILENAME_EXT = bat|bmp|bz2|c|class|cgi|cpp|dll|doc|docx|exe|gif|gz|h|htm|html|jar|java|jpeg|jpg|mov|mp3|pdf|php|pl|png|ppt|ps|py|sql|tar|txt|wav|x|xml|zip|3gp|wm[va]|avi|flv|mov|mp[34g] FILENAME = ({LETTER}|{DIGIT})+([-._/]({LETTER}|{DIGIT})+)*([.]{FILENAME_EXT}) /* The $ was for things like New$ */ /* WAS: only keep hyphens with short one side like co-ed */ /* But treebank just allows hyphenated things as words! */ THING = ([dDoOlL]{APOSETCETERA}([:letter:]|[:digit:]))?([:letter:]|[:digit:])+({HYPHEN}([dDoOlL]{APOSETCETERA}([:letter:]|[:digit:]))?([:letter:]|[:digit:])+)* THINGA = [A-Z]+(([+&]|{SPAMP})[A-Z]+)+ THING3 = [A-Za-z0-9]+(-[A-Za-z]+){0,2}(\\?\/[A-Za-z0-9]+(-[A-Za-z]+){0,2}){1,2} APOS = ['\u0092\u2019]|' /* ASCII straight quote, single right curly quote in CP1252 (wrong) or Unicode or HTML SGML escape */ /* Includes extra ones that may appear inside a word, rightly or wrongly */ APOSETCETERA = {APOS}|[`\u0091\u2018\u201B] HTHING = ({LETTER}|{DIGIT})[A-Za-z0-9.,\u00AD]*(-([A-Za-z0-9\u00AD]+|{ACRO2}\.))+ /* from the CLEAR (biomedical?) treebank documentation */ /* we're going to split on most hypens except a few */ /* From Supplementary Guidelines for ETTB 2.0 (Justin Mott, Colin Warner, Ann Bies; Ann Taylor) */ HTHINGEXCEPTIONPREFIXED = (e|a|u|x|agro|ante|anti|arch|be|bi|bio|co|counter|cross|cyber|de|eco|ex|extra|inter|intra|macro|mega|micro|mid|mini|multi|neo|non|over|pan|para|peri|post|pre|pro|pseudo|quasi|re|semi|sub|super|tri|ultra|un|uni|vice)(-([A-Za-z0-9\u00AD]+|{ACRO2}\.))+ HTHINGEXCEPTIONSUFFIXED = ([A-Za-z0-9][A-Za-z0-9.,\u00AD]*)(-)(esque|ette|fest|fold|gate|itis|less|most|o-torium|rama|wise)(s|es|d|ed)? HTHINGEXCEPTIONWHOLE = (mm-hm|mm-mm|o-kay|uh-huh|uh-oh)(s|es|d|ed)? /* things like 'll and 'm */ REDAUX = {APOS}([msdMSD]|re|ve|ll) /* For things that will have n't on the end. They can't end in 'n' */ /* \u00AD is soft hyphen */ SWORD = [A-Za-z\u00AD]*[A-MO-Za-mo-z](\u00AD)* SREDAUX = n{APOSETCETERA}t /* Tokens you want but already okay: C'mon 'n' '[2-9]0s '[eE]m 'till? [Yy]'all 'Cause Shi'ite B'Gosh o'clock. Here now only need apostrophe final words. */ /* Note that Jflex doesn't support {2,} form. Only {2,k}. */ /* [yY]' is for Y'know, y'all and I for I. So exclude from one letter first */ /* Rest are for French borrowings. n allows n'ts in "don'ts" */ /* Arguably, c'mon should be split to "c'm" + "on", but not yet. 
*/ APOWORD = {APOS}n{APOS}?|[lLdDjJ]{APOS}|Dunkin{APOS}|somethin{APOS}|ol{APOS}|{APOS}em|[A-HJ-XZn]{APOSETCETERA}[:letter:]{2}[:letter:]*|{APOS}[2-9]0s|{APOS}till?|[:letter:][:letter:]*[aeiouyAEIOUY]{APOSETCETERA}[aeiouA-Z][:letter:]*|{APOS}cause|cont'd\.?|'twas|nor'easter|c'mon|e'er|s'mores|ev'ry|li'l|nat'l|O{APOSETCETERA}o APOWORD2 = y{APOS} FULLURL = https?:\/\/[^ \t\n\f\r\"<>|(){}]+[^ \t\n\f\r\"<>|.!?(){},-] LIKELYURL = ((www\.([^ \t\n\f\r\"<>|.!?(){},]+\.)+[a-zA-Z]{2,4})|(([^ \t\n\f\r\"`'<>|.!?(){},-_$]+\.)+(com|net|org|edu)))(\/[^ \t\n\f\r\"<>|()]+[^ \t\n\f\r\"<>|.!?(){},-])? /* <,< should match >,>, but that's too complicated */ EMAIL = (<|<)?[a-zA-Z0-9][^ \t\n\f\r\"<>|()\u00A0{}]*@([^ \t\n\f\r\"<>|(){}.\u00A0]+\.)*([^ \t\n\f\r\"<>|(){}.\u00A0]+)(>|>)? /* Technically, names should be capped at 15 characters. However, then you get into weirdness with what happens to the rest of the characters. */ TWITTER_NAME = [@\uFF20][a-zA-Z_][a-zA-Z_0-9]* TWITTER_HASHTAG = [#\uFF03]{LETTER}({LETTER}|{DIGIT}|_)*({LETTER}|{DIGIT}) TWITTER = {TWITTER_NAME}|{TWITTER_HASHTAG} ISO8601DATETIME = [0-9]{4}-[0-9]{2}-[0-9]{2}T[0-9]{2}:[x0-9]{2}:[0-9]{2}Z? DEGREES = °[CF] /* --- This block becomes ABBREV1 and is usually followed by lower case words. --- */ /* Abbreviations - induced from 1987 WSJ by hand */ ABMONTH = Jan|Feb|Mar|Apr|Jun|Jul|Aug|Sep|Sept|Oct|Nov|Dec /* "May." isn't an abbreviation. "Jun." and "Jul." barely occur, but don't seem dangerous */ ABDAYS = Mon|Tue|Tues|Wed|Thu|Thurs|Fri /* Sat. and Sun. barely occur and can easily lead to errors, so we omit them */ /* In caseless, |a\.m|p\.m handled as ACRO, and this is better as can often be followed by capitalized. */ /* Ma. or Me. isn't included as too many errors, and most sources use Mass. etc. */ /* Fed. is tricky. Usually sentence end, but not before "Governor" or "Natl. Mtg. Assn." */ /* Make some states case sensitive, since they're also reasonably common words */ ABSTATE = Ala|Ariz|[A]z|[A]rk|Calif|Colo|Conn|Ct|Dak|[D]el|Fla|Ga|[I]ll|Ind|Kans?|Ky|[L]a|[M]ass|Md|Mich|Minn|[M]iss|Mo|Mont|Neb|Nev|Okla|[O]re|[P]a|Penn|Tenn|[T]ex|Va|Vt|[W]ash|Wisc?|Wyo /* Bhd is Malaysian companies! Rt. is Hungarian? */ /* Special case: Change the class of Pty when followed by Ltd to not sentence break (in main code below)... */ ABCOMP = Inc|Cos?|Corp|Pp?t[ye]s?|Ltd|Plc|Rt|Bancorp|Bhd|Assn|Univ|Intl|Sys /* Don't include fl. oz. since Oz turns up too much in caseless tokenizer. ft now allows upper after it for "Fort" use. */ ABNUM = tel|est|ext|sq /* p used to be in ABNUM list, but it can't be any more, since the lexer is now caseless. We don't want to have it recognized for P. Both p. and P. are now under ABBREV4. ABLIST also went away as no-op [a-e] */ ABPTIT = Jr|Sr|Bros|(Ed|Ph)\.D|Blvd|Rd|Esq /* ABBREV1 abbreviations are normally followed by lower case words. * If they're followed by an uppercase one, we assume there is also a * sentence boundary. */ ABBREV1 = ({ABMONTH}|{ABDAYS}|{ABSTATE}|{ABCOMP}|{ABNUM}|{ABPTIT}|etc|al|seq|Bldg)\. /* --- This block becomes ABBREV2 and is usually followed by upper case words. --- */ /* In the caseless world S.p.A. "Società Per Azioni (Italian: shared company)" is got as a regular acronym */ /* ACRO Is a bad case -- can go either way! */ ACRO = [A-Za-z](\.[A-Za-z])*|(Canada|Sino|Korean|EU|Japan|non)-U\.S|U\.S\.-(U\.K|U\.S\.S\.R) ACRO2 = [A-Za-z](\.[A-Za-z])+|(Canada|Sino|Korean|EU|Japan|non)-U\.S|U\.S\.-(U\.K|U\.S\.S\.R) /* ABTITLE is mainly person titles, but also Mt for mountains and Ft for Fort. 
*/ ABTITLE = Mr|Mrs|Ms|[M]iss|Drs?|Profs?|Sens?|Reps?|Attys?|Lt|Col|Gen|Messrs|Govs?|Adm|Rev|Maj|Sgt|Cpl|Pvt|Capt|Ste?|Ave|Pres|Lieut|Hon|Brig|Co?mdr|Pfc|Spc|Supts?|Det|Mt|Ft|Adj|Adv|Asst|Assoc|Ens|Insp|Mlle|Mme|Msgr|Sfc ABCOMP2 = Invt|Elec|Natl|M[ft]g|Dept /* ABRREV2 abbreviations are normally followed by an upper case word. * We assume they aren't used sentence finally. Ph is in there for Ph. D */ ABBREV4 = {ABTITLE}|vs|[v]|Alex|Wm|Jos|Cie|a\.k\.a|cf|TREAS|Ph|{ACRO}|{ABCOMP2} ABBREV2 = {ABBREV4}\. ACRONYM = ({ACRO})\. /* Cie. is used by French companies sometimes before and sometimes at end as in English Co. But we treat as allowed to have Capital following without being sentence end. Cia. is used in Spanish/South American company abbreviations, which come before the company name, but we exclude that and lose, because in a caseless segmenter, it's too confusable with CIA. */ /* in the WSJ Alex. is generally an abbreviation for Alex. Brown, brokers! */ /* Added Wm. for William and Jos. for Joseph */ /* In tables: Mkt. for market Div. for division of company, Chg., Yr.: year */ /* --- ABBREV3 abbreviations are allowed only before numbers. --- * Otherwise, they aren't recognized as abbreviations (unless they also * appear in ABBREV1 or ABBREV2). * est. is "estimated" -- common in some financial contexts. ext. is extension, ca. is circa. * "Art(s)." is for "article(s)" -- common in legal context, Sec(t). for section(s) */ /* Maybe also "op." for "op. cit." but also get a photo op. */ ABBREV3 = (ca|figs?|prop|nos?|sect?s?|arts?|bldg|prop|pp|op)\. /* Case for south/north before a few places. */ ABBREVSN = So\.|No\. /* See also a couple of special cases for pty. in the code below. */ /* phone numbers. keep multi dots pattern separate, so not confused with decimal numbers. */ PHONE = (\([0-9]{2,3}\)[ \u00A0]?|(\+\+?)?([0-9]{2,4}[\- \u00A0])?[0-9]{2,4}[\- \u00A0])[0-9]{3,4}[\- \u00A0]?[0-9]{3,5}|((\+\+?)?[0-9]{2,4}\.)?[0-9]{2,4}\.[0-9]{3,4}\.[0-9]{3,5} /* Fake duck feet appear sometimes in WSJ, and aren't likely to be SGML, less than, etc., so group. */ FAKEDUCKFEET = <<|>> LESSTHAN = <|< GREATERTHAN = >|> HYPHEN = [-_\u058A\u2010\u2011] HYPHENS = \-+ LDOTS = \.\.\.+|[\u0085\u2026] SPACEDLDOTS = \.[ \u00A0](\.[ \u00A0])+\. ATS = @+ UNDS = _+ ASTS = \*+|(\\\*){1,3} HASHES = #+ FNMARKS = {ATS}|{HASHES}|{UNDS} INSENTP = [,;:\u3001] QUOTES = {APOS}|''|[`\u2018\u2019\u201A\u201B\u201C\u201D\u0091\u0092\u0093\u0094\u201E\u201F\u2039\u203A\u00AB\u00BB]{1,2} DBLQUOT = \"|" /* Cap'n for captain, c'est for french */ TBSPEC = -(RRB|LRB|RCB|LCB|RSB|LSB)-|C\.D\.s|pro-|anti-|S(&|&)P-500|S(&|&)Ls|Cap{APOS}n|c{APOS}est|f\*[c*]k(in[g']|e[dr])?|sh\*t(ty)? TBSPEC2 = {APOS}[0-9][0-9] BANGWORDS = (E|Yahoo|Jeopardy)\! BANGMAGAZINES = OK\! /* Smileys (based on Chris Potts' sentiment tutorial, but much more restricted set - e.g., no "8)", "do:" or "):", too ambiguous) and simple Asian smileys */ SMILEY = [<>]?[:;=][\-o\*']?[\(\)DPdpO\\{@\|\[\]] ASIANSMILEY = [\^x=~<>]\.\[\^x=~<>]|[\-\^x=~<>']_[\-\^x=~<>']|\([\-\^x=~<>'][_.]?[\-\^x=~<>']\)|\([\^x=~<>']-[\^x=~<>'`]\) /* U+2200-U+2BFF has a lot of the various mathematical, etc. 
symbol ranges */ MISCSYMBOL = [+%&~\^|\\¦\u00A7¨\u00A9\u00AC\u00AE¯\u00B0-\u00B3\u00B4-\u00BA\u00D7\u00F7\u0387\u05BE\u05C0\u05C3\u05C6\u05F3\u05F4\u0600-\u0603\u0606-\u060A\u060C\u0614\u061B\u061E\u066A\u066D\u0703-\u070D\u07F6\u07F7\u07F8\u0964\u0965\u0E4F\u1FBD\u2016\u2017\u2020-\u2023\u2030-\u2038\u203B\u2043\u203E-\u2042\u2044\u207A-\u207F\u208A-\u208E\u2100-\u214F\u2190-\u21FF\u2200-\u2BFF\u3001-\u3006\u3008-\u3020\u30FB\uFF01-\uFF0F\uFF1A-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65\uFF65] /* \uFF65 is Halfwidth katakana middle dot; \u30FB is Katakana middle dot */ /* Math and other symbols that stand alone: °²× ∀ */ // Consider this list of bullet chars: 2219, 00b7, 2022, 2024 %% c[+][+] { return getNext(); } (c|f)# { return getNext(); } cannot { if (splitAssimilations) { yypushback(3) ; return getNext(); } else { return getNext(); } } 'twas { if (splitAssimilations) { yypushback(3) ; return getNext(); } else { return getNext(); } } 'tis { if (splitAssimilations) { yypushback(2) ; return getNext(); } else { return getNext(); } } gonna|gotta|lemme|gimme|wanna { if (splitAssimilations) { yypushback(2) ; return getNext(); } else { return getNext(); } } dunno { if (splitAssimilations) { yypushback(3) ; return getNext(); } else { return getNext(); } } /* Remnant after pushing back from dunno */ nno/[^A-Za-z0-9] { if (splitAssimilations) { yypushback(2) ; return getNext(); } else { return getNext(); } } {SGML1} { final String origTxt = yytext(); String txt = origTxt; if (normalizeSpace) { txt = SINGLE_SPACE_PATTERN.matcher(txt).replaceAll("\u00A0"); // change to non-breaking space } return getNext(txt, origTxt); } {SGML2} { final String origTxt = yytext(); String txt = origTxt; if (normalizeSpace) { txt = txt.replace(' ', '\u00A0'); // change space to non-breaking space } return getNext(txt, origTxt); } {SPMDASH} { if (ptb3Dashes) { return getNext(ptbmdash, yytext()); } else { return getNext(); } } {SPAMP} { return getNormalizedAmpNext(); } {SPPUNC} { return getNext(); } {WORD}/{REDAUX} { final String origTxt = yytext(); String tmp = removeSoftHyphens(origTxt); if (americanize) { tmp = Americanize.americanize(tmp); } return getNext(tmp, origTxt); } {SWORD}/{SREDAUX} { final String txt = yytext(); return getNext(removeSoftHyphens(txt), txt); } {WORD} { final String origTxt = yytext(); String tmp = removeSoftHyphens(origTxt); if (americanize) { tmp = Americanize.americanize(tmp); } return getNext(tmp, origTxt); } {APOWORD} { return handleQuotes(yytext(), false); } {APOWORD2}/[:letter:] { return getNext(); } {FULLURL} { String txt = yytext(); if (escapeForwardSlashAsterisk) { txt = delimit(txt, '/'); txt = delimit(txt, '*'); } return getNext(txt, yytext()); } {LIKELYURL} { String txt = yytext(); if (escapeForwardSlashAsterisk) { txt = delimit(txt, '/'); txt = delimit(txt, '*'); } return getNext(txt, yytext()); } {EMAIL} { return getNext(); } {TWITTER} { return getNext(); } {REDAUX}/[^A-Za-z] { return handleQuotes(yytext(), false); } {SREDAUX}/[^A-Za-z] { return handleQuotes(yytext(), false); } {DATE} { String txt = yytext(); if (escapeForwardSlashAsterisk) { txt = delimit(txt, '/'); } return getNext(txt, yytext()); } {NUMBER} { return getNext(removeSoftHyphens(yytext()), yytext()); } {SUBSUPNUM} { return getNext(); } {FRAC} { String txt = yytext(); // if we are in strictTreebank3 mode, we need to reject everything after a space or non-breaking space... 
if (strictTreebank3) { int spaceIndex = indexOfSpace(txt); if (spaceIndex >= 0) { yypushback(txt.length() - spaceIndex); String newText = yytext(); return getNext(newText, newText); } } if (escapeForwardSlashAsterisk) { txt = delimit(txt, '/'); } if (normalizeSpace) { txt = txt.replace(' ', '\u00A0'); // change space to non-breaking space } return getNext(txt, txt); } {FRAC2} { return normalizeFractions(yytext()); } {TBSPEC} { return getNormalizedAmpNext(); } {BANGWORDS} { return getNext(); } {BANGMAGAZINES}/{SPACENL}magazine { return getNext(); } {BANGMAGAZINES}/{SPACE}magazine { return getNext(); } {THING3} { if (escapeForwardSlashAsterisk) { return getNext(delimit(yytext(), '/'), yytext()); } else { return getNext(); } } {DOLSIGN} { return getNext(); } {DOLSIGN2} { if (normalizeCurrency) { return getNext(normalizeCurrency(yytext()), yytext()); } else { return getNext(); } } /* Any acronym can be treated as sentence final iff followed by this list of words (pronouns, determiners, and prepositions, etc.). "U.S." is the single big source of errors. Character classes make this rule case sensitive! (This is needed!!). A one letter acronym candidate like "Z." or "I." in this context usually isn't, and so we return the leter and pushback the period for next time. */ {ACRONYM}/({SPACENLS})([A]bout|[A]ccording|[A]dditionally|[A]fter|[A]n|[A]|[A]s|[A]t|[B]ut|[D]id|[D]uring|[E]arlier|[H]e|[H]er|[H]ere|[H]ow|[H]owever|[I]f|[I]n|[I]t|[L]ast|[M]any|[M]ore|[M]r\.|[M]s\.|[N]ow|[O]nce|[O]ne|[O]ther|[O]ur|[S]he|[S]ince|[S]o|[S]ome|[S]uch|[T]hat|[T]he|[T]heir|[T]hen|[T]here|[T]hese|[T]hey|[T]his|[W]e|[W]hen|[W]hile|[W]hat|[W]ho|[W]hy|[Y]et|[Y]ou|{SGML1})({SPACENL}|[?!]) { return processAcronym(); } {ACRONYM}/({SPACES})([A]bout|[A]ccording|[A]dditionally|[A]fter|[A]n|[A]|[A]s|[A]t|[B]ut|[D]id|[D]uring|[E]arlier|[H]e|[H]er|[H]ere|[H]ow|[H]owever|[I]f|[I]n|[I]t|[L]ast|[M]any|[M]ore|[M]r\.|[M]s\.|[N]ow|[O]nce|[O]ne|[O]ther|[O]ur|[S]he|[S]ince|[S]o|[S]ome|[S]uch|[T]hat|[T]he|[T]heir|[T]hen|[T]here|[T]hese|[T]hey|[T]his|[W]e|[W]hen|[W]hile|[W]hat|[W]ho|[W]hy|[Y]et|[Y]ou|{SGML1})({SPACE}|[?!]) { return processAcronym(); } /* Special case to get ca., fig. or Prop. before numbers */ {ABBREV3}/{SPACENL}?[:digit:] { return processAbbrev3(); } {ABBREV3}/{SPACENL}?[:digit:] { return processAbbrev3(); } {ABBREVSN}/{SPACENL}+(Africa|Korea|Cal) { return getNext(); } {ABBREVSN}/{SPACE}+(Africa|Korea|Cal) { return getNext(); } /* Special case to get pty. ltd. or pty limited. Also added "Co." since someone complained, but usually a comma after it. */ (pt[eyEY]|co)\./{SPACE}(ltd|lim) { return getNext(); } {ABBREV1}/{SENTEND1} { return processAbbrev1(); } {ABBREV1}/{SENTEND2} { return processAbbrev1(); } {ABBREV1}/[^][^] { return getNext(); } {ABBREV1}/[^\r\n][^\r\n] { return getNext(); } {ABBREV1} { // this one should only match if we're basically at the end of file // since the last one matches two things, even newlines (if not tokenize per line) String s; if (strictTreebank3 && ! 
"U.S.".equals(yytext())) { yypushback(1); // return a period for next time s = yytext(); } else { s = yytext(); yypushback(1); // return a period for next time } return getNext(s, yytext()); } {ABBREV2} { return getNext(); } {ABBREV4}/{SPACE} { return getNext(); } {ACRO}/{SPACENL} { return getNext(); } {TBSPEC2}/{SPACENL} { return getNext(); } {ISO8601DATETIME} { return getNext(); } {DEGREES} { return getNext(); } {FILENAME}/({SPACENL}|[.?!,\"']) { return getNext(); } {FILENAME}/({SPACE}|[.?!,\"']) { return getNext(); } {WORD}\./{INSENTP} { return getNext(removeSoftHyphens(yytext()), yytext()); } {PHONE} { String txt = yytext(); if (normalizeSpace) { txt = txt.replace(' ', '\u00A0'); // change space to non-breaking space } if (normalizeParentheses) { txt = LEFT_PAREN_PATTERN.matcher(txt).replaceAll(openparen); txt = RIGHT_PAREN_PATTERN.matcher(txt).replaceAll(closeparen); } return getNext(txt, yytext()); } {DBLQUOT}/[A-Za-z0-9$] { return handleQuotes(yytext(), true); } {DBLQUOT} { return handleQuotes(yytext(), false); } \x7f { if (invertible) { prevWordAfter.append(yytext()); } } {LESSTHAN} { return getNext("<", yytext()); } {GREATERTHAN} { return getNext(">", yytext()); } {SMILEY}/[^A-Za-z0-9] { String txt = yytext(); String origText = txt; if (normalizeParentheses) { txt = LEFT_PAREN_PATTERN.matcher(txt).replaceAll(openparen); txt = RIGHT_PAREN_PATTERN.matcher(txt).replaceAll(closeparen); } return getNext(txt, origText); } {ASIANSMILEY} { String txt = yytext(); String origText = txt; if (normalizeParentheses) { txt = LEFT_PAREN_PATTERN.matcher(txt).replaceAll(openparen); txt = RIGHT_PAREN_PATTERN.matcher(txt).replaceAll(closeparen); } return getNext(txt, origText); } \{ { if (normalizeOtherBrackets) { return getNext(openbrace, yytext()); } else { return getNext(); } } \} { if (normalizeOtherBrackets) { return getNext(closebrace, yytext()); } else { return getNext(); } } \[ { if (normalizeOtherBrackets) { return getNext("-LSB-", yytext()); } else { return getNext(); } } \] { if (normalizeOtherBrackets) { return getNext("-RSB-", yytext()); } else { return getNext(); } } \( { if (normalizeParentheses) { return getNext(openparen, yytext()); } else { return getNext(); } } \) { if (normalizeParentheses) { return getNext(closeparen, yytext()); } else { return getNext(); } } {HYPHENS} { if (yylength() >= 3 && yylength() <= 4 && ptb3Dashes) { return getNext(ptbmdash, yytext()); } else { return getNext(); } } {LDOTS}/\.{SPACENLS}[:letter:] { /* attempt to treat fourth ellipsis as period if followed by space and letter. */ return handleEllipsis(yytext()); } {LDOTS}/\.{SPACES}[:letter:] { /* attempt to treat fourth ellipsis as period if followed by space and letter. */ return handleEllipsis(yytext()); } {SPACEDLDOTS}/{SPACE}\.{SPACENLS}[:letter:] { /* attempt to treat fourth ellipsis as period if followed by space and letter. */ return handleEllipsis(yytext()); } {SPACEDLDOTS}/{SPACE}\.{SPACES}[:letter:] { /* attempt to treat fourth ellipsis as period if followed by space and letter. 
*/ return handleEllipsis(yytext()); } {LDOTS}|{SPACEDLDOTS} { return handleEllipsis(yytext()); } {FNMARKS} { return getNext(); } {ASTS} { if (escapeForwardSlashAsterisk) { return getNext(delimit(yytext(), '*'), yytext()); } else { return getNext(); } } {INSENTP} { return getNext(); } [?!]+ { return getNext(); } [.¡¿\u037E\u0589\u061F\u06D4\u0700-\u0702\u07FA\u3002] { return getNext(); } =+ { return getNext(); } \/ { if (escapeForwardSlashAsterisk) { return getNext(delimit(yytext(), '/'), yytext()); } else { return getNext(); } } /* {HTHING}/[^a-zA-Z0-9.+] { return getNext(removeSoftHyphens(yytext()), yytext()); } */ {HTHINGEXCEPTIONWHOLE} {return getNext(removeSoftHyphens(yytext()), yytext());} {HTHINGEXCEPTIONWHOLE}\./{INSENTP} {return getNext(removeSoftHyphens(yytext()), yytext());} {HTHINGEXCEPTIONPREFIXED} {return getNext(removeSoftHyphens(yytext()), yytext());} {HTHINGEXCEPTIONPREFIXED}\./{INSENTP} {return getNext(removeSoftHyphens(yytext()), yytext());} {HTHINGEXCEPTIONSUFFIXED} {return getNext(removeSoftHyphens(yytext()), yytext());} {HTHINGEXCEPTIONSUFFIXED}\./{INSENTP} {return getNext(removeSoftHyphens(yytext()), yytext());} {HTHING} { breakByHyphens(yytext()); return getNext(removeSoftHyphens(yytext()), yytext()); } {HTHING}\./{INSENTP} { breakByHyphens(yytext()); return getNext(removeSoftHyphens(yytext()), yytext()); } {THING}\./{INSENTP} { return handleQuotes(yytext(), false); } /* A THING can contain quote like O'Malley */ {THING} { return handleQuotes(yytext(), false); } {THINGA}\./{INSENTP} { return getNormalizedAmpNext(); } {THINGA} { return getNormalizedAmpNext(); } '/[A-Za-z][^ \t\n\r\u00A0] { /* invert quote - often but not always right */ return handleQuotes(yytext(), true); } /* This REDAUX is needed is needed in case string ends on "it's". See: testJacobEisensteinApostropheCase */ {REDAUX} { return handleQuotes(yytext(), false); } {SREDAUX} { return handleQuotes(yytext(), false); } {QUOTES} { return handleQuotes(yytext(), false); } {FAKEDUCKFEET} { return getNext(); } {MISCSYMBOL} { return getNext(); } \u0095 { return getNext("\u2022", yytext()); } /* cp1252 bullet mapped to unicode */ \u0099 { return getNext("\u2122", yytext()); } /* cp1252 TM sign mapped to unicode */ \0|{SPACES}|[\u200B\u200E-\u200F\uFEFF] { if (invertible) { prevWordAfter.append(yytext()); } } {NEWLINE} { if (tokenizeNLs) { return getNext(NEWLINE_TOKEN, yytext()); // js: for tokenizing carriage returns } else if (invertible) { prevWordAfter.append(yytext()); } }   { if (invertible) { prevWordAfter.append(yytext()); } } . { String str = yytext(); int first = str.charAt(0); String msg = String.format("Untokenizable: %s (U+%s, decimal: %s)", yytext(), Integer.toHexString(first).toUpperCase(), Integer.toString(first)); switch (untokenizable) { case NONE_DELETE: if (invertible) { prevWordAfter.append(str); } break; case FIRST_DELETE: if (invertible) { prevWordAfter.append(str); } if ( ! this.seenUntokenizableCharacter) { logger.warning(msg); this.seenUntokenizableCharacter = true; } break; case ALL_DELETE: if (invertible) { prevWordAfter.append(str); } logger.warning(msg); this.seenUntokenizableCharacter = true; break; case NONE_KEEP: return getNext(); case FIRST_KEEP: if ( ! 
this.seenUntokenizableCharacter) { logger.warning(msg); this.seenUntokenizableCharacter = true; } return getNext(); case ALL_KEEP: logger.warning(msg); this.seenUntokenizableCharacter = true; return getNext(); } } <> { if (invertible) { prevWordAfter.append(yytext()); String str = prevWordAfter.toString(); prevWordAfter.setLength(0); prevWord.set(CoreAnnotations.AfterAnnotation.class, str); } return null; }




