package edu.stanford.nlp.process;
// Stanford English Tokenizer -- a deterministic, fast high-quality tokenizer
// Copyright (c) 2002-2009 The Board of Trustees of
// The Leland Stanford Junior University. All Rights Reserved.
//
// This program is free software; you can redistribute it and/or
// modify it under the terms of the GNU General Public License
// as published by the Free Software Foundation; either version 2
// of the License, or (at your option) any later version.
//
// This program is distributed in the hope that it will be useful,
// but WITHOUT ANY WARRANTY; without even the implied warranty of
// MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
// GNU General Public License for more details.
//
// You should have received a copy of the GNU General Public License
// along with this program; if not, write to the Free Software
// Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
//
// For more information, bug reports, fixes, contact:
// Christopher Manning
// Dept of Computer Science, Gates 1A
// Stanford CA 94305-9010
// USA
// [email protected]
// http://nlp.stanford.edu/software/
import java.io.Reader;
import java.util.logging.Logger;
import java.util.Map;
import java.util.Map.Entry;
import java.util.Properties;
import java.util.Set;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.ling.CoreAnnotations.AfterAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.BeforeAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.OriginalTextAnnotation;
import edu.stanford.nlp.util.StringUtils;
/** Provides a tokenizer or lexer that does a pretty good job at
* deterministically tokenizing English according to Penn Treebank conventions.
* The class is a scanner generated by
* JFlex (1.4.3) from the specification
* file
* PTBLexer.flex. As well as copying what is in the Treebank,
* it now contains some extensions to deal with modern text and encoding
* issues, such as recognizing URLs and common Unicode characters, and a
* variety of options for doing or suppressing certain normalizations.
* Although they shouldn't really be there, it also interprets certain of the
* characters between U+0080 and U+009F as Windows CP1252 characters.
*
* Fine points: Output normalized tokens should not contain spaces,
* provided the normalizeSpace option is true; any such space will be
* turned into a non-breaking space (U+00A0). Otherwise, spaces can
* appear in a couple of token classes (phone numbers, fractions).
* The original
* PTB tokenization (messy) standard also escapes certain other characters,
* such as * and /, and normalizes things like " to `` or ''. By default,
* this tokenizer does all of those things. However, you can turn them
* all off by using the ptb3Escaping=false option, turn parts of it on
* or off individually, or turn Unicode character alternatives on with
* other options. You can also build an
* invertible tokenizer, with which you can still access the original
* character sequence and the non-token whitespace around it in a CoreLabel.
* And you can ask for newlines to be tokenized.
*
* Character entities: For legacy reasons, this file will parse and
* interpret some simple SGML/XML/HTML character entities. For modern formats
* like XML, you are better off doing XML parsing, and then running the
* tokenizer on CDATA elements. But we and others frequently work with simple
* SGML text corpora that are not XML (like LDC text collections). In practice,
* they only include very simple markup and a few simple entities, and the
* combination of the -parseInside option and the minimal character entity
* support in this file is enough to handle them. So we leave this functionality
* in, even though it could conceivably mess with a correct XML file if the
* output of decoding had things that look like character entities. In general,
* handled symbols are changed to ASCII/Unicode forms, but handled accented
* letters are just left as character entities in words.
*
* Character support: PTBLexer works for a large subset of
* Unicode Basic Multilingual Plane characters (only). It recognizes all
* characters that match the JFlex/Java [:letter:] and [:digit:] character
* classes (but, unfortunately, JFlex does not support most
* other Unicode character classes available in Java regular expressions).
* It also matches all defined characters in the Unicode ranges U+0000-U+07FF
* excluding control characters except the ones very standardly found in
* plain text documents. Finally, select other characters commonly found
* in English Unicode text are included.
*
* Implementation note: The scanner is caseless, but note, if adding
* or changing regexps, that caseless does not expand inside character
* classes. From the manual: "The %caseless option does not change the
* matched text and does not affect character classes. So [a] still only
* matches the character a and not A, too." Note that some character
* classes still deliberately don't have both cases, so the scanner's
* operation isn't completely case-independent, though it mostly is.
*
* Implementation note: This Java class is automatically generated
* from PTBLexer.flex using jflex. DO NOT EDIT THE JAVA SOURCE. This file
* has now been updated for JFlex 1.4.2+. (This required code changes: this
* version only works right with JFlex 1.4.2+; the previous version only works
* right with JFlex 1.4.1.)
*
* @author Tim Grow
* @author Christopher Manning
* @author Jenny Finkel
*/
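// Usage sketch (not part of the lexer specification proper): one way the
// generated class might be driven directly. CoreLabelTokenFactory is assumed
// to be the companion factory in this package; with %type Object, the
// generated next() returns null at end of input and may throw IOException.
//
//   Reader r = new StringReader("He said, \"Hi\".");
//   PTBLexer lexer = new PTBLexer(r, new CoreLabelTokenFactory(), "invertible");
//   for (Object tok; (tok = lexer.next()) != null; ) {
//     CoreLabel cl = (CoreLabel) tok;
//     // With the invertible option, OriginalTextAnnotation and
//     // BeforeAnnotation let you reconstruct the original text.
//     System.out.println(cl.get(BeforeAnnotation.class)
//         + "|" + cl.get(OriginalTextAnnotation.class));
//   }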
%%
%class PTBLexer
%unicode
%function next
%type Object
%char
%caseless
%state YyStrictlyTreebank3 YyTraditionalTreebank3
%{
/**
* Constructs a new PTBLexer. You specify the type of result tokens with a
* LexedTokenFactory, and can specify the treatment of tokens by boolean
* options given in a comma-separated String
* (e.g., "invertible,normalizeParentheses=true").
* If the String is null or empty, you get the traditional
* PTB3 normalization behaviour (i.e., you get ptb3Escaping=true). If you
* want no normalization, then you should pass in the String
* "ptb3Escaping=false". The known option names are:
*
*
* invertible: Store enough information about the original form of the
* token and the whitespace around it that a list of tokens can be
* faithfully converted back to the original String. Valid only if the
* LexedTokenFactory is an instance of CoreLabelTokenFactory. The
* keys used in it are TextAnnotation for the tokenized form,
* OriginalTextAnnotation for the original string, BeforeAnnotation and
* AfterAnnotation for the whitespace before and after a token, and
* perhaps BeginPositionAnnotation and EndPositionAnnotation to record
* token begin/after end offsets, if they were specified to be recorded
* in TokenFactory construction. (Like the String class, begin and end
* are done so end - begin gives the token length.)
*
* tokenizeNLs: Whether end-of-lines should become tokens (or just
* be treated as part of whitespace).
*
* ptb3Escaping: Enable all traditional PTB3 token transforms
* (like -LRB-, -RRB-). This is a macro flag that sets or clears all the
* options below.
*
* americanize: Whether to rewrite common British English spellings
* as American English spellings.
*
* normalizeSpace: Whether any spaces in tokens (phone numbers, fractions)
* get turned into U+00A0 (non-breaking space). It's dangerous to turn
* this off for most of our Stanford NLP software, which assumes no
* spaces in tokens.
*
* normalizeAmpersandEntity: Whether to map the XML entity &amp; to an
* ampersand character.
*
* normalizeCurrency: Whether to do some awful lossy currency mappings
* to turn common currency characters into $, #, or "cents", reflecting
* the fact that nothing else appears in the old PTB3 WSJ. (No Euro!)
*
* normalizeFractions: Whether to map certain common composed
* fraction characters to spelled-out letter forms like "1/2".
*
* normalizeParentheses: Whether to map round parentheses to -LRB-,
* -RRB-, as in the Penn Treebank.
*
* normalizeOtherBrackets: Whether to map other common bracket characters
* to -LCB-, -LRB-, -RCB-, -RRB-, roughly as in the Penn Treebank.
*
* asciiQuotes: Whether to map quote characters to the traditional ' and ".
*
* latexQuotes: Whether to map quotes to ``, `, ', '', as in LaTeX
* and the PTB3 WSJ (though this is now heavily frowned on in Unicode).
* If true, this takes precedence over the setting of unicodeQuotes;
* if both are false, no mapping is done.
*
* unicodeQuotes: Whether to map quotes to the range U+2018 to U+201D,
* the preferred Unicode encoding of single and double quotes.
*
* ptb3Ellipsis: Whether to map ellipses to "...", the old PTB3 WSJ coding
* of an ellipsis. If true, this takes precedence over the setting of
* unicodeEllipsis; if both are false, no mapping is done.
*
* unicodeEllipsis: Whether to map dot and optional space sequences to
* U+2026, the Unicode ellipsis character.
*
* ptb3Dashes: Whether to turn various dash characters into "--",
* the dominant encoding of dashes in the PTB3 WSJ.
*
* escapeForwardSlashAsterisk: Whether to put a backslash escape in front
* of / and * as the old PTB3 WSJ does for some reason (something to do
* with Lisp readers??).
*
* untokenizable: What to do with untokenizable characters (ones not
* known to the tokenizer). Six options combining whether to log a
* warning for none, the first, or all, and whether to delete them or
* to include them as single-character tokens in the output: noneDelete,
* firstDelete, allDelete, noneKeep, firstKeep, allKeep.
* The default is "firstDelete".
*
* strictTreebank3: PTBTokenizer deliberately deviates from strict PTB3
* WSJ tokenization in two cases. Setting this improves compatibility
* for those cases. They are: (i) When an acronym is followed by a
* sentence end, such as "U.S." at the end of a sentence, the PTB3
* has tokens of "U.S" and ".", while by default PTBTokenizer duplicates
* the period, returning tokens of "U.S." and "."; and (ii) PTBTokenizer
* will return numbers with a whole number and a fractional part like
* "5 7/8" as a single token (with a non-breaking space in the middle),
* while the PTB3 separates them into two tokens "5" and "7/8".
*
*
* @param r The Reader to tokenize text from
* @param tf The LexedTokenFactory that will be invoked to convert
* each substring extracted by the lexer into some kind of Object
* (such as a Word or CoreLabel).
*/
public PTBLexer(Reader r, LexedTokenFactory tf, String options) {
this(r);
this.tokenFactory = tf;
if (options == null) {
options = "";
}
Properties prop = StringUtils.stringToProperties(options);
Set<Map.Entry<Object,Object>> props = prop.entrySet();
for (Map.Entry