edu.stanford.nlp.parser.lexparser.package-info

Stanford Parser processes raw text in English, Chinese, German, Arabic, and French, and extracts constituency parse trees.

This package contains implementations of three probabilistic parsers for natural language text: an accurate unlexicalized probabilistic context-free grammar (PCFG) parser, a probabilistic lexical dependency parser, and a factored, lexicalized probabilistic context-free grammar parser, which does joint inference over the product of the first two parsers. The parser supports various languages and input formats. For English, for most purposes, we now recommend just using the unlexicalized PCFG: with a well-engineered grammar (as supplied for English), it is fast, accurate, and requires much less memory, and in many real-world uses lexical preferences are unavailable or inaccurate across domains or genres, so the unlexicalized parser will perform just as well as a lexicalized parser. However, the factored parser will sometimes provide greater accuracy on English through its knowledge of lexical dependencies, and it is considerably better than the PCFG parser alone for most other languages (with less rigid word order), including German, Chinese, and Arabic. The dependency parser can be run alone, but this is usually not useful (its accuracy is much lower). The output of the parser can be presented in various forms, such as just part-of-speech tags, phrase structure trees, or dependencies, and is controlled by options passed to the TreePrint class.

References

The factored parser and the unlexicalized PCFG parser are described in:

  • Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. Advances in Neural Information Processing Systems 15 (NIPS 2002).
  • Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the Association for Computational Linguistics, 2003.

The factored parser uses a product model, where the preferences of an unlexicalized PCFG parser and a lexicalized dependency parser are combined by a third parser, which does exact search using A* outside estimates (which are Viterbi outside scores, precalculated during PCFG and dependency parsing of the sentence).
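
Schematically, working with log scores, the factored model scores a tree t as the sum of the two component scores, and the A* priority of a chart edge e adds each component's inside score to its precomputed Viterbi outside score (a sketch of the idea as described above, not the papers' exact notation):

s(t) = s_{\mathrm{pcfg}}(t) + s_{\mathrm{dep}}(t), \qquad
\pi(e) = \beta_{\mathrm{pcfg}}(e) + \alpha^{*}_{\mathrm{pcfg}}(e) + \beta_{\mathrm{dep}}(e) + \alpha^{*}_{\mathrm{dep}}(e)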

We have been splitting up the parser into public classes, but some of the internals are still contained in the file FactoredParser.java.

The class LexicalizedParser provides an interface for either training a parser from a treebank or parsing text using a saved parser. It can be called programmatically, and its command-line main() method supports many options.

The parser has been ported to multiple languages. German, Chinese, and Arabic grammars are included. The first publication below documents the Chinese parser. The German parser was developed for and used in the second paper (but the paper contains very little detail on it).

  • Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003, pp. 439-446.
  • Roger Levy and Christopher D. Manning. 2004. Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation. ACL 2004, pp. 328-335.

The grammatical relations output of the parser is presented in:

  • Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. LREC 2006.

End user usage

Requirements

You need Java 1.6+ installed on your system, with java on your PATH where commands are looked for.

You need a machine with a fair amount of memory. Required memory depends on the choice of parser, the size of the grammar, and other factors such as the presence of numerous unknown words. To run the PCFG parser on sentences of up to 40 words you need 100 MB of memory. To handle longer sentences, you need more (to parse sentences up to 100 words, you need 400 MB). For running the factored parser, 600 MB is needed for sentences up to 40 words, and factored parsing of sentences up to 200 words requires around 3 GB of memory. Training a new lexicalized parser requires about 1500 MB of memory; much less is needed for training a PCFG.

For just parsing text, you need a saved parser model (grammars, lexicon, etc.), which can be represented either as a text file or as a binary (Java serialized object) representation, either of which can be gzip-compressed. A number of models are provided in the supplied stanford-parser-$VERSION-models.jar file in the distributed version, and can be accessed from there by having this jar file on your CLASSPATH and specifying a model via a classpath entry such as edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. (Stanford NLP people can also find the grammars in the directory /u/nlp/data/lexparser.) Other available grammars include englishFactored.ser.gz for English and chineseFactored.ser.gz for Chinese.

You need the parser code and grammars accessible. This can be done by having the supplied jar files on your CLASSPATH. The examples below assume you are in the parser distribution home directory. From there you can set up the classpath with the command-line argument -cp "*" (or perhaps -cp "*;" on certain versions of Windows). Then, if you have some sentences in testsent.txt (as plain text), the following commands should work.

Command-line parsing usage

Parsing a local text file:

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt

Parsing a document over the web:

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 40 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz http://nlp.stanford.edu/software/lex-parser.shtml

Note the -maxLength flag: this sets the maximum length of sentence to parse. If you do not set one, the parser will try to parse sentences of any length, but will usually run out of memory when doing so. This matters for web pages whose text may not consist of real sentences (or for technical documents that turn out to have 300-word sentences). The parser does only very rudimentary stripping of HTML tags, so it will work okay on plain-text web pages, but it won't work adequately on most complex, commercial, script-driven pages. If you want to handle these, you'll need to provide your own preprocessor and then call the parser on its output, for example along the lines sketched below.
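
If you need to handle such pages, a crude tag stripper along the following lines can serve as a starting point (this StripHtml class and its regex are a hypothetical illustration, not part of the parser):

import java.util.regex.Pattern;

// Hypothetical example: crude removal of HTML tags before handing text to the
// parser. A real preprocessor should also drop scripts and styles, decode
// entities, and filter out navigation boilerplate.
class StripHtml {
  private static final Pattern TAG = Pattern.compile("<[^>]*>");

  static String strip(String html) {
    return TAG.matcher(html).replaceAll(" ");
  }
}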

The parser sends parse trees to stdout and other information on what it is doing to stderr, so one commonly wants to direct just stdout to an output file, in the standard way, as in the example below.
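
For example, to save just the trees from the first command above to a file (the output file name is illustrative):

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt > testsent.tree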

Other languages: Chinese

Parsing a Chinese sentence (in the default input encoding for Chinese of GB18030 - note that you'll need the right fonts to see the output correctly):

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent

or for Unicode (UTF-8) format files:

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams -encoding UTF-8 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent-utf

For Chinese, the package includes two simple word segmenters. One is a lexicon-based maximum-match segmenter, and the other uses the parser to do Hidden Markov Model-based word segmentation. These segmentation methods are okay, but if you would like a high-quality segmentation of Chinese text, you will have to segment the Chinese yourself as a preprocessing step. The supplied grammars assume that Chinese input has already been word-segmented according to Penn Chinese Treebank conventions. Choosing Chinese with -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams makes space-separated words the default tokenization. To do word segmentation within the parser, give one of the options -segmentMarkov or -segmentMaxMatch, as in the example below.
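
For example, to have the parser itself segment raw UTF-8 Chinese text with the HMM-based segmenter (the placement of the flag among the other options is illustrative):

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams -encoding UTF-8 -segmentMarkov edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent-utf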

Other languages

The parser also supports other languages, including German and French.

Command-line options

The program has many options. The most useful end-user option is -maxLength n, which determines the maximum length of sentence that the parser will parse. Longer sentences are skipped, with a message printed to stderr.

Input formatting and tokenization options

The parser supports many different input formats: tokenized or not, split into sentences or not, and POS-tagged or not.

The input may be tokenized or not, and users may supply their own tokenizers. The input is by default assumed not to be tokenized; if the input is tokenized, supply the option -tokenized. If the input is not tokenized, you may supply the name of a tokenizer class with -tokenizer tokenizerClassName; otherwise the default tokenizer (edu.stanford.nlp.process.PTBTokenizer) is used. This tokenizer should perform well over typical plain newswire-style text.

The input may or may not have already been split into sentences. The input is by default assumed not to be split; if sentences are split, supply the option -sentences delimitingToken, where the delimiting token may be any string. As a special case, if the delimiting token is "newline", the parser will assume that each line of the file is a sentence. An example command using both of these options appears below.
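
For example, the following command parses input that is already tokenized, one sentence per line (the input file name is illustrative):

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tokenized -sentences newline edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz tokenized-sents.txt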

Simple XML can also be parsed. The main method does not incorporate an XML parser, but one can fake certain simple cases with -parseInside regex, which will only parse the tokens inside elements matched by the regular expression regex. These elements are assumed to be pure CDATA. If you use -parseInside s, then the parser will accept input in which sentences are marked XML-style with <s> ... </s> (the same format as the input to Eugene Charniak's parser). An example appears below.
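
For example, to parse only the material inside <s> ... </s> elements (the input file name is illustrative):

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -parseInside s edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz marked-sentences.xml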

Finally, the input may be tagged or not. If it is tagged, the program assumes that words and tags are separated by a non-whitespace separating character such as '/' or '_'. You give the option -tagSeparator tagSeparator to specify tagged text with a tag separator. You also need to tell the parser to use a different tokenizer, using the flags -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory.

You can see examples of many of these options in the test directory. As an example, you can parse the example file with partial POS-tagging with this command:

java edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 20 -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory englishPCFG.ser.gz pos-sentences.txt

There are some restrictions on the interpretation of POS-tagged input:

  • The tagset must match the parser POS set. If you are using our supplied parser data files, that means you must be using Penn Treebank POS tags.
  • An indicated tagging determines which of the taggings allowed by the lexicon will be used, but the parser will not accept tags not allowed by its lexicon. This is usually not problematic, since rare or unknown words are allowed to have many POS tags, but it would be if you were trying to persuade the parser that are should be tagged as a noun in the sentence "100 are make up one hectare.", since it will only allow are to have a verbal tagging.

For the examples in pos-sentences.txt:

  1. This sentence is parsed correctly with no tags given.
  2. It is also parsed correctly when the parser is told that butter is a verb.
  3. You get a different, worse parse when telling it butter is a noun.
  4. You get the same parse as 1. with all tags correctly supplied.
  5. The parser won't accept can as a VB, but does accept butter as a noun, so you get the same parse as 3.
  6. People can butter can be an NP.
  7. Most words can be NN, but not common function words like their, with, a.

Note that if the program is reading tags correctly, they aren't printed in the sentence it says it is parsing; only the words are printed there.

Output formatting options

You can set how sentences are printed out by using the -outputFormat format option. The native and default format is as trees are formatted in the Penn Treebank, but there are a number of other useful options:

  • penn The default.
  • oneline Printed out on one line.
  • wordsAndTags Use the parser as a POS tagger.
  • latexTree Help write your LaTeX papers (for use with Avery Andrews' trees.sty package).
  • typedDependenciesCollapsed Write sentences in a typed dependency format that represents sentences via grammatical relations between words. Suitable for representing text as a semantic network.

You can get each sentence printed in multiple formats by giving a comma-separated list of formats, as in the example below. See the TreePrint class for more information on available output formats and options.
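
For example, the following prints each sentence first as a Penn Treebank tree and then as collapsed typed dependencies, combining two of the formats listed above:

java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn,typedDependenciesCollapsed" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt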

Programmatic usage

LexicalizedParser can easily be called within a larger application. It implements a couple of useful interfaces that provide for simple use: edu.stanford.nlp.parser.ViterbiParser and edu.stanford.nlp.process.Function. The following simple class shows typical usage:

import java.util.*;
import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {
  public static void main(String[] args) {
    // Load a saved parser model from the models jar on the classpath
    LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});
    // Parse a pre-tokenized sentence and print the phrase structure tree
    String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();
    // Convert the phrase structure tree to typed dependencies
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();
    // Print the tree in two formats via TreePrint
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }
}

In a usage such as this, the parser expects sentences already tokenized according to Penn Treebank conventions. For arbitrary text, prior processing must be done to achieve such tokenization (the main method of LexicalizedParser provides an example of doing this). The example shows how most command-line arguments can also be passed to the parser when called programmatically. Note that using the -retainTmpSubcategories option is necessary to get the best results in the typed dependencies output when recognizing temporal noun phrases ("last week", "next February").

Some code fragments which include tokenization using Penn Treebank conventions follow:

import java.io.StringReader;
import java.util.List;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

LexicalizedParser lp = LexicalizedParser.loadModel("englishPCFG.ser.gz");
lp.setOptionFlags(new String[]{"-outputFormat", "penn,typedDependenciesCollapsed", "-retainTmpSubcategories"});
// A reusable tokenizer factory for Penn Treebank-style tokenization
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");

public Tree processSentence(String sentence) {
  // Tokenize the raw sentence, then parse the resulting token list
  List<CoreLabel> rawWords = tokenizerFactory.getTokenizer(new StringReader(sentence)).tokenize();
  Tree bestParse = lp.parseTree(rawWords);
  return bestParse;
}

Writing and reading trained parsers to and from files

A trained parser consists of grammars, a lexicon, and option values. Once a parser has been trained, it may be written to file in one of two formats: binary serialized Java objects or human-readable text data. A parser can also be quickly reconstructed (either programmatically or at the command line) from files containing a parser in either of these formats.

The binary serialized Java objects format is created using standard tools provided by the java.io package, and is not text and not human-readable. To train and then save a parser as a binary serialized objects file, use a command-line invocation of the form:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train trainFilePath [fileRange] -saveToSerializedFile outputFilePath

The text data format is human-readable and modifiable, and consists of five sections, appearing in the following order:

  • Options - consists of variable-value pairs, one per line, which must remain constant across training and parsing.
  • Lexicon - consists of lexical entries, one per line, each of which is preceded by the keyword SEEN or UNSEEN, and followed by a raw count.
  • Unary Grammar - consists of unary rewrite rules, one per line, each of which is of the form A -> B, followed by the normalized log probability.
  • Binary Grammar - consists of binary rewrite rules, one per line, each of which is of the form A -> B C, followed by the normalized log probability.
  • Dependency Grammar

Each section is headed by a line consisting of multiple asterisks (*) and the name of the section. Note that the file format does not support rules of arbitrary arity, only unary and binary rules. To train and then save a parser as a text data file, use a command-line invocation of the form:

java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser -train trainFilePath start stop -saveToTextFile outputFilePath

To parse a file with a saved parser, in either text data or serialized data format, use a command-line invocation of the following form:

java -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser parserFilePath test.txt
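
Programmatically, the reconstruction mentioned above can be done with the same loadModel method used in the code examples earlier (the file name here is illustrative):

// Reconstruct a trained parser from a saved model file; per the text above,
// a parser can be rebuilt from either the serialized or the text data format.
LexicalizedParser lp = LexicalizedParser.loadModel("englishFactored.ser.gz");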

A Note on Text Grammars

If you want to use the text grammars in another parser and duplicate our performance, you will need to know how we handle the POS tagging of rare and unknown words:

  • Unknown words: rather than scoring all words unseen during training with a single distribution over tags, we score unknown words based on their word-shape signatures, defined as follows. Beginning with the original string, all lowercase alphabetic characters are replaced with x, uppercase with X, and digits with d, while other characters are unchanged. Then, consecutive duplicates are eliminated. For example, Formula-1 would become Xx-d. The probability of tags given signatures is estimated on words occurring in only the second half of the training data, then inverted. However, in the current release of the parser, this is all done programmatically, and so the text lexicon contains only a single UNK token. To duplicate our behavior, one would be best off building one's own lexicon with the above behavior; a sketch of the signature mapping appears after this list.
  • Rare words: all words with frequency less than a cut-off (of 100) are allowed to take tags with which they were not seen during training. In this case, they are eligible for (i) all tags that they were seen with, or (ii) any tag an unknown word can receive (the lexicon entry for UNK). The probability of a tag given a rare word is an interpolation of the word's own tag distribution and the unknown distribution for that word's signature. Because of the tag-splitting used in our parser, this ability to take out-of-lexicon tags is fairly important, and it is not represented in our text lexicon.
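
As a concrete reading of the signature definition above, here is a minimal sketch (the class and method names are hypothetical; the lexicon code in the parser is more elaborate):

// Hypothetical sketch of the word-shape signature described above:
// lowercase -> x, uppercase -> X, digit -> d, other characters unchanged,
// then runs of consecutive duplicate characters are collapsed to one.
class WordShape {
  static String signature(String word) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < word.length(); i++) {
      char c = word.charAt(i);
      char mapped;
      if (Character.isLowerCase(c)) mapped = 'x';
      else if (Character.isUpperCase(c)) mapped = 'X';
      else if (Character.isDigit(c)) mapped = 'd';
      else mapped = c;
      if (sb.length() == 0 || sb.charAt(sb.length() - 1) != mapped) {
        sb.append(mapped);
      }
    }
    return sb.toString(); // signature("Formula-1") returns "Xx-d"
  }
}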

For additional information

For more information, you should next look at the Javadocs for the LexicalizedParser class. In particular, the main method of that class documents more precisely a number of the input preprocessing options that were presented chattily above.

@author Dan Klein
@author Christopher Manning
@author Roger Levy
@author Teg Grenager
@author Galen Andrew

package edu.stanford.nlp.parser.lexparser;



