Package edu.stanford.nlp.parser.lexparser
This package contains implementations of three probabilistic parsers for natural language text: an accurate unlexicalized probabilistic context-free grammar (PCFG) parser, a probabilistic lexical dependency parser, and a factored, lexicalized probabilistic context-free grammar parser, which does joint inference over the product of the first two parsers. The parser supports various languages and input formats.
For English, for most purposes, we now recommend just using the unlexicalized PCFG. With a well-engineered grammar (as supplied for English), it is fast, accurate, and requires much less memory. Moreover, in many real-world uses, lexical preferences are unavailable or inaccurate across domains or genres, and in such cases the unlexicalized parser will perform just as well as a lexicalized parser. However, the factored parser will sometimes provide greater accuracy on English through its knowledge of lexical dependencies, and it is considerably better than the PCFG parser alone for most other languages (ones with less rigid word order), including German, Chinese, and Arabic. The dependency parser can be run alone, but this is usually not useful, as its accuracy is much lower. The output of the parser can be presented in various forms, such as just part-of-speech tags, phrase structure trees, or dependencies, and is controlled by options passed to the TreePrint class.
References
The factored parser and the unlexicalized PCFG parser are described in:
- Dan Klein and Christopher D. Manning. 2002. Fast Exact Inference with a Factored Model for Natural Language Parsing. Advances in Neural Information Processing Systems 15 (NIPS 2002).
- Dan Klein and Christopher D. Manning. 2003. Accurate Unlexicalized Parsing. Proceedings of the Association for Computational Linguistics, 2003.
The factored parser uses a product model, in which the preferences of an unlexicalized PCFG parser and a lexicalized dependency parser are combined by a third parser, which does exact search using A* outside estimates (Viterbi outside scores, precalculated during PCFG and dependency parsing of the sentence). We have been splitting up the parser into public classes, but some of the internals are still contained in the file FactoredParser.java.
The class LexicalizedParser provides an interface for either training a parser from a treebank or parsing text using a saved parser. It can be called programmatically, and its command-line main() method supports many options.
The parser has been ported to multiple languages, and German, Chinese, and Arabic grammars are included. The first publication below documents the Chinese parser. The German parser was developed for and used in the second paper (but the paper contains very little detail on it).
- Roger Levy and Christopher D. Manning. 2003. Is it harder to parse Chinese, or the Chinese Treebank? ACL 2003, pp. 439-446.
- Roger Levy and Christopher D. Manning. 2004. Deep dependencies from context-free statistical parsers: correcting the surface dependency approximation. ACL 2004, pp. 328-335.
The grammatical relations output of the parser is presented in:
- Marie-Catherine de Marneffe, Bill MacCartney and Christopher D. Manning. 2006. Generating Typed Dependency Parses from Phrase Structure Parses. LREC 2006.
End user usage
Requirements
You need Java 1.6+ installed on your system, with the java command on your PATH.
You also need a machine with a fair amount of memory. The required memory depends on the choice of parser, the size of the grammar, and other factors, such as the presence of numerous unknown words. To run the PCFG parser on sentences of up to 40 words, you need 100 MB of memory. To handle longer sentences, you need more (to parse sentences of up to 100 words, you need 400 MB). For running the factored parser, 600 MB is needed for sentences of up to 40 words, and factored parsing of sentences of up to 200 words requires around 3 GB of memory.
Training a new lexicalized parser requires about 1500 MB of memory; much less is needed for training a PCFG.
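For example, combining the figures above with the invocation style introduced below, parsing sentences of up to 100 words with the PCFG parser might look like this (a sketch; the input file name follows the later examples):
java -mx400m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 100 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt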
For just parsing text, you need a saved parser model (grammars, lexicon, etc.), which can be represented either as a text file or as a binary (Java serialized object) file, either of which may be gzip-compressed. A number of models are provided in the supplied stanford-parser-$VERSION-models.jar file in the distributed version; they can be accessed from there by having this jar file on your CLASSPATH and specifying them via a classpath entry such as edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz. (Stanford NLP people can also find the grammars in the directory /u/nlp/data/lexparser.) Other available grammars include englishFactored.ser.gz for English and chineseFactored.ser.gz for Chinese.
You need the parser code and grammars accessible. This can be done by having the supplied jar files on your CLASSPATH. The examples below assume you are in the parser distribution home directory; from there you can set up the classpath with the command-line argument -cp "*" (or perhaps -cp "*;" on certain versions of Windows). Then, if you have some sentences in testsent.txt (as plain text), the following commands should work.
Command-line parsing usage
Parsing a local text file:
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser
edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt
Parsing a document over the web:
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser
-maxLength 40 edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz http://nlp.stanford.edu/software/lex-parser.shtml
Note the -maxLength flag: this sets the maximum sentence length to parse. If you do not set one, the parser will try to parse sentences of any length, but will usually run out of memory when trying to do this. This is important with web pages whose text may not consist of real sentences (or with technical documents that turn out to have 300-word sentences).
The parser does only very rudimentary stripping of HTML tags, so it'll work okay on plain text web pages, but it won't work adequately on most complex, script-driven commercial pages. If you want to handle these, you'll need to provide your own preprocessor, and then call the parser on its output.
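For instance, a very naive preprocessor might just strip tags with regular expressions before handing the text to the parser. This is only a sketch (the method name is ours, not part of the parser), and real pages need a proper HTML parser:
// Naive HTML stripping; a sketch, adequate for simple pages at best.
public static String stripTags(String html) {
  return html.replaceAll("(?s)<(script|style)[^>]*>.*?</\\1>", " ")  // drop script/style bodies
             .replaceAll("<[^>]+>", " ")                             // drop remaining tags
             .replaceAll("\\s+", " ").trim();                        // normalize whitespace
}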
The parser will send parse trees to stdout and other information about what it is doing to stderr, so one commonly wants to redirect just stdout to an output file, in the standard way.
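For example, to save the trees while still seeing progress messages on the terminal (the output file name here is arbitrary):
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt > testsent.tree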
Other languages: Chinese
Parsing a Chinese sentence (in the default input encoding for Chinese, GB18030; note that you'll need the right fonts to see the output correctly):
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP
edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent
or for Unicode (UTF-8) format files:
java -mx100m -cp "*"edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP
edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
-encoding UTF-8 edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-onesent-utf
For Chinese, the package includes two simple word segmenters. One is a lexicon-based maximum-match segmenter, and the other uses the parser to do Hidden Markov Model-based word segmentation. These segmentation methods are okay, but if you would like a high-quality segmentation of Chinese text, you will have to segment the Chinese yourself as a preprocessing step. The supplied grammars assume that Chinese input has already been word-segmented according to Penn Chinese Treebank conventions. Choosing Chinese with -tLPP edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams makes space-separated words the default tokenization. To do word segmentation within the parser, give one of the options -segmentMarkov or -segmentMaxMatch.
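A run with in-parser segmentation might look like the following; this is a sketch (the input file name is illustrative, and memory needs may differ):
java -mx300m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -tLPP
edu.stanford.nlp.parser.lexparser.ChineseTreebankParserParams
-segmentMarkov edu/stanford/nlp/models/lexparser/chinesePCFG.ser.gz chinese-unsegmented.txt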
Other languages
The parser also supports other languages including German and French.
Command-line options
The program has many options. The most useful end-user option is -maxLength n, which determines the maximum length of sentence that the parser will parse. Longer sentences are skipped, with a message printed to stderr.
Input formatting and tokenization options
The parser supports many different input formats: tokenized or not, sentence-split or not, and tagged or not.
The input may be tokenized or not, and users may supply their own tokenizers. The input is by default assumed to not be tokenized; if the input is tokenized, supply the option -tokenized. If the input is not tokenized, you may supply the name of a tokenizer class with -tokenizer tokenizerClassName; otherwise the default tokenizer (edu.stanford.nlp.process.PTBTokenizer) is used. This tokenizer should perform well over typical plain newswire-style text.
The input may or may not already have been split into sentences. The input is by default assumed to be unsplit; if sentences are split, supply the option -sentences delimitingToken, where the delimiting token may be any string. As a special case, if the delimiting token is "newline", the parser will assume that each line of the file is a sentence.
Simple XML can also be parsed. The main method does not incorporate a true XML parser, but one can fake certain simple cases with the option -parseInside regex, which will only parse the tokens inside elements matched by the regular expression regex. These elements are assumed to be pure CDATA. If you use -parseInside s, then the parser will accept input in which sentences are marked XML-style with <s> ... </s> (the same format as the input to Eugene Charniak's parser).
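For example, to parse only material inside <s> elements (the input file name here is illustrative):
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -parseInside s edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz marked-sentences.xml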
Finally, the input may be tagged or not. If it is tagged, the program assumes that words and tags are separated by a non-whitespace separating character such as '/' or '_'. Give the option -tagSeparator tagSeparator to specify tagged text with a tag separator. You also need to tell the parser to use a different tokenizer, using the flags:
-tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory
You can see examples of many of these options in the test directory. As an example, you can parse the example file with partial POS-tagging with this command:
java edu.stanford.nlp.parser.lexparser.LexicalizedParser -maxLength 20 -sentences newline -tokenized -tagSeparator / -tokenizerFactory edu.stanford.nlp.process.WhitespaceTokenizer -tokenizerMethod newCoreLabelTokenizerFactory englishPCFG.ser.gz pos-sentences.txt
There are some restrictions on the interpretation of POS-tagged input:
- The tagset must match the parser's POS set. If you are using our supplied parser data files, that means you must use Penn Treebank POS tags.
- An indicated tagging will determine which of the taggings allowed by the lexicon is used, but the parser will not accept tags not allowed by its lexicon. This is usually not problematic, since rare or unknown words are allowed to have many POS tags, but it would be if you were trying to persuade the parser that "are" should be tagged as a noun in the sentence "100 are make up one hectare", since the lexicon will only allow "are" to have a verbal tagging.
For the examples in pos-sentences.txt:
1. This sentence is parsed correctly with no tags given.
2. It is also parsed correctly when the parser is told that butter is a verb.
3. You get a different, worse parse when telling it that butter is a noun.
4. You get the same parse as 1. with all tags correctly supplied.
5. The parser won't accept can as a VB, but does accept butter as a noun, so you get the same parse as 3.
6. "People can butter" can be an NP.
7. Most words can be NN, but not common function words like their, with, a.
Note that if the program is reading tags correctly, they aren't printed in the sentence that it reports it is parsing; only the words are printed there.
Output formatting options
You can set how sentences are printed out by using the -outputFormat format option. The native and default format is trees as formatted in the Penn Treebank, but there are a number of other useful options:
- penn: The default; Penn Treebank-style trees.
- oneline: Each tree printed out on one line.
- wordsAndTags: Use the parser as a POS tagger.
- latexTree: Help write your LaTeX papers (for use with Avery Andrews' trees.sty package).
- typedDependenciesCollapsed: Write sentences in a typed dependency format that represents sentences via grammatical relations between words. Suitable for representing text as a semantic network.
You can get each sentence printed in multiple formats by giving a comma-separated list of formats. See the TreePrint class for more information on available output formats and options.
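For example, to print each sentence both as a Penn Treebank tree and as collapsed typed dependencies:
java -mx100m -cp "*" edu.stanford.nlp.parser.lexparser.LexicalizedParser -outputFormat "penn,typedDependenciesCollapsed" edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz testsent.txt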
Programmatic usage
LexicalizedParser can be easily called from within a larger application. It implements a couple of useful interfaces that provide for simple use: edu.stanford.nlp.parser.ViterbiParser and edu.stanford.nlp.process.Function. The following simple class shows typical usage:
import java.util.*;

import edu.stanford.nlp.ling.*;
import edu.stanford.nlp.trees.*;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

class ParserDemo {

  public static void main(String[] args) {
    // Load a saved parser model from the models jar on the classpath.
    LexicalizedParser lp = LexicalizedParser.loadModel("edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");
    lp.setOptionFlags(new String[]{"-maxLength", "80", "-retainTmpSubcategories"});

    // Parse an already tokenized sentence and print the tree.
    String[] sent = { "This", "is", "an", "easy", "sentence", "." };
    List<CoreLabel> rawWords = Sentence.toCoreLabelList(sent);
    Tree parse = lp.apply(rawWords);
    parse.pennPrint();
    System.out.println();

    // Convert the phrase structure tree to typed dependencies.
    TreebankLanguagePack tlp = new PennTreebankLanguagePack();
    GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
    GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
    List<TypedDependency> tdl = gs.typedDependenciesCCprocessed();
    System.out.println(tdl);
    System.out.println();

    // Alternatively, print in multiple formats via TreePrint.
    TreePrint tp = new TreePrint("penn,typedDependenciesCollapsed");
    tp.printTree(parse);
  }

}
In a usage such as this, the parser expects sentences already tokenized according to Penn Treebank conventions. For arbitrary text, prior processing must be done to achieve such tokenization (the main method of LexicalizedParser provides an example of doing this). The example shows how most command-line arguments can also be passed to the parser when it is called programmatically. Note that using the -retainTmpSubcategories option is necessary to get the best results in the typed dependencies output when recognizing temporal noun phrases ("last week", "next February"). Some code fragments that include tokenization using Penn Treebank conventions follow:
import java.io.StringReader;
import java.util.List;

import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.objectbank.TokenizerFactory;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;

LexicalizedParser lp = LexicalizedParser.loadModel("englishPCFG.ser.gz");
lp.setOptionFlags(new String[]{"-outputFormat", "penn,typedDependenciesCollapsed", "-retainTmpSubcategories"});

// A reusable factory for Penn Treebank-style tokenizers.
TokenizerFactory<CoreLabel> tokenizerFactory = PTBTokenizer.factory(new CoreLabelTokenFactory(), "");

public Tree processSentence(String sentence) {
  // Tokenize the raw string, then parse the resulting token list.
  List<CoreLabel> rawWords = tokenizerFactory.getTokenizer(new StringReader(sentence)).tokenize();
  Tree bestParse = lp.parseTree(rawWords);
  return bestParse;
}
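Assuming the fragments above are placed in a class with lp and tokenizerFactory as fields, usage might look like this (a sketch, not part of the original example):
Tree tree = processSentence("This is an easy sentence.");
tree.pennPrint(); // print the best parse in Penn Treebank format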
Writing and reading trained parsers to and from files
A trained parser consists of grammars, a lexicon, and option values. Once a parser has been trained, it may be written to a file in one of two formats: binary serialized Java objects or human-readable text data. A parser can also be quickly reconstructed (either programmatically or at the command line) from files containing a parser in either of these formats.
The binary serialized Java objects format is created using standard tools provided by the java.io package; it is not text and not human-readable. To train and then save a parser as a binary serialized objects file, use a command-line invocation of the form:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
-train trainFilePath [fileRange] -saveToSerializedFile outputFilePath
The text data format is human-readable and modifiable, and consists of five sections, appearing in the following order:
- Options - consists of variable-value pairs, one per line, which must remain constant across training and parsing.
- Lexicon - consists of lexical entries, one per line, each of which is preceded by the keyword SEEN or UNSEEN, and followed by a raw count.
- Unary Grammar - consists of unary rewrite rules, one per line, each of which is of the form A -> B, followed by the normalized log probability.
- Binary Grammar - consists of binary rewrite rules, one per line, each of which is of the form A -> B C, followed by the normalized log probability.
- Dependency Grammar
Each section is headed by a line consisting of multiple asterisks (*) and the name of the section. Note that the file format supports only unary and binary rules, not rules of arbitrary arity. To train and then save a parser as a text data file, use a command-line invocation of the form:
java -mx1500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
-train trainFilePath start stop -saveToTextFile outputFilePath
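As a very rough illustration of the text format's layout (the exact header strings and the scores here are invented for exposition, not excerpted from a real grammar file):
***** UNARY_GRAMMAR *****
NP -> NNP -1.386
***** BINARY_GRAMMAR *****
NP -> DT NN -0.693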
To parse a file with a saved parser, in either text data or serialized data format, use a command-line invocation of the following form:
java -mx500m edu.stanford.nlp.parser.lexparser.LexicalizedParser
parserFilePath test.txt
A Note on Text Grammars
If you want to use the text grammars in another parser and duplicate our
performance, you will need to know how we handle the POS tagging of rare
and unknown words:
- Unknown Words: rather than scoring all words unseen during training with a single distribution over tags, we score unknown words based on their word shape signatures, defined as follows. Beginning with the original string, all lowercase alphabetic characters are replaced with x, uppercase with X, and digits with d; other characters are unchanged. Then, consecutive duplicates are eliminated. For example, Formula-1 would become Xx-d. (A sketch of this transformation appears after this list.) The probability of tags given signatures is estimated on words occurring in only the second half of the training data, and then inverted. However, in the current release of the parser, this is all done programmatically, and so the text lexicon contains only a single UNK token. To duplicate our behavior, one would be best off building one's own lexicon with the above behavior.
- Rare Words: all words with frequency less than a cut-off (of 100) are allowed to take tags with which they were not seen during training. In this case, they are eligible for (i) all tags that they were seen with, and (ii) any tag an unknown word can receive (the lexicon entry for UNK). The probability of a tag given a rare word is an interpolation of the word's own tag distribution and the unknown distribution for that word's signature. Because of the tag-splitting used in our parser, this ability to take out-of-lexicon tags is fairly important, and it is not represented in our text lexicon.
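The word shape signature computation described above is simple to reproduce. Here is a minimal sketch in Java (the class and method names are ours, not the parser's API):
public class WordShape {

  // Map each character per the rules above, collapsing consecutive duplicates.
  public static String signature(String word) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < word.length(); i++) {
      char c = word.charAt(i);
      char mapped;
      if (Character.isUpperCase(c)) {
        mapped = 'X';       // uppercase letters become X
      } else if (Character.isLowerCase(c)) {
        mapped = 'x';       // lowercase letters become x
      } else if (Character.isDigit(c)) {
        mapped = 'd';       // digits become d
      } else {
        mapped = c;         // other characters are unchanged
      }
      if (sb.length() == 0 || sb.charAt(sb.length() - 1) != mapped) {
        sb.append(mapped);  // eliminate consecutive duplicates
      }
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(signature("Formula-1")); // prints Xx-d
  }

}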
For additional information
For more information, you should next look at the Javadocs for the LexicalizedParser class. In particular, the main method of that class documents more precisely a number of the input preprocessing options that were presented chattily above.
@author Dan Klein
@author Christopher Manning
@author Roger Levy
@author Teg Grenager
@author Galen Andrew