Stanford Parser processes raw text in English, Chinese, German, Arabic, and French, and extracts constituency parse trees.
/**
* Linguistic Annotation Pipeline
* The point of this package is to enable people to quickly and
* painlessly get complete linguistic annotations of their text. It
* is designed to be highly flexible and extensible. I will first discuss
* the organization and functions of the classes, and then I will give some
* sample code and a run-down of the implemented Annotators.
*
* Annotation
* An Annotation is the data structure that holds the results of Annotators.
* An Annotation is basically a map from keys to bits of annotation, such
* as the parse, the part-of-speech tags, or named entity tags. Annotations
* are designed to operate at the sentence level; however, depending on the
* Annotators you use, this may not be how you choose to use the package.
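*
* The map-from-keys idea can be illustrated with a small, self-contained sketch.
* Everything in it (TypedMapSketch, Key, TextKey, TokensKey, TypedMap) is a
* hypothetical stand-in invented for illustration and is not this package's API;
* it only shows how using classes as keys lets the compiler know the type of
* each stored value.
*
```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified illustration of a class-keyed map; not the library's Annotation class.
final class TypedMapSketch {

  // A key is a class; its type parameter fixes the type of the value stored under it.
  interface Key<V> {}

  // Two made-up keys for this sketch.
  static final class TextKey implements Key<String> {}
  static final class TokensKey implements Key<List<String>> {}

  static final class TypedMap {
    private final Map<Class<?>, Object> values = new HashMap<>();

    // Stores a value under a key class; the compiler checks the value against the key's type.
    <V> void set(Class<? extends Key<V>> key, V value) {
      values.put(key, value);
    }

    // Returns the value stored under the key class, or null if nothing was stored.
    @SuppressWarnings("unchecked")
    <V> V get(Class<? extends Key<V>> key) {
      return (V) values.get(key);
    }
  }

  public static void main(String[] args) {
    TypedMap annotation = new TypedMap();
    annotation.set(TextKey.class, "Stanford is in California");
    annotation.set(TokensKey.class, Arrays.asList("Stanford", "is", "in", "California"));

    String text = annotation.get(TextKey.class);            // no cast needed at the call site
    List<String> tokens = annotation.get(TokensKey.class);  // no cast needed at the call site
    System.out.println(text + " has " + tokens.size() + " tokens");
  }
}
```
*
* Looking a value up by key class, as in annotation.get(TextKey.class), has the
* same shape as the get(...) calls in the sample code later in this document.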
* Annotators
* Annotators are the backbone of this package. Annotators are a lot like
* functions, except that they operate over Annotations instead of Objects.
* They do things like tokenize, parse, or NER tag sentences. In the
* javadocs of your Annotator you should specify what the Annotator
* assumes already exists (for instance, the NERAnnotator assumes that the
* sentence has been tokenized) and where to find these annotations (in
* the example from the previous set of parentheses, it would be
* TextAnnotation.class). The javadocs should also specify what the Annotator
* adds to the annotation, and where.
* AnnotationPipeline
* An AnnotationPipeline is where many Annotators are strung together
* to form a linguistic annotation pipeline. It is, itself, an
* Annotator. AnnotationPipelines usually also keep track of how much time
* they spend annotating and loading to assist users in finding where the
* time sinks are.
* However, the class AnnotationPipeline is not meant to be used as is.
* It serves as an example of how to build your own pipeline.
* If you just want to use a typical NLP pipeline, take a look at StanfordCoreNLP
* (described later in this document).
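*
* To make the "a pipeline is itself an Annotator" idea concrete, here is a
* minimal sketch under stated assumptions. The types below (PipelineSketch,
* its nested Annotator and Pipeline, and the string keys "text", "tokens",
* "tokenCount") are hypothetical simplifications written for this example,
* not the classes of this package. The sketch only demonstrates chaining,
* nesting one pipeline inside another, and accumulating the time spent.
*
```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified types; not the library's Annotator or AnnotationPipeline.
final class PipelineSketch {

  // An annotator reads what earlier annotators stored and adds its own results.
  interface Annotator {
    void annotate(Map<String, Object> annotation);
  }

  // A pipeline runs its annotators in order and is itself an Annotator.
  static final class Pipeline implements Annotator {
    private final List<Annotator> annotators = new ArrayList<>();
    private long totalNanos = 0L;

    void addAnnotator(Annotator annotator) {
      annotators.add(annotator);
    }

    @Override
    public void annotate(Map<String, Object> annotation) {
      long start = System.nanoTime();
      for (Annotator annotator : annotators) {
        annotator.annotate(annotation);
      }
      totalNanos += System.nanoTime() - start; // accumulated across calls
    }

    long totalNanos() {
      return totalNanos;
    }
  }

  // Builds a two-stage toy pipeline and runs it over the given text.
  static Map<String, Object> run(String text) {
    Pipeline tokenizing = new Pipeline();
    // Toy whitespace tokenizer: stores a list of tokens under "tokens".
    tokenizing.addAnnotator(
        a -> a.put("tokens", Arrays.asList(((String) a.get("text")).trim().split("\\s+"))));

    Pipeline full = new Pipeline();
    full.addAnnotator(tokenizing); // a whole pipeline nested as a single annotator
    // This stage assumes "tokens" already exists, so it must run after tokenizing.
    full.addAnnotator(a -> a.put("tokenCount", ((List<?>) a.get("tokens")).size()));

    Map<String, Object> annotation = new LinkedHashMap<>();
    annotation.put("text", text);
    full.annotate(annotation);
    return annotation;
  }

  public static void main(String[] args) {
    System.out.println(run("Stanford is in California"));
  }
}
```
*
* The second stage depends on the first having run, which is the kind of
* precondition an Annotator's javadocs are expected to state.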
* Sample Usage
* Here is some sample code which illustrates the intended usage
* of the package:
*
* public void testPipeline(String text) throws Exception {
*   // create pipeline
*   AnnotationPipeline pipeline = new AnnotationPipeline();
*   pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));
*   pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
*   pipeline.addAnnotator(new POSTaggerAnnotator(false));
*   pipeline.addAnnotator(new MorphaAnnotator(false));
*   pipeline.addAnnotator(new NERCombinerAnnotator(false));
*   pipeline.addAnnotator(new ParserAnnotator(false, -1));
*
*   // create annotation with text
*   Annotation document = new Annotation(text);
*
*   // annotate text with pipeline
*   pipeline.annotate(document);
*
*   // demonstrate typical usage
*   for (CoreMap sentence: document.get(CoreAnnotations.SentencesAnnotation.class)) {
*     // get the tree for the sentence
*     Tree tree = sentence.get(TreeAnnotation.class);
*
*     // get the tokens for the sentence and iterate over them
*     for (CoreLabel token: sentence.get(CoreAnnotations.TokensAnnotation.class)) {
*       // get token attributes
*       String tokenText = token.get(TextAnnotation.class);
*       String tokenPOS = token.get(PartOfSpeechAnnotation.class);
*       String tokenLemma = token.get(LemmaAnnotation.class);
*       String tokenNE = token.get(NamedEntityTagAnnotation.class);
*     }
*   }
* }
*
* Existing Annotators
* There already exist Annotators for many common tasks, all of which include
* default model locations, so they can just be used off the shelf. They are:
*
* - TokenizerAnnotator - tokenizes the text based on language or Tokenizer class specifications
* - WordsToSentencesAnnotator - splits a sequence of words into a sequence of sentences
* - POSTaggerAnnotator - annotates the text with part-of-speech tags
* - MorphaAnnotator - morphological normalizer (generates lemmas)
* - NERClassifierCombiner - combines several NER models
* - TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text)
* - ParserAnnotator - generates constituent and dependency trees
* - NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates
* - TimeWordAnnotator - recognizes common temporal expressions, such as "teatime"
* - QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities
* - DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model
* - NFLAnnotator - implements entity and relation mention extraction for the NFL domain
*
* How Do I Use This?
* You do not have to construct your pipeline from scratch! For typical NLP processing, use
* StanfordCoreNLP. This pipeline implements the most common functionality needed: tokenization,
* lemmatization, POS tagging, NER, parsing, and coreference resolution. Read below for how to use
* this pipeline from the command line, or directly in your Java code.
* Using StanfordCoreNLP from the Command Line
* The command line for StanfordCoreNLP is:
*
* ./bin/stanfordcorenlp.sh
*
* or
*
* java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR_CONFIGURATION_FILE ] -file YOUR_INPUT_FILE
*
* where the following properties are defined:
* (if -props or annotators is not defined, default properties will be loaded via the classpath)
*
* "annotators" - comma separated list of annotators
* The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, dcoref, nfl
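*
* As a rough illustration of what a configuration file for -props might contain,
* the sketch below renders properties-file text with a single "annotators" entry
* using only java.util.Properties. The class PropsSketch and its render method
* are hypothetical helpers written for this example; the list of supported names
* is copied from the line above, and the check against it is an assumption of
* this sketch, not a description of how StanfordCoreNLP validates its input.
*
```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper for illustration; not part of this package.
final class PropsSketch {

  // Annotator names listed as supported in this document.
  static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
      "tokenize", "ssplit", "pos", "lemma", "ner", "truecase", "parse", "dcoref", "nfl"));

  // Renders properties-file text with one "annotators" entry,
  // rejecting any name that is not in the supported list.
  static String render(List<String> annotators) {
    for (String name : annotators) {
      if (!SUPPORTED.contains(name)) {
        throw new IllegalArgumentException("Unsupported annotator: " + name);
      }
    }
    Properties props = new Properties();
    props.setProperty("annotators", String.join(",", annotators));
    StringWriter out = new StringWriter();
    try {
      props.store(out, null); // writes a comment line, then key=value lines
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return out.toString();
  }

  public static void main(String[] args) {
    System.out.print(render(
        Arrays.asList("tokenize", "ssplit", "pos", "lemma", "ner", "parse", "dcoref")));
  }
}
```
*
* Saved to a file of your choosing, the printed text could then be passed as
* YOUR_CONFIGURATION_FILE in the command line shown earlier.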
*
* More information is available here: Stanford CoreNLP
*
* The StanfordCoreNLP API
* More information is available here: Stanford CoreNLP
*
* @author Jenny Finkel
* @author Mihai Surdeanu
* @author Steven Bethard
* @author David McClosky
* Last modified: May 7, 2012
*/
package edu.stanford.nlp.pipeline;