edu.stanford.nlp.pipeline.package-info

Stanford Parser processes raw text in English, Chinese, German, Arabic, and French, and extracts constituency parse trees.
/**
 * Linguistic Annotation Pipeline
 * The point of this package is to enable people to quickly and
 * painlessly get complete linguistic annotations of their text.  It
 * is designed to be highly flexible and extensible.  I will first discuss
 * the organization and functions of the classes, and then I will give some
 * sample code and a run-down of the implemented Annotators.
 * 
 * 
 * Annotation
 * An Annotation is the data structure which holds the results of Annotators.
 * An Annotation is basically a map from keys to bits of annotation, such
 * as the parse, the part-of-speech tags, or named entity tags.  Annotations
 * are designed to operate at the sentence level; however, depending on the
 * Annotators you use, this may not be how you choose to use the package.
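 * As a rough illustration of the "map from keys to bits of annotation" idea,
 * here is a minimal, self-contained sketch.  It is not the actual
 * implementation: ToyAnnotation, Key, TextKey, and TokenCountKey are
 * hypothetical names invented for this example.  It only shows how a class
 * object can serve as a typed key, so that get() returns the right value type.

```java
import java.util.HashMap;
import java.util.Map;

// Toy stand-in for the idea of an Annotation: a map from typed keys to values.
// Every name in this file is hypothetical; none of these are CoreNLP classes.
final class ToyAnnotationSketch {

    // A key is a class whose type parameter records the type of its value.
    interface Key<V> {}

    static final class TextKey implements Key<String> {}
    static final class TokenCountKey implements Key<Integer> {}

    static final class ToyAnnotation {
        private final Map<Class<?>, Object> values = new HashMap<>();

        // Stores a value under a key class; V is inferred from the key.
        <V> void set(Class<? extends Key<V>> key, V value) {
            values.put(key, value);
        }

        // Returns the value stored under the key class, or null if absent.
        // The unchecked cast is safe here because set() only ever stores a V
        // under a key class that implements Key<V>.
        <V> V get(Class<? extends Key<V>> key) {
            return (V) values.get(key);
        }

        boolean has(Class<? extends Key<?>> key) {
            return values.containsKey(key);
        }
    }

    public static void main(String[] args) {
        ToyAnnotation annotation = new ToyAnnotation();
        annotation.set(TextKey.class, "Stanford is in California.");
        annotation.set(TokenCountKey.class, 5);

        // No casts are needed at the call site.
        String text = annotation.get(TextKey.class);
        int tokenCount = annotation.get(TokenCountKey.class);
        System.out.println(text + " / " + tokenCount);
    }
}
```

 * The call sites mirror the sentence.get(SomeAnnotation.class) style used in
 * the sample code later in this document.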
 * Annotators
 * Annotators are the backbone of this package.  Annotators are a lot like
 * functions, except that they operate over Annotations instead of Objects.
 * They do things like tokenize, parse, or NER tag sentences.  In the
 * javadocs of your Annotator you should specify which annotations the
 * Annotator assumes already exist (for instance, the NERAnnotator assumes
 * that the sentence has been tokenized) and where to find them (in the
 * NERAnnotator example, under TextAnnotation.class).  The javadocs should
 * also specify what the Annotator adds to the Annotation, and where.
 * AnnotationPipeline
 * An AnnotationPipeline is where many Annotators are strung together
 * to form a linguistic annotation pipeline.  It is, itself, an
 * Annotator.  AnnotationPipelines usually also keep track of how much time
 * they spend annotating and loading to assist users in finding where the
 * time sinks are.
 * However, the class AnnotationPipeline is not meant to be used as is.
 * It serves as an example of how to build your own pipeline.
 * If you just want to use a typical NLP pipeline, take a look at StanfordCoreNLP
 * (described later in this document).
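 * To make the "a pipeline is itself an Annotator" idea concrete, here is a
 * small sketch under simplifying assumptions.  It does not use the classes of
 * this package: ToyAnnotator, ToyPipeline, WhitespaceTokenizer, and
 * TokenCounter are hypothetical, and the annotation is reduced to a plain
 * String-keyed map.  It shows annotators that document what they require and
 * add, a pipeline that runs them in order, and per-annotator timing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy sketch only; none of these types are the CoreNLP API.
final class ToyPipelineSketch {

    interface ToyAnnotator {
        void annotate(Map<String, Object> annotation);
    }

    // Requires: "text" (String).  Adds: "tokens" (List<String>).
    static final class WhitespaceTokenizer implements ToyAnnotator {
        public void annotate(Map<String, Object> annotation) {
            String text = (String) annotation.get("text");
            annotation.put("tokens", Arrays.asList(text.trim().split("\\s+")));
        }
    }

    // Requires: "tokens" (List<String>).  Adds: "tokenCount" (Integer).
    static final class TokenCounter implements ToyAnnotator {
        public void annotate(Map<String, Object> annotation) {
            List<?> tokens = (List<?>) annotation.get("tokens");
            if (tokens == null) {
                throw new IllegalStateException("TokenCounter requires \"tokens\"");
            }
            annotation.put("tokenCount", tokens.size());
        }
    }

    // A pipeline is an annotator made of annotators, so pipelines can nest.
    static final class ToyPipeline implements ToyAnnotator {
        private final List<ToyAnnotator> annotators = new ArrayList<>();
        private final List<Long> nanosSpent = new ArrayList<>();

        void addAnnotator(ToyAnnotator annotator) {
            annotators.add(annotator);
            nanosSpent.add(0L);
        }

        public void annotate(Map<String, Object> annotation) {
            for (int i = 0; i < annotators.size(); i++) {
                long start = System.nanoTime();
                annotators.get(i).annotate(annotation);
                nanosSpent.set(i, nanosSpent.get(i) + (System.nanoTime() - start));
            }
        }

        // One line per annotator with the accumulated time in milliseconds.
        String timingReport() {
            StringBuilder report = new StringBuilder();
            for (int i = 0; i < annotators.size(); i++) {
                report.append(annotators.get(i).getClass().getSimpleName())
                      .append(": ")
                      .append(nanosSpent.get(i) / 1_000_000.0)
                      .append(" ms\n");
            }
            return report.toString();
        }
    }

    public static void main(String[] args) {
        ToyPipeline pipeline = new ToyPipeline();
        pipeline.addAnnotator(new WhitespaceTokenizer());
        pipeline.addAnnotator(new TokenCounter());

        Map<String, Object> annotation = new HashMap<>();
        annotation.put("text", "Stanford is in California");
        pipeline.annotate(annotation);

        System.out.println(annotation.get("tokens") + " " + annotation.get("tokenCount"));
        System.out.print(pipeline.timingReport());
    }
}
```

 * Swapping the order of the two addAnnotator calls makes TokenCounter fail
 * fast, which is why each annotator states its requirements up front.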
 * Sample Usage
 * Here is some sample code which illustrates the intended usage
 * of the package:
 * // imports the sample relies on (the Annotator classes live in this package)
 * import edu.stanford.nlp.ling.CoreAnnotations;
 * import edu.stanford.nlp.ling.CoreLabel;
 * import edu.stanford.nlp.trees.Tree;
 * import edu.stanford.nlp.trees.TreeCoreAnnotations;
 * import edu.stanford.nlp.util.CoreMap;
 *
 * public void testPipeline(String text) throws Exception {
 *   // create pipeline; annotators run in the order they are added
 *   AnnotationPipeline pipeline = new AnnotationPipeline();
 *   pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));
 *   pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
 *   pipeline.addAnnotator(new POSTaggerAnnotator(false));
 *   pipeline.addAnnotator(new MorphaAnnotator(false));
 *   pipeline.addAnnotator(new NERCombinerAnnotator(false));
 *   pipeline.addAnnotator(new ParserAnnotator(false, -1));
 *   // create annotation with text
 *   Annotation document = new Annotation(text);
 *   // annotate text with pipeline
 *   pipeline.annotate(document);
 *   // demonstrate typical usage
 *   for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
 *     // get the tree for the sentence
 *     Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
 *     // get the tokens for the sentence and iterate over them
 *     for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
 *       // get token attributes
 *       String tokenText = token.get(CoreAnnotations.TextAnnotation.class);
 *       String tokenPOS = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
 *       String tokenLemma = token.get(CoreAnnotations.LemmaAnnotation.class);
 *       String tokenNE = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
 *     }
 *   }
 * }
 * 
 * Existing Annotators
 * There already exist Annotators for many common tasks, all of which include
 * default model locations, so they can just be used off the shelf.  They are:
 * 
 * TokenizerAnnotator - tokenizes the text based on language or Tokenizer class specifications 
 * WordsToSentencesAnnotator - splits a sequence of words into a sequence of sentences
 * POSTaggerAnnotator - annotates the text with part-of-speech tags 
 * MorphaAnnotator - morphological normalizer (generates lemmas)
 * NERCombinerAnnotator - combines several NER models
 * TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text)
 * ParserAnnotator - generates constituent and dependency trees
 * NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates
 * TimeWordAnnotator - recognizes common temporal expressions, such as "teatime"
 * QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities
 * DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model 
 * NFLAnnotator - implements entity and relation mention extraction for the NFL domain
 * 
 * How Do I Use This?
 * You do not have to construct your pipeline from scratch! For typical NLP tasks, use
 * StanfordCoreNLP. This pipeline implements the most commonly needed functionality: tokenization,
 * lemmatization, POS tagging, NER, parsing, and coreference resolution. Read below for how to use
 * this pipeline from the command line or directly in your Java code.
 * Using StanfordCoreNLP from the Command Line
 * The command line for StanfordCoreNLP is:
 *   ./bin/stanfordcorenlp.sh
 * 
 * or
 *   java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR_CONFIGURATION_FILE ] -file YOUR_INPUT_FILE
 * 
 * where the following properties are defined
 * (if -props or annotators is not defined, default properties will be loaded via the classpath):
 *   "annotators" - comma-separated list of annotators.
 *     The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, dcoref, nfl
 * 
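 * As an illustration of what a configuration file with the "annotators"
 * property might contain, the sketch below parses one with the standard
 * java.util.Properties class and splits the comma-separated list.  The class
 * AnnotatorsPropertySketch and its helper methods are hypothetical, and this
 * is not necessarily how StanfordCoreNLP itself reads the property.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; not part of this package.
final class AnnotatorsPropertySketch {

    // Parses properties-file text, e.g. the contents of YOUR_CONFIGURATION_FILE.
    static Properties parse(String fileContents) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(fileContents));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    // Splits the "annotators" property on commas, trimming surrounding spaces.
    // Returns an empty list when the property is missing.
    static List<String> annotatorNames(Properties props) {
        List<String> names = new ArrayList<>();
        for (String name : props.getProperty("annotators", "").split(",")) {
            String trimmed = name.trim();
            if (!trimmed.isEmpty()) {
                names.add(trimmed);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // A one-line configuration file selecting six of the supported annotators.
        Properties props = parse("annotators = tokenize, ssplit, pos, lemma, ner, parse\n");
        System.out.println(annotatorNames(props));
    }
}
```

 * A Properties object like this one is the kind of configuration the
 * StanfordCoreNLP class is constructed from when used directly in Java code.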
 * More information is available here: Stanford CoreNLP
 * 
 * The StanfordCoreNLP API
 * More information is available here: Stanford CoreNLP
 * 
 * @author Jenny Finkel
 * @author Mihai Surdeanu
 * @author Steven Bethard
 * @author David McClosky
 *  Last modified: May 7, 2012 
 */
package edu.stanford.nlp.pipeline;