
edu.stanford.nlp.pipeline.package-info


Stanford CoreNLP provides a set of natural language analysis tools which can take raw English language text input and give the base forms of words, their parts of speech, whether they are names of companies, people, etc., normalize dates, times, and numeric quantities, mark up the structure of sentences in terms of phrases and word dependencies, and indicate which noun phrases refer to the same entities. It provides the foundational building blocks for higher level text understanding applications.

Linguistic Annotation Pipeline

The point of this package is to enable people to quickly and painlessly get complete linguistic annotations of their text. It is designed to be highly flexible and extensible. I will first discuss the organization and functions of the classes, and then give some sample code and a run-down of the implemented Annotators.

Annotation

An Annotation is the data structure which holds the results of annotators. An Annotation is basically a map from keys to bits of annotation, such as the parse, the part-of-speech tags, or named entity tags. Annotations are designed to operate at the sentence level; however, depending on the Annotators you use, this may not be how you choose to use the package.
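A minimal sketch of this map-like behavior (the class name is just for illustration; only the core classes are assumed, and no annotators need to have run yet):

import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;

public class AnnotationMapSketch {
  public static void main(String[] args) {
    // An Annotation is essentially a typed map from key classes to values.
    Annotation annotation = new Annotation("Stanford University is located in California.");

    // The constructor stores the raw text under TextAnnotation; read it back by key.
    String text = annotation.get(CoreAnnotations.TextAnnotation.class);
    System.out.println(text);

    // Annotators add further keys (e.g. CoreAnnotations.SentencesAnnotation.class)
    // once a pipeline has been run over this Annotation.
  }
}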

Annotators

The backbone of this package is the Annotators. Annotators are a lot like functions, except that they operate over Annotations instead of Objects. They do things like tokenize, parse, or NER tag sentences. In the javadocs of your Annotator you should specify what the Annotator assumes already exists (for instance, the NERAnnotator assumes that the sentence has been tokenized) and where to find these annotations (in the example from the previous set of parentheses, it would be TextAnnotation.class). You should also specify what the Annotator adds to the annotation, and where.
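A rough sketch of that contract (this class does not implement the real Annotator interface, whose required methods vary between releases; WordCountAnnotation is a hypothetical key invented for the example):

import edu.stanford.nlp.ling.CoreAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.pipeline.Annotation;

// Hypothetical key class: the number of tokens in the document.
class WordCountAnnotation implements CoreAnnotation<Integer> {
  public Class<Integer> getType() {
    return Integer.class;
  }
}

public class WordCountAnnotatorSketch {
  // The essential shape of an Annotator: read the keys it assumes are already
  // present, compute something, and write the result under its own key.
  public void annotate(Annotation annotation) {
    // Assumes a tokenizer has already filled in TokensAnnotation.
    int n = annotation.get(CoreAnnotations.TokensAnnotation.class).size();
    annotation.set(WordCountAnnotation.class, n);
  }
}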

AnnotationPipeline

An AnnotationPipeline is where many Annotators are strung together to form a linguistic annotation pipeline. It is, itself, an Annotator. AnnotationPipelines usually also keep track of how much time they spend annotating and loading, to assist users in finding where the time sinks are. However, the class AnnotationPipeline is not meant to be used as is; it serves as an example of how to build your own pipeline. If you just want to use a typical NLP pipeline, take a look at StanfordCoreNLP (described later in this document).

Sample Usage

Here is some sample code which illustrates the intended usage of the package:
// imports used by the sample below
import edu.stanford.nlp.ling.CoreAnnotations;
import edu.stanford.nlp.ling.CoreAnnotations.LemmaAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.NamedEntityTagAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.PartOfSpeechAnnotation;
import edu.stanford.nlp.ling.CoreAnnotations.TextAnnotation;
import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.AnnotationPipeline;
import edu.stanford.nlp.pipeline.MorphaAnnotator;
import edu.stanford.nlp.pipeline.NERCombinerAnnotator;
import edu.stanford.nlp.pipeline.POSTaggerAnnotator;
import edu.stanford.nlp.pipeline.ParserAnnotator;
import edu.stanford.nlp.pipeline.TokenizerAnnotator;
import edu.stanford.nlp.pipeline.WordsToSentencesAnnotator;
import edu.stanford.nlp.trees.Tree;
import edu.stanford.nlp.trees.TreeCoreAnnotations.TreeAnnotation;
import edu.stanford.nlp.util.CoreMap;

public void testPipeline(String text) throws Exception {
  // create pipeline
  AnnotationPipeline pipeline = new AnnotationPipeline();
  pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));
  pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
  pipeline.addAnnotator(new POSTaggerAnnotator(false));
  pipeline.addAnnotator(new MorphaAnnotator(false));
  pipeline.addAnnotator(new NERCombinerAnnotator(false));
  pipeline.addAnnotator(new ParserAnnotator(false, -1));
  // create annotation with text
  Annotation document = new Annotation(text);
  // annotate text with pipeline
  pipeline.annotate(document);
  // demonstrate typical usage
  for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
    // get the tree for the sentence
    Tree tree = sentence.get(TreeAnnotation.class);
    // get the tokens for the sentence and iterate over them
    for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
      // get token attributes
      String tokenText = token.get(TextAnnotation.class);
      String tokenPOS = token.get(PartOfSpeechAnnotation.class);
      String tokenLemma = token.get(LemmaAnnotation.class);
      String tokenNE = token.get(NamedEntityTagAnnotation.class);
    }
  }
}

Existing Annotators

There already exist Annotators for many common tasks, all of which include default model locations, so they can just be used off the shelf. They are:
  • TokenizerAnnotator - tokenizes the text based on language or Tokenizer class specifications
  • WordsToSentencesAnnotator - splits a sequence of words into a sequence of sentences
  • POSTaggerAnnotator - annotates the text with part-of-speech tags
  • MorphaAnnotator - morphological normalizer (generates lemmas)
  • NERClassifierCombiner - combines several NER models
  • TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text)
  • ParserAnnotator - generates constituent and dependency trees
  • NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates
  • TimeWordAnnotator - recognizes common temporal expressions, such as "teatime"
  • QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities
  • DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model
  • NFLAnnotator - implements entity and relation mention extraction for the NFL domain

How Do I Use This?

You do not have to construct your pipeline from scratch! For the typical NL processors, use StanfordCoreNLP. This pipeline implements the most common functionality needed: tokenization, lemmatization, POS tagging, NER, parsing, and coreference resolution. Read below for how to use this pipeline from the command line, or directly in your Java code.

Using StanfordCoreNLP from the Command Line

The command line for StanfordCoreNLP is:
./bin/stanfordcorenlp.sh
or
java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR_CONFIGURATION_FILE ] -file YOUR_INPUT_FILE
where the following properties are defined (if -props or annotators is not defined, default properties will be loaded via the classpath):
"annotators" - comma-separated list of annotators
    The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, dcoref, nfl
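For example, a minimal configuration file passed with -props might contain just the annotator list (a sketch; adjust the list to the annotators you actually need):

annotators = tokenize, ssplit, pos, lemma, ner, parse, dcoref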
More information is available on the Stanford CoreNLP home page.

The StanfordCoreNLP API
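In code, the typical entry point looks roughly like the following (a minimal sketch; the "annotators" property is standard, but check the annotator list against the release you are using):

import java.util.Properties;

import edu.stanford.nlp.pipeline.Annotation;
import edu.stanford.nlp.pipeline.StanfordCoreNLP;

public class StanfordCoreNlpSketch {
  public static void main(String[] args) {
    // Choose the annotators to run; later annotators depend on earlier ones.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, parse, dcoref");

    // Building the pipeline loads all models, so reuse one instance for many documents.
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // Wrap the raw text in an Annotation and run every annotator over it.
    Annotation document = new Annotation("Stanford University is located in California.");
    pipeline.annotate(document);

    // Results are read back out of the Annotation by key, as in the sample above.
  }
}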

More information is available on the Stanford CoreNLP home page.

@author Jenny Finkel
@author Mihai Surdeanu
@author Steven Bethard
@author David McClosky
Last modified: May 7, 2012

package edu.stanford.nlp.pipeline;



