edu.stanford.nlp.pipeline.package-info

Stanford Parser processes raw text in English, Chinese, German, Arabic, and French, and extracts constituency parse trees.

/**
 * 

Linguistic Annotation Pipeline

* The point of this package is to enable people to quickly and painlessly get complete linguistic annotations of their text. It is designed to be highly flexible and extensible. I will first discuss the organization and functions of the classes, and then I will give some sample code and a run-down of the implemented Annotators.

*

Annotation

* An Annotation is the data structure that holds the results of Annotators. An Annotation is basically a map from keys to bits of annotation, such as the parse, the part-of-speech tags, or the named entity tags. Annotations are designed to operate at the sentence level; however, depending on the Annotators you use, this may not be how you choose to use the package.
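The "map from keys to bits of annotation" idea can be illustrated with a minimal, self-contained sketch. The names `ToyKey`, `ToyTextKey`, `ToyPosTagsKey`, and `ToyAnnotation` are hypothetical and are not classes of this package; the sketch only assumes that keys are classes and that each key class fixes the type of the value stored under it.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical marker interface: each key class declares the value type it maps to.
interface ToyKey<V> {}

// Hypothetical keys, named after the kinds of annotation mentioned above.
final class ToyTextKey implements ToyKey<String> {}
final class ToyPosTagsKey implements ToyKey<List<String>> {}

// A toy stand-in for an annotation: a map from key classes to values.
class ToyAnnotation {
  private final Map<Class<?>, Object> values = new HashMap<>();

  // Stores a value under a key class; the compiler checks that the value type matches the key.
  <V> void set(Class<? extends ToyKey<V>> key, V value) {
    values.put(key, value);
  }

  // Returns the value stored under the key class, or null if none was set.
  @SuppressWarnings("unchecked")
  <V> V get(Class<? extends ToyKey<V>> key) {
    return (V) values.get(key);
  }

  boolean has(Class<?> key) {
    return values.containsKey(key);
  }
}
```

With such a map, `sentence.get(ToyTextKey.class)` is typed as `String` without a cast at the call site, which mirrors the `get(SomeAnnotation.class)` calls in the sample code further down.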

Annotators

* Annotators are the backbone of this package. Annotators are a lot like functions, except that they operate over Annotations instead of Objects. They do things like tokenize, parse, or NER-tag sentences. In the javadocs of your Annotator you should specify what the Annotator assumes already exists (for instance, the NERAnnotator assumes that the sentence has been tokenized) and where to find these annotations (in that example, it would be TextAnnotation.class). They should also specify what they add to the annotation, and where.
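As an illustration of documenting what an annotator assumes and what it adds, here is a simplified sketch that makes both explicit in code. `ToyAnnotator`, `ToyTokenizer`, and `ToyAnnotators` are hypothetical names, the string keys stand in for annotation key classes, and the whitespace tokenizer is a placeholder rather than the tokenizer of this package.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified annotator contract: a step that reads some keys
// from a shared map and writes others.
interface ToyAnnotator {
  List<String> requires();  // keys that must already be present
  List<String> provides();  // keys this step adds
  void annotate(Map<String, Object> annotation);
}

// Splits the raw text on whitespace and stores the result under "tokens".
class ToyTokenizer implements ToyAnnotator {
  public List<String> requires() { return Arrays.asList("text"); }
  public List<String> provides() { return Arrays.asList("tokens"); }

  public void annotate(Map<String, Object> annotation) {
    String text = (String) annotation.get("text");
    annotation.put("tokens", Arrays.asList(text.trim().split("\\s+")));
  }
}

// Helper that enforces the declared precondition before running a step.
class ToyAnnotators {
  static void run(ToyAnnotator annotator, Map<String, Object> annotation) {
    for (String key : annotator.requires()) {
      if (!annotation.containsKey(key)) {
        throw new IllegalStateException("missing required annotation: " + key);
      }
    }
    annotator.annotate(annotation);
  }
}
```

Declaring requirements as data, rather than only in prose, lets a caller fail fast when steps are run in the wrong order.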

AnnotationPipeline

* An AnnotationPipeline is where many Annotators are strung together to form a linguistic annotation pipeline. It is, itself, an Annotator. AnnotationPipelines usually also keep track of how much time they spend annotating and loading, to help users find where the time sinks are. However, the class AnnotationPipeline is not meant to be used as is. It serves as an example of how to build your own pipeline. If you just want to use a typical NLP pipeline, take a look at StanfordCoreNLP (described later in this document).
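Two properties described above, that a pipeline is itself an annotator and that it tracks time per step, can be sketched in a few lines. `SketchStep` and `SketchPipeline` are hypothetical names and this is not the implementation of AnnotationPipeline; a `StringBuilder` stands in for the annotation to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical step interface; it mutates a shared StringBuilder for brevity.
interface SketchStep {
  void annotate(StringBuilder doc);
}

// A pipeline is itself a step, so pipelines can be nested inside other pipelines.
class SketchPipeline implements SketchStep {
  private final List<SketchStep> steps = new ArrayList<>();
  private final List<Long> nanosPerStep = new ArrayList<>();

  void add(SketchStep step) {
    steps.add(step);
    nanosPerStep.add(0L);
  }

  // Runs every step in insertion order and accumulates the time each one takes.
  @Override
  public void annotate(StringBuilder doc) {
    for (int i = 0; i < steps.size(); i++) {
      long start = System.nanoTime();
      steps.get(i).annotate(doc);
      nanosPerStep.set(i, nanosPerStep.get(i) + (System.nanoTime() - start));
    }
  }

  // Accumulated nanoseconds per step, in the order the steps were added.
  List<Long> timings() {
    return new ArrayList<>(nanosPerStep);
  }
}
```

Because `SketchPipeline` implements `SketchStep`, an inner pipeline can be added to an outer one like any other step, and the outer timings then show the inner pipeline as a single entry.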

Sample Usage

* Here is some sample code which illustrates the intended usage of the package:
 * // Imports needed from outside edu.stanford.nlp.pipeline:
 * //   edu.stanford.nlp.ling.CoreAnnotations, edu.stanford.nlp.ling.CoreLabel,
 * //   edu.stanford.nlp.trees.Tree, edu.stanford.nlp.trees.TreeCoreAnnotations,
 * //   edu.stanford.nlp.util.CoreMap
 * public void testPipeline(String text) throws Exception {
 *   // create pipeline; annotators run in the order they are added
 *   AnnotationPipeline pipeline = new AnnotationPipeline();
 *   pipeline.addAnnotator(new TokenizerAnnotator(false, "en"));
 *   pipeline.addAnnotator(new WordsToSentencesAnnotator(false));
 *   pipeline.addAnnotator(new POSTaggerAnnotator(false));
 *   pipeline.addAnnotator(new MorphaAnnotator(false));
 *   pipeline.addAnnotator(new NERCombinerAnnotator(false));
 *   pipeline.addAnnotator(new ParserAnnotator(false, -1));
 *   // create annotation with text
 *   Annotation document = new Annotation(text);
 *   // annotate text with pipeline
 *   pipeline.annotate(document);
 *   // demonstrate typical usage
 *   for (CoreMap sentence : document.get(CoreAnnotations.SentencesAnnotation.class)) {
 *     // get the tree for the sentence
 *     Tree tree = sentence.get(TreeCoreAnnotations.TreeAnnotation.class);
 *     // get the tokens for the sentence and iterate over them
 *     for (CoreLabel token : sentence.get(CoreAnnotations.TokensAnnotation.class)) {
 *       // get token attributes
 *       String tokenText = token.get(CoreAnnotations.TextAnnotation.class);
 *       String tokenPOS = token.get(CoreAnnotations.PartOfSpeechAnnotation.class);
 *       String tokenLemma = token.get(CoreAnnotations.LemmaAnnotation.class);
 *       String tokenNE = token.get(CoreAnnotations.NamedEntityTagAnnotation.class);
 *     }
 *   }
 * }
 * 
*

Existing Annotators

* There already exist Annotators for many common tasks, all of which include default model locations, so they can be used off the shelf. They are:
  • TokenizerAnnotator - tokenizes the text based on language or Tokenizer class specifications
  • WordsToSentencesAnnotator - splits a sequence of words into a sequence of sentences
  • POSTaggerAnnotator - annotates the text with part-of-speech tags
  • MorphaAnnotator - morphological normalizer (generates lemmas)
  • NERClassifierCombiner - combines several NER models
  • TrueCaseAnnotator - detects the true case of words in free text (useful for all upper or lower case text)
  • ParserAnnotator - generates constituent and dependency trees
  • NumberAnnotator - recognizes numerical entities such as numbers, money, times, and dates
  • TimeWordAnnotator - recognizes common temporal expressions, such as "teatime"
  • QuantifiableEntityNormalizingAnnotator - normalizes the content of all numerical entities
  • DeterministicCorefAnnotator - implements anaphora resolution using a deterministic model
  • NFLAnnotator - implements entity and relation mention extraction for the NFL domain

How Do I Use This?

* You do not have to construct your pipeline from scratch! For typical NLP processing, use StanfordCoreNLP. This pipeline implements the most common functionality needed: tokenization, lemmatization, POS tagging, NER, parsing, and coreference resolution. Read below for how to use this pipeline from the command line or directly in your Java code.

Using StanfordCoreNLP from the Command Line

* The command line for StanfordCoreNLP is:
 * ./bin/stanfordcorenlp.sh
 * 
* or
 * java -cp stanford-corenlp-YYYY-MM-DD.jar:stanford-corenlp-YYYY-MM-DD-models.jar:xom.jar:joda-time.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP [ -props YOUR_CONFIGURATION_FILE ] -file YOUR_INPUT_FILE
 * 
* where the following properties are defined (if -props or annotators is not defined, default properties will be loaded via the classpath):
 * 	"annotators" - comma separated list of annotators
 * 		The following annotators are supported: tokenize, ssplit, pos, lemma, ner, truecase, parse, dcoref, nfl
 * 
* More information is available here: Stanford CoreNLP
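The `annotators` property can also be assembled in Java with `java.util.Properties`. The sketch below is a minimal helper under stated assumptions: `AnnotatorProps` and `build` are hypothetical names, and the set of accepted annotator names is copied from the list in this document, so other releases may accept a different set.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Properties;
import java.util.Set;

// Hypothetical helper, not part of CoreNLP.
class AnnotatorProps {
  // Annotator names listed in this document; other releases may differ.
  static final Set<String> SUPPORTED = new LinkedHashSet<>(Arrays.asList(
      "tokenize", "ssplit", "pos", "lemma", "ner", "truecase", "parse", "dcoref", "nfl"));

  // Builds a Properties object whose "annotators" entry is a comma-separated list.
  static Properties build(List<String> annotators) {
    for (String name : annotators) {
      if (!SUPPORTED.contains(name)) {
        throw new IllegalArgumentException("unsupported annotator: " + name);
      }
    }
    Properties props = new Properties();
    props.setProperty("annotators", String.join(",", annotators));
    return props;
  }
}
```

With the CoreNLP jars on the classpath, the resulting object can be passed to `new StanfordCoreNLP(props)`, and a document can then be processed with `pipeline.annotate(new Annotation(text))`.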

The StanfordCoreNLP API

* More information is available here: Stanford CoreNLP
*
* @author Jenny Finkel
* @author Mihai Surdeanu
* @author Steven Bethard
* @author David McClosky
* Last modified: May 7, 2012
*/
package edu.stanford.nlp.pipeline;



