edu.stanford.nlp.tagger.maxent.package-info Maven / Gradle / Ivy
Show all versions of stanford-corenlp Show documentation
/**
*
* A Maximum Entropy Part-of-Speech Tagger. It can run either a Conditional
* Markov Model (CMM) aka Maximum Entropy Markov Model (MEMM) tagger or a
* cyclic dependency network tagger.
*
*
* If you are only interested in using one of the trained taggers
* included in the distribution, either from the commandline or via the
* Java API, look at the documentation of the class
* {@link edu.stanford.nlp.tagger.maxent.MaxentTagger}.
*
* Or, if you are interested in training a tagger from data using
* some of the built-in architecture options (CMM or bi-directional
* dependency network), also look at the documentation for
* {@link edu.stanford.nlp.tagger.maxent.MaxentTagger}.
*
* The rest of this document is for more complex situations where you
* want to define the features/architecture of your own tagger, which
* requires delving into the code.
*
* The pre-defined features are for CMMs and bi-directional dependency
* networks. The local models are log-linear models using features
* specified via feature templates.
*
*
* Kinds of Templates
*
* There are two kinds of templates: ones for rare words,
* and ones for common words. For a context centered at a common word, only common
* word features are active. For a context centered at a rare word, both common
* and rare word features are active. Which words are considered common and which
* rare is determined by the parameter GlobalHolder.rareThreshold. For example,
* a threshold of five means that words occurring five times or less are considered
* rare.
*
* The feature templates represent conditions on the words and tags
* surrounding the current position and also the target tag. For example
* <t_-1,t_0>
is a common word
* feature template. It is
* instantiated for various values of the previous and current tag. A feature
* formed by instantiating this template will look for example like this:
* <t_-1=DT,t_0=NN>
* and will have value 1 for a history h and tag t
* iff the condition is satisfied. Every feature
* template includes a specification of the current tag. It is not possible at the
* moment to include features that are true if the current tag is one of several
* possible, e.g. NN or NNS or NNP.
*
* To reduce the number
* of features, cutoffs on the number of times a feature is active are introduced.
* The cutoff for common word features is GlobalHolder.threshold, and
* the cutoff for rare word features is GlobalHolder.thresholdRare.
* The thresholds work like this: the part of the feature that does not include
* the tag has to be active in the training set at least cutoff+1 times, and the
* complete feature has to be active at least once in order for the feature to be
* included in the model. (Note, the cutoff for the current word feature is set to
* 2 independent of threshold settings; cf. method
* {@link edu.stanford.nlp.tagger.maxent.TaggerExperiments#populated(int, int)}.
*
*
* Training a Tagger
*
* In order to train a tagger, we need to specify the feature
* templates to be used, change the count cutoffs if we want, change the default
* parameter estimation method if we want, perhaps hand-specify closed
* class POS tags, and then train given tagged text.
*
* Specifying Feature Templates
*
* Feature templates inherit from the class {@link edu.stanford.nlp.tagger.maxent.Extractor}.
* The main job of an Extractor is to extract the value it is interested in from a history.
* Each instantiating feature for a given template will be true for a specific
* value extracted from a history and a specific target tag.
*
* For example, this is a common word extractor that extracts
* the current and next word.
*
*
* /**
* * This extractor extracts the current and the next word in conjunction.
* *\/
* class ExtractorCWordNextWord extends Extractor {
*
* private final static String excl="!";
*
* public ExtractorCWordNextWord() {}
*
* String extract(History h, PairsHolder pH) {
* String s = pH.get(h, 0, false) + excl + pH.get(h, 1, false);
* return s;
* }
*
* }
*
*
* The method extract(History h) is defined in the base class
* Extractor as:
* *
*
* String extract(History h) {
* return extract(h, GlobalHolder.pairs);
* }
*
*
* The PairsHolder contains an array of words and the tags. It
* has a get method that can be used to extract things from the history. In
* GlobalHolder.pairs , the whole training data is stored. String
* GlobalHolder.pairs.get(History
* h,int position, boolean isTag) , will return the tag or
* word (depending on isTag), at position position relative to the
* history h.
*
* Using this PairsHolder, we can extract features from the
* whole sentence including the current word. The History object is basically a
* specification of the start of the sentence, the current word, and the end of
* the sentence.
*
* In an extractor we can also specify for which tags to
* instantiate the template. The method
* boolean precondition(String tag)
is by default true, meaning
* that a feature can be created for every tag. Sometimes we would like to
* restrict that, and say that features should be created for only the VB and VBP
* tags, for example. In this case the method precondition has to be redefined to
* return false for all other tags.
*
* The extractors for common word features have to be placed in
* the static array ExtractorFrames.eFrames. The present state of this array for
* the best tagger is:
*
* public static Extractor[] eFrames={cWord,prevWord,nextWord,prevTag,nextTag,
* prevTwoTags,nextTwoTags,prevNextTag,prevTagWord,nextTagWord,cWordPrevWord,
* cWordNextWord};
*
*
* The extractors for rare word features commonly inherit from
* RareExtractor, which inherits from Extractor. RareExtractor provides
* some nice static methods for manipulating
* strings, such as seeing whether they contain numbers, etc. The rare word
* extractors have to be placed in the static array ExtractorFramesRare.eFrames.
* For example, for a good English tagger, this array might be:
*
*
* public static Extractor[] eFrames={cWordUppCase,cWordNumber,
* cWordDash,cWordSuff1,cWordSuff2,cWordSuff3,cWordSuff4,
* cAllCap,cMidSentence,cWordStartUCase,cWordMidUCase,
* cWordPref1,cWordPref2,cWordPref3,cWordPref4,
* new ExtractorCWordPref(5),new ExtractorCWordPref(6),
* new ExtractorCWordPref(7), new ExtractorCWordPref(8),
* new ExtractorCWordPref(9), new ExtractorCWordPref(10),
* new ExtractorCWordSuff(5),new ExtractorCWordSuff(6),
* new ExtractorCWordSuff(7),new ExtractorCWordSuff(8),
* new ExtractorCWordSuff(9), new ExtractorCWordSuff(10),
* cLetterDigitDash, cCompany,cAllCapitalized,cUpperDigitDash};
*
*
* At present, many of the extractor and rare extractor combinations can
* be flexibly set from a properties file by suitable specifications of the
* arch
option, whereas others require changing the code.
*
* Specifying closed-class POS tags
*
*
* By default, all POS tags are assumed to be open classes. In many cases,
* it is useful to specify POS tags which are closed class, and can only be
* applied to words seen in the training data (rather than being possible
* tags for new words seen at runtime). These closed class tags are
* specified for a language (where a "language" is really a
* (language,tag-set) pair: a different system of tagging is a new
* language). You do this by specifying the language in the properties
* file, and specifying the closed class tags for that language in
* {@link edu.stanford.nlp.tagger.maxent.TTags}.
*/
package edu.stanford.nlp.tagger.maxent;