cc.factorie.tutorial.UsersGuide740PosTagging.scala


FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.


/* Copyright (C) 2008-2016 University of Massachusetts Amherst.
   This file is part of "FACTORIE" (Factor graphs, Imperative, Extensible)
   http://factorie.cs.umass.edu, http://github.com/factorie
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. */
/*& Part-of-speech Tagging */
/*&

FACTORIE Part-of-Speech Tagger
==========================

## About ##

A part-of-speech tagger (POS tagger) is a program that takes human-language text as input and attempts to automatically determine the grammatical part-of-speech tag (noun, verb, etc.) of each token in the text. We achieve this by training a simple probabilistic model that predicts the part of speech of a token from its form and its context in the sentence. Our tagger uses a fast and accurate forward model based heavily on the one described in the following paper:

> Jinho D. Choi and Martha Palmer. 2012. [Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection](http://aclweb.org/anthology//P/P12/P12-2071.pdf). In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL12), pp. 363--367.

Unlike the tagger described in that paper, ours does not perform dynamic model selection. It is trained using regularized AdaGrad; the l1 and l2 regularization weights, learning rate, delta, feature count cutoff, and number of training iterations are hyperparameters tuned via grid search on a development set.

We provide two pre-trained models, one trained on the Ontonotes English corpus and one trained on Penn Treebank WSJ sections 0-18. The default tagger in FACTORIE, called OntonotesForwardPosTagger in the app.nlp.pos package, is the model trained on Ontonotes, a cross-domain dataset annotated with relatively high inter-annotator agreement.

Our tagger is both fast and accurate, processing more than 20K tokens/second and achieving 97.22% accuracy on WSJ sections 22-24.

## How to use ##

### From the command line ###

The easiest way to get started is by using our tagger through the FACTORIE NLP command line tool. Check out the [quick start guide](http://factorie.cs.umass.edu/usersguide/UsersGuide200QuickStart.html).

### As a Maven dependency ###

To use our tagger as a dependency in your Maven project, you will need to add the IESL repository:

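```xml
<repositories>
  ...
  <repository>
    <id>IESL Releases</id>
    <name>IESL Repo</name>
    <url>https://dev-iesl.cs.umass.edu/nexus/content/groups/public</url>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
    <releases>
      <enabled>true</enabled>
    </releases>
  </repository>
</repositories>
```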

Then include the dependencies for FACTORIE and our pre-trained POS models, if you choose not to train your own:

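```xml
<dependencies>
  ...
  <dependency>
    <groupId>cc.factorie</groupId>
    <artifactId>factorie</artifactId>
    <version>1.0.0-RC1</version>
  </dependency>

  <dependency>
    <groupId>cc.factorie.app.nlp</groupId>
    <artifactId>all-models</artifactId>
    <version>1.0-RC8</version>
  </dependency>
</dependencies>
```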

## Training your own models ##

You can also train your own POS tagger using FACTORIE. The following is an example script to train and test your own ForwardPosTagger on labeled data in whitespace-separated one-word-per-line format:

```bash
#!/bin/bash

memory=2g
fac_jar="/path/to/factorie-jar-with-dependencies-1.0-SNAPSHOT.jar"
modelname="ForwardPosTagger.factorie"

trainfile="--train-file=/path/to/training/data"
testfile="--test-file=/path/to/test/data"
save="--save-model=true"
model="--model=$modelname"

# Quote the variables so that paths containing spaces are passed as single arguments.
java -classpath "$fac_jar" "-Xmx$memory" cc.factorie.app.nlp.pos.ForwardPosTrainer --owpl "$trainfile" "$testfile" "$save" "$model"
```

This trains a model on the given training data, tests its accuracy on the given test data, and saves the trained model to a file called “ForwardPosTagger.factorie”. If you are training on a lot of data, you may need to increase the amount of memory allocated to the JVM. For example, training on the Ontonotes corpus requires about 16GB.
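For reference, a tiny input file in one-word-per-line format can be sketched as follows. The layout used here (token in the first column, Penn Treebank tag in the second, a blank line between sentences) and the file name `sample.owpl` are assumptions made for illustration, not details taken from this guide; check the data loader used by ForwardPosTrainer before relying on them.

```bash
#!/bin/bash
# Assumed OWPL layout: "<token> <tag>" per line, blank line between sentences.
# The file name sample.owpl is hypothetical.
sample="sample.owpl"

cat > "$sample" <<'EOF'
The DT
cat NN
sleeps VBZ
. .

Dogs NNS
bark VBP
. .
EOF

# Count non-blank lines (tokens) and lines that do not have exactly two fields.
tokens=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$sample")
bad=$(awk 'NF > 0 && NF != 2 { n++ } END { print n + 0 }' "$sample")
echo "tokens=$tokens malformed=$bad"
```

A quick check like this catches stray columns or missing tags before a long training run starts.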

The above will learn the parameters of the model using regularized AdaGrad, log loss and some default hyperparameters. You can change the default values, and specify entire directories or lists of training and test files, using the command line parameters listed below:





| Parameter | Default | Description |
|---|---|---|
| model | | Filename for the model (saving a trained model or reading a saved model) |
| save-model | false | Whether to save the trained model |
| test-dir | | Directory containing test files |
| train-dir | | Directory containing training files |
| test-files | | Comma-separated list of testing files |
| train-files | | Comma-separated list of training files |
| owpl | true | Whether the data are in OWPL format or otherwise (Ontonotes) |
| l1 | 0.000001 | l1 regularization weight for AdaGradRDA |
| l2 | 0.00001 | l2 regularization weight for AdaGradRDA |
| rate | 1.0 | Learning rate |
| delta | 0.1 | Learning rate decay |
| cutoff | 2 | Discard features less frequent than this before training |
| update-examples | true | Whether to update examples in later iterations during training |
| use-hinge-loss | false | Whether to use hinge loss or log loss during training |
| num-iterations | 5 | Number of passes over the data for training |



 
You may want to perform hyperparameter optimization to find the right hyperparameters for your model. The ForwardPosOptimizer object contains good default ranges for optimizing the l1, l2, rate, delta, cutoff and number of training iterations for the POS tagger; it spawns 200 jobs, each with a 16GB heap allocated to the JVM. For a more detailed explanation of FACTORIE’s hyperparameter optimization capabilities, see [the users guide](http://factorie.cs.umass.edu/usersguide/UsersGuide490ParallelismAndHyperparameters.html).
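A single hyperparameter setting can also be tried by hand before launching a full search. The sketch below assembles a ForwardPosTrainer command that overrides several parameters from the table, using the `--name=value` form of the training script. The paths are placeholders, the values are arbitrary examples rather than recommendations, and the `RUN` switch is a convention invented for this sketch; whether every combination of options is accepted has not been verified here.

```bash
#!/bin/bash
# Placeholder paths, as in the training script; replace with your own.
fac_jar="/path/to/factorie-jar-with-dependencies-1.0-SNAPSHOT.jar"
memory=16g   # the guide cites about 16GB for Ontonotes-sized data

# Options from the parameter table. The values are arbitrary examples.
opts=(
  --owpl
  --train-dir=/path/to/training/dir
  --test-dir=/path/to/test/dir
  --save-model=true
  --model=ForwardPosTagger.factorie
  --l1=0.00001
  --l2=0.0001
  --rate=0.5
  --cutoff=3
  --num-iterations=10
)

cmd=(java -classpath "$fac_jar" "-Xmx$memory"
     cc.factorie.app.nlp.pos.ForwardPosTrainer "${opts[@]}")

# Print the command for review; it is executed only when RUN=1 is set.
echo "${cmd[*]}"
if [ "${RUN:-0}" = "1" ]; then
  "${cmd[@]}"
fi
```

Keeping the options in an array makes it easy to comment out or change one value per run while leaving the rest of the command untouched.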


*/
