
cc.factorie.tutorial.UsersGuide740PosTagging.scala

FACTORIE is a toolkit for deployable probabilistic modeling, implemented as a software library in Scala. It provides its users with a succinct language for creating relational factor graphs, estimating parameters and performing inference.

/* Copyright (C) 2008-2016 University of Massachusetts Amherst.
   This file is part of "FACTORIE" (Factor graphs, Imperative, Extensible)
   http://factorie.cs.umass.edu, http://github.com/factorie
   Licensed under the Apache License, Version 2.0 (the "License");
   you may not use this file except in compliance with the License.
   You may obtain a copy of the License at
    http://www.apache.org/licenses/LICENSE-2.0
   Unless required by applicable law or agreed to in writing, software
   distributed under the License is distributed on an "AS IS" BASIS,
   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
   See the License for the specific language governing permissions and
   limitations under the License. */
/*& Part-of-speech Tagging */
/*&

FACTORIE Part-of-Speech Tagger
==========================

## About ##

A part-of-speech tagger (POS tagger) is a program that takes human-language text as input and attempts to automatically determine the part-of-speech tag (noun, verb, etc.) of each token in the text. We achieve this by training a simple probabilistic model that predicts the part of speech of each token from its form and its context in the sentence. Our tagger uses a fast and accurate forward model based heavily on the one described in the following paper:

> Jinho D. Choi and Martha Palmer. 2012. [Fast and Robust Part-of-Speech Tagging Using Dynamic Model Selection](http://aclweb.org/anthology//P/P12/P12-2071.pdf). In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (ACL12), pp. 363--367.

Unlike the tagger described in that paper, ours does not perform dynamic model selection. It is trained using regularized AdaGrad, with the l1, l2, learning rate, delta, feature count cutoff, and number of training iterations hyperparameters tuned via grid search on a development set.

We provide two pre-trained models, one trained on the Ontonotes English corpus and one trained on Penn Treebank WSJ sections 0-18. The default tagger in FACTORIE, OntonotesForwardPosTagger in the app.nlp.pos package, is the model trained on Ontonotes, a cross-domain dataset with relatively high inter-annotator agreement.

Our tagger is both fast and accurate, processing more than 20K tokens/second and achieving 97.22% accuracy on WSJ sections 22-24.

## How to use ##

### From the command line ###

The easiest way to get started is by using our tagger through the FACTORIE NLP command line tool. Check out the [quick start guide](http://factorie.cs.umass.edu/usersguide/UsersGuide200QuickStart.html).

### As a Maven dependency ###

To use our tagger as a dependency in your Maven project, you will need to add the IESL repository:

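```xml
<repositories>
  ...
  <repository>
    <id>IESL Releases</id>
    <name>IESL Repo</name>
    <url>https://dev-iesl.cs.umass.edu/nexus/content/groups/public</url>
    <snapshots>
      <enabled>false</enabled>
    </snapshots>
    <releases>
      <enabled>true</enabled>
    </releases>
  </repository>
</repositories>
```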

Then add the dependencies for FACTORIE and, if you do not plan to train your own models, for our pre-trained POS models:

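```xml
<dependencies>
  ...
  <dependency>
    <groupId>cc.factorie</groupId>
    <artifactId>factorie</artifactId>
    <version>1.0.0-RC1</version>
  </dependency>

  <dependency>
    <groupId>cc.factorie.app.nlp</groupId>
    <artifactId>all-models</artifactId>
    <version>1.0-RC8</version>
  </dependency>
</dependencies>
```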

## Training your own models ##

You can also train your own POS tagger using FACTORIE. The following is an example script to train and test your own ForwardPosTagger on labeled data in whitespace-separated one-word-per-line format:

```bash
#!/bin/bash

memory=2g
fac_jar="/path/to/factorie-jar-with-dependencies-1.0-SNAPSHOT.jar"
modelname="ForwardPosTagger.factorie"

trainfile="--train-file=/path/to/training/data"
testfile="--test-file=/path/to/test/data"
save="--save-model=true"
model="--model=$modelname"

java -classpath $fac_jar -Xmx$memory cc.factorie.app.nlp.pos.ForwardPosTrainer --owpl $trainfile $testfile $save $model
```

This will train a model on the given training data, test its accuracy on the given test data, and save the trained model to a file called “ForwardPosTagger.factorie”. If you are training on a lot of data, you may need to increase the amount of memory allocated to the JVM. For example, training on the Ontonotes corpus requires about 16GB.
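
Before training, it can help to sanity-check the input data. The sketch below writes a tiny sample in a whitespace-separated one-word-per-line layout and counts tokens and malformed lines with `awk`. The column order (token first, tag second), the blank line between sentences, and the sample sentences themselves are assumptions made for illustration, so check them against your own data and the loader you use.

```bash
#!/bin/bash
# Hypothetical sample data. The column layout (token, then tag) and the blank
# line between sentences are assumptions, not taken from the FACTORIE docs.
sample="$(mktemp)"
printf '%s\n' \
  'The DT' \
  'dog NN' \
  'barks VBZ' \
  '. .' \
  '' \
  'It PRP' \
  'runs VBZ' \
  '. .' > "$sample"

# Count tokens (non-blank lines) and lines that do not have exactly two fields.
tokens=$(awk 'NF > 0 { n++ } END { print n + 0 }' "$sample")
malformed=$(awk 'NF > 0 && NF != 2 { n++ } END { print n + 0 }' "$sample")

echo "tokens=$tokens malformed=$malformed"   # prints: tokens=7 malformed=0
rm -f "$sample"
```

To check real data, point the two `awk` commands at your training file instead of the temporary sample.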

The above will learn the parameters of the model using regularized AdaGrad, log loss and some default hyperparameters. You can change the default values, as well as specify entire directories of training data, etc. using the command line parameters listed below:

| Parameter | Default | Description |
|---|---|---|
| model | | Filename for the model (saving a trained model or reading a saved model) |
| save-model | false | Whether to save the trained model |
| test-dir | | Directory containing test files |
| train-dir | | Directory containing training files |
| test-files | | Comma-separated list of testing files |
| train-files | | Comma-separated list of training files |
| owpl | true | Whether the data are in OWPL format or otherwise (Ontonotes) |
| l1 | 0.000001 | l1 regularization weight for AdaGradRDA |
| l2 | 0.00001 | l2 regularization weight for AdaGradRDA |
| rate | 1.0 | Learning rate |
| delta | 0.1 | Learning rate decay |
| cutoff | 2 | Discard features less frequent than this before training |
| update-examples | true | Whether to update examples in later iterations during training |
| use-hinge-loss | false | Whether to use hinge loss or log loss during training |
| num-iterations | 5 | Number of passes over the data for training |

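
As a rough sketch of how the parameters listed above might be combined, the script below passes non-default values using the same `--name=value` form as the training script earlier in this section. The directory paths are placeholders and the hyperparameter values are illustrative, not tuned recommendations. Whether every combination of options is accepted together has not been verified here.

```bash
#!/bin/bash

memory=2g
fac_jar="/path/to/factorie-jar-with-dependencies-1.0-SNAPSHOT.jar"

# Data and model options; the directory paths are placeholders.
data="--owpl --train-dir=/path/to/training/dir --test-dir=/path/to/test/dir"
model="--model=ForwardPosTagger.factorie --save-model=true"

# Illustrative, untuned overrides of the default hyperparameters.
hyper="--l1=0.00001 --l2=0.0001 --cutoff=3 --num-iterations=10"

cmd="java -classpath $fac_jar -Xmx$memory cc.factorie.app.nlp.pos.ForwardPosTrainer $data $model $hyper"

# Print the command for inspection; run it only if the jar actually exists.
# The unquoted expansion relies on the paths containing no spaces.
echo "$cmd"
if [ -f "$fac_jar" ]; then
  $cmd
fi
```

Any parameter left out of the command keeps its default value.
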
You may want to perform hyperparameter optimization to find good hyperparameters for your model. The ForwardPosOptimizer object contains good default ranges for optimizing the l1, l2, rate, delta, cutoff, and number of training iterations for the POS tagger; it spawns 200 jobs, each with a 16GB heap allocated to the JVM. For a more detailed explanation of FACTORIE’s hyperparameter optimization capabilities, see [the users guide](http://factorie.cs.umass.edu/usersguide/UsersGuide490ParallelismAndHyperparameters.html). */



