
/*
* LingPipe v. 4.1.0
* Copyright (C) 2003-2011 Alias-i
*
* This program is licensed under the Alias-i Royalty Free License
* Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
* Royalty Free License Version 1 for more details.
*
* You should have received a copy of the Alias-i Royalty Free License
* Version 1 along with this program; if not, visit
* http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
* Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
* +1 (718) 290-9170.
*/
package com.aliasi.classify;
import com.aliasi.tokenizer.TokenizerFactory;
import com.aliasi.lm.LanguageModel;
import com.aliasi.lm.NGramBoundaryLM;
import com.aliasi.lm.TokenizedLM;
import com.aliasi.lm.UniformBoundaryLM;
/**
* A NaiveBayesClassifier provides a trainable naive Bayes
* text classifier, with tokens as features. A classifier is
* constructed from a set of categories and a tokenizer factory. The
* token estimator is a unigram token language model with a uniform
* whitespace model and an optional n-gram character language model
* for smoothing unknown tokens.
*
* Naive Bayes applied to tokenized text results in a so-called
* "bag of words" model where the tokens (words) are assumed
* to be independent of one another:
*
*
* P(tokens|cat) = Π_{i < tokens.length} P(tokens[i]|cat)
*
*
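* For example, here is a minimal usage sketch; the category names,
* training texts, and choice of tokenizer factory are illustrative
* assumptions, not anything fixed by this class:
*
* <pre>{@code
* String[] categories = new String[] { "hot", "cold" };
* TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;
* NaiveBayesClassifier classifier
*     = new NaiveBayesClassifier(categories, tf);
* // train with one example per category
* classifier.handle(new Classified<CharSequence>("super steamy today",
*                                                new Classification("hot")));
* classifier.handle(new Classified<CharSequence>("freezing rain and sleet",
*                                                new Classification("cold")));
* String best = classifier.classify("steamy sleet").bestCategory();
* }</pre>
*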
* This class implements this assumption by plugging unigram token
* language models into a dynamic language model classifier. The
* unigram token language model makes the naive Bayes assumption by
* virtue of having no tokens of context.
*
* The unigram model smooths maximum likelihood token estimates
* with a character-level model. Unfolding the general definition of
* that class to the unigram case yields the model:
*
*
* P(token|cat)
*   = P_{tokenLM(cat)}(token)
*   = λ * count(token,cat) / totalCount(cat)
*     + (1 - λ) * P_{charLM(cat)}(token)
*
*
* where tokenLM(cat) is the token language model defined for the
* specified category and charLM(cat) is the character-level language
* model it uses for smoothing. The unigram token model is based on
* counts count(token,cat) of a token in the category and an overall
* count totalCount(cat) of tokens in the category. The interpolation
* factor λ is computed as per the Witten-Bell model C with
* hyperparameter one:
*
*
* λ = totalCount(cat) / (totalCount(cat) + numTokens(cat))
*
*
* Roughly, the probability mass smoothed from the token model is
* equal to the number of first-sightings of tokens in the training
* data.
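*
* As an illustrative calculation with hypothetical counts, suppose a
* category's training data contains 90 token instances drawn from 10
* distinct tokens:
*
* <pre>{@code
* double totalCount = 90.0;  // token instances seen for the category
* double numTokens = 10.0;   // distinct tokens seen for the category
* double lambda = totalCount / (totalCount + numTokens);  // = 0.9
* // a token seen 9 times in the category is then estimated as
* // 0.9 * (9.0 / 90.0) + 0.1 * charLmEstimate
* }</pre>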
*
* If this character smoothing model is uniform, there are two
* extremes that need to be balanced, especially in cases where there
* is not very much training data per category. If it is initialized
* with the true number of characters, it will return a proper
* uniform character estimate. In practice, this will probably
* underestimate unknown tokens, and thus categories in which they
* are unknown will pay a high penalty. If the token smoothing model
* is initialized with zero as the max number of characters, the
* token backoff will always be zero and thus not contribute to the
* classification scores. This will overestimate unknown tokens for
* classification, with probabilities summing to more than one. In
* practice, it will probably not penalize unknown words in
* categories enough. If the cost is greater than zero, it will be
* linear in the length of the unknown token.
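*
* Both extremes are reachable through the four-argument constructor;
* a sketch, assuming categories {@code cats} and a tokenizer factory
* {@code tf} are already in scope:
*
* <pre>{@code
* // proper uniform estimate over (nearly) all possible characters
* NaiveBayesClassifier full
*     = new NaiveBayesClassifier(cats, tf, 0, Character.MAX_VALUE - 1);
* // minimal one-character model, so unknown tokens are barely penalized
* NaiveBayesClassifier minimal
*     = new NaiveBayesClassifier(cats, tf, 0, 1);
* }</pre>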
*
*
* Another way to smooth unknown tokens is to provide each model at
* least one instance of each token known to every other model, so
* there are no tokens known to one model and not another. But this
* adds an additional smoothing bias to the maximum likelihood
* character estimates, which may or may not be helpful.
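*
* A sketch of such vocabulary seeding; the shared vocabulary set is
* hypothetical, and {@code train(String,CharSequence,int)} is assumed
* to be the training method inherited from {@link DynamicLMClassifier}:
*
* <pre>{@code
* Set<String> vocabulary = new HashSet<String>();
* // ... add every token observed in any category's training data ...
* for (String cat : classifier.categories())
*     for (String token : vocabulary)
*         classifier.train(cat, token, 1);  // one instance each
* }</pre>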
*
*
* The unigram model is constructed with a whitespace model that
* returns a constant zero estimate, {@link UniformBoundaryLM#ZERO_LM},
* and thus contributes no probability mass to estimates.
*
*
* As with the other language model classifiers, the conditional
* category probability ratios are determined with a category
* distribution and inversion:
*
*
* ARGMAX_{cat} P(cat|tokens)
*   = ARGMAX_{cat} P(cat,tokens) / P(tokens)
*   = ARGMAX_{cat} P(cat,tokens)
*   = ARGMAX_{cat} P(tokens|cat) * P(cat)
*
*
* The category probability model P(cat) is taken
* to be a multivariate estimator with an initial count of one
* for each category.
*
* For this class, the tokens are produced by a tokenizer factory.
* This tokenizer factory may normalize tokens to stems, to lower
* case, remove stop words, etc. An extreme example would be to trim
* the bag to a small set of salient words, as picked out by TF/IDF with
* categories as documents.
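*
* For instance, the factory might be a chain of LingPipe's stock
* tokenizer factory filters; a sketch (this particular chain is an
* assumption, not a requirement of this class):
*
* <pre>{@code
* TokenizerFactory tf = IndoEuropeanTokenizerFactory.INSTANCE;
* tf = new LowerCaseTokenizerFactory(tf);      // case normalization
* tf = new EnglishStopTokenizerFactory(tf);    // remove stop words
* tf = new PorterStemmerTokenizerFactory(tf);  // reduce tokens to stems
* }</pre>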
*
*
* Compilation
*
* Instances of this class may be compiled and read back into
* memory in the same way as other instances of {@link
* DynamicLMClassifier} using the {@code compileTo()} method or
* utilities in the class {@code
* com.aliasi.util.AbstractExternalizable}.
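*
* A sketch of compiling a trained classifier to memory and reading
* it back, using {@code AbstractExternalizable.compile()}:
*
* <pre>{@code
* // throws IOException, ClassNotFoundException
* JointClassifier<CharSequence> compiled
*     = (JointClassifier<CharSequence>)
*       AbstractExternalizable.compile(classifier);
* }</pre>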
*
*
* Deserializing
*
* After compilation, deserialized instances of naive
* Bayes classifiers should be cast to the interface {@code
* JointClassifier}, though they may also be cast to
* {@code LMClassifier}; the only
* advantage to the latter cast is that you can still retrieve the
* multivariate estimator over categories as well as the underlying
* language model for each category. These will be compiled instances.
*
* Thread Safety
*
* Like almost all of LingPipe's statistical models, naive Bayes
* classifiers are thread safe under read/write synchronization.
* That is, any number of classification jobs may be performed
* concurrently, but any parameter setting or training must be
* done exclusively.
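*
* For example, read/write synchronization may be imposed externally
* with a {@code java.util.concurrent.locks.ReadWriteLock}; a sketch,
* not part of this class's API:
*
* <pre>{@code
* ReadWriteLock lock = new ReentrantReadWriteLock();
*
* // any number of concurrent classifications under the read lock
* lock.readLock().lock();
* try {
*     classifier.classify("steamy sleet");
* } finally {
*     lock.readLock().unlock();
* }
*
* // training requires the exclusive write lock
* lock.writeLock().lock();
* try {
*     classifier.handle(new Classified<CharSequence>("freezing rain",
*                                                    new Classification("cold")));
* } finally {
*     lock.writeLock().unlock();
* }
* }</pre>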
*
* @author Bob Carpenter
* @version 3.0
* @since LingPipe2.0
*/
public class NaiveBayesClassifier
    extends DynamicLMClassifier {
/**
* Construct a naive Bayes classifier with the specified
* categories and tokenizer factory.
*
* The character backoff models are assumed to be uniform
* and there is no limit on the number of observed characters
* other than {@link Character#MAX_VALUE}.
*
* @param categories Categories into which to classify text.
* @param tokenizerFactory Text tokenizer.
* @throws IllegalArgumentException If there are not at least two
* categories.
*/
    public NaiveBayesClassifier(String[] categories,
                                TokenizerFactory tokenizerFactory) {
        this(categories,tokenizerFactory,0);
    }
/**
* Construct a naive Bayes classifier with the specified
* categories, tokenizer factory and level of character n-gram for
* smoothing token estimates. If the character n-gram is less
* than one, a uniform model will be used.
*
*
* There is no limit on the number of observed characters
* other than {@link Character#MAX_VALUE}.
*
* @param categories Categories into which to classify text.
* @param tokenizerFactory Text tokenizer.
* @param charSmoothingNGram Order of character n-gram used to
* smooth token estimates.
* @throws IllegalArgumentException If there are not at least two
* categories.
*/
    public NaiveBayesClassifier(String[] categories,
                                TokenizerFactory tokenizerFactory,
                                int charSmoothingNGram) {
        this(categories,tokenizerFactory,
             charSmoothingNGram,Character.MAX_VALUE-1);
    }
/**
* Construct a naive Bayes classifier with the specified
* categories, tokenizer factory and level of character n-gram for
* smoothing token estimates, along with a specification of the
* total number of characters in test and training instances. If
* the character n-gram is less than one, a uniform model will be
* used.
*
*
* As noted in the class documentation above, setting the
* max observed characters parameter to one effectively eliminates
* estimates of the string of an unknown token.
*
* @param categories Categories into which to classify text.
* @param tokenizerFactory Text tokenizer.
* @param charSmoothingNGram Order of character n-gram used to
* smooth token estimates.
* @param maxObservedChars The maximum number of characters found
* in the text of training and test sets.
* @throws IllegalArgumentException If there are not at least two
* categories or if the number of observed characters is less than 1
* or more than the total number of characters.
*/
    public NaiveBayesClassifier(String[] categories,
                                TokenizerFactory tokenizerFactory,
                                int charSmoothingNGram,
                                int maxObservedChars) {
        super(categories,
              naiveBayesLMs(categories.length,
                            tokenizerFactory,
                            charSmoothingNGram,
                            maxObservedChars));
    }
    // construct the per-category unigram token LMs, each smoothed by
    // a uniform or n-gram character boundary model
    private static TokenizedLM[]
        naiveBayesLMs(int length, TokenizerFactory tokenizerFactory,
                      int charSmoothingNGram, int maxObservedChars) {

        TokenizedLM[] lms = new TokenizedLM[length];
        for (int i = 0; i < lms.length; ++i) {
            LanguageModel.Sequence charLM;
            if (charSmoothingNGram < 1)
                charLM = new UniformBoundaryLM(maxObservedChars);
            else
                charLM = new NGramBoundaryLM(charSmoothingNGram,
                                             maxObservedChars);
            lms[i]
                = new TokenizedLM(tokenizerFactory,
                                  1,                         // unigram token model
                                  charLM,                    // unknown-token smoothing
                                  UniformBoundaryLM.ZERO_LM, // zero whitespace model
                                  1);                        // Witten-Bell hyperparameter
        }
        return lms;
    }
}