All Downloads are FREE. Search and download functionalities are using the official Maven repository.

com.aliasi.chunk.EstimatorTrie Maven / Gradle / Ivy

Go to download

This is the original Lingpipe: http://alias-i.com/lingpipe/web/download.html There were not made any changes to the source code.

There is a newer version: 4.1.2-JL1.0
Show newest version
/*
 * LingPipe v. 4.1.0
 * Copyright (C) 2003-2011 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 * 
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.chunk;

import java.io.IOException;
import java.io.ObjectInput;

/**
 * An estimator trie is a data structure for compactly representing
 * the contexts and outcomes in an estimator at varying levels
 * of detail to allow for the computation of smoothed outcomes at
 * runtime.
 *
 * 

The data file format for serialization of an estimator trie, as * read by the constructor {@link #EstimatorTrie(ObjectInput)} and * as written, for example, by {@link * com.aliasi.ne.TrainableEstimator#writeTo(java.io.DataOutputStream)}. * *

*

* * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *
CountTypeVariableContent
1intnumNodesNumber of nodes in trie
numNodesintnodeSymbolSymbol identifier for node
intnodeFirstOutcomeIndexIndex in outcome array of first outcome
intnodeFirstChildIndex in this array of first child node
floatnodeLogOneMinusLambdalog (1-lambda) for this node
intnodeBackoffIndex in this array of backoff node, or -1 if none
1intnumOutcomesNumber of outcomes in trie
numOutcomesintoutcomeSymbolSymbol identifier for outcome
intoutcomeLogEstimatelog probability estimate for this outcome
*
* * The parallel arrays are encoded one index at a time rather than one * whole array at a time. All values are stored in local variables * with the given names, with the context node and outcome array * entries being indexed by the node number and outcome number * respectively. *

* *

The parallel arrays are used for lookup as follows. A context * for estimating outcomes consists of a sequence of events, * represented here as symbols. The events to be kept the longest in * smoothing are at the front of the sequence and are looked up first. * The first entry (index 0) in the nodes array * constitutes the root node or null context. It may or may not have * any outcomes on it. The algorithm for evaluation follows the * sequence of context events, stopping only when extending the * context is not possible. That node represents the most specific * context for which estimates can be made, and may not actually use * all of the contextual events if they haven't been seen, or seen * frequently enough, during training. From this most specific node, * an estimate is made for the outcome event in question. The * outcomes for the most specific context node are searched for the * outcome event being estimated. If it is found, the log estimate on * that outcome is returned. If it is not found, the backoff pointer * from the most specific node is followed, and the evaluation is done * again from the more general node, adding the log * (1-lambda) value as required by the smoothing model. If a * backoff is attempted from a most general node without a backoff * node, an estimate of log 0.0 = Double.NaN is returned, * or in the case of estimation with a specified uniform distribution, * the log of the uniform estimate is added to the log * (1-lambda) and returned. *

* *

Lookup in the arrays is carried out by binary search over the * arrays of symbol identifiers. For searching children in thre trie, * nodeFirstChild[i] is the index of the first child node * of node i, and nodeFirstChild[i+1] is the * index of the first child of node i+1, and so on. * Therefore, the binary search for children of i is * carried out over indices j such that * nodeFirstChild[i] <= j < nodeFirstChild[i+1]. This requires * the node symbol ids to be in sorted order in these ranges. The * outcomes are searched similarly, with outcomes for node * i searched on indices j such that * nodeFirstOutcomeIndex[i] <= j < * nodeFirstOutcomeIndex[i+1]. *

* *

The methods provided for estimation require the callers of * these classes to walk the trie. This avoids unnecessary * intermediate object create to call the trie, and allows the tries * themselves to be flexible with respect to their geometry. * Given a context node (identified by index) and a symbol (by integer * identifier) for an event, the (index of a) more specific context * node that includes the symbol can be retrieved if it exists. Given * a particular context node (identified by index), and a symbol * representing an outcome (identified by an integer from a symbol * table), the log estimate from the given context is returned by * {@link #estimateFromNode(int,int)}. If the outcome doesn't exist * at any level of backoff, Double.NaN is returned. The * method {@link #estimateFromNodeUniform(int,int,double)} includes a * double value to return instead of Double.NaN as a * uniform estimate.

* * @author Bob Carpenter * @version 1.0 * @since LingPipe1.0 * @see OutcomeCounter * @see Node */ class EstimatorTrie { // Parallel arrays for trie nodes. /** Number of nodes in the trie, indexing the parallel trie * arrays. */ private final int _numNodes; /** Elements hold the identifiers for the symbol at the * element. */ private final int[] _nodeSymbol; /** Elements hold the index in the outcome arrays of the first * outcome. One longer than other arrays, with final element * holding the length for symmetry in all the bounded loops. */ private final int[] _nodeFirstOutcome; /** Elements hold the index in these arrays of the node that * is the first child of this node in the trie structure. One * longer than other arrays, with final element holding the * length for symmetry in all the bounded loops. */ private final int[] _nodeFirstChild; /** Elements hold the smoothing constant, log 1 - * lambda(context). */ private final float[] _nodeLogOneMinusLambda; /** Elements hold the index of the backoff node to check next * for the outcome saught. Will hold -1 if there * is no backoff node for the node. */ private final int[] _nodeBackoff; // Parallel arrays for outcomes given contexts. /** Number of outcomes total, indexing the parallel outcome * arrays. */ private final int _numOutcomes; /** Elements hold the identifier of the symbol represented by * the outcome. */ private final int[] _outcomeSymbol; /** Elements hold the log estimate of the outcome. */ private final float[] _outcomeLogEstimate; /** Create an estimator trie from a data input stream. Does * not close the stream so that an estimator trie may be read * out of a portion of a stream. * * @param in Data input stream from which to read this estimator's data. * @throws IOException If there is an exception reading from the * underlying stream. */ public EstimatorTrie(ObjectInput in) throws IOException { _numNodes = in.readInt(); _nodeSymbol = new int[_numNodes]; _nodeFirstOutcome = new int[_numNodes+1]; _nodeFirstChild = new int[_numNodes+1]; _nodeLogOneMinusLambda = new float[_numNodes]; _nodeBackoff = new int[_numNodes]; for (int i = 0; i < _numNodes; ++i) { _nodeSymbol[i] = in.readInt(); _nodeFirstOutcome[i] = in.readInt(); _nodeFirstChild[i] = in.readInt(); _nodeLogOneMinusLambda[i] = in.readFloat(); _nodeBackoff[i] = in.readInt(); } _nodeFirstChild[_numNodes] = _numNodes; // boundary _numOutcomes = in.readInt(); _nodeFirstOutcome[_numNodes] = _numOutcomes; // boundary _outcomeSymbol = new int[_numOutcomes]; _outcomeLogEstimate = new float[_numOutcomes]; for (int i = 0; i < _numOutcomes; ++i) { _outcomeSymbol[i] = in.readInt(); _outcomeLogEstimate[i] = in.readFloat(); } } /** * Returns the log estimate of the given outcome symbol from * the specified node, or {@link java.lang.Double#NaN} if * there is none. * * @param symbolID Outcome whose estimate is sought. * @param nodeIndex Index of node representing context of * estimate. * @return Log estimate of outcome symbol given node index or * Double.NaN if impossible. */ public double estimateFromNode(int symbolID, int nodeIndex) { if (symbolID < 0) return Double.NaN; double backoffAccumulator = 0.0; for(int currentNodeIndex = nodeIndex; currentNodeIndex >= 0; currentNodeIndex = _nodeBackoff[currentNodeIndex]) { int low = _nodeFirstOutcome[currentNodeIndex]; int high = _nodeFirstOutcome[currentNodeIndex+1]-1; while (low <= high) { int mid = (high + low)/2; if (_outcomeSymbol[mid] == symbolID) return backoffAccumulator + _outcomeLogEstimate[mid]; else if (_outcomeSymbol[mid] < symbolID) low = (low == mid ? mid+1 : mid); else high = (high == mid ? mid-1 : mid); } backoffAccumulator += _nodeLogOneMinusLambda[currentNodeIndex]; } return Double.NaN; // no more backoff nodes available } /** * Returns the log estimate of the given outcome symbol from * the specified node, or {@link java.lang.Double#NaN} if * there is none. Just like {@link #estimateFromNode} except * that final backoff will be to a uniform estimate that is * supplied as an argument. * * @param symbolID Outcome whose estimate is sought. * @param nodeIndex Index of node representing context of estimate. * @return Log estimate of outcome symbol given node index or * Double.NaN if impossible. */ public double estimateFromNodeUniform(int symbolID, int nodeIndex, double uniformEstimate) { if (symbolID < 0) return Double.NaN; double backoffAccumulator = 0.0; for (int currentNodeIndex = nodeIndex; currentNodeIndex >= 0; currentNodeIndex = _nodeBackoff[currentNodeIndex]) { int low = _nodeFirstOutcome[currentNodeIndex]; int high = _nodeFirstOutcome[currentNodeIndex+1]-1; while (low <= high) { int mid = (high + low)/2; if (_outcomeSymbol[mid] == symbolID) return backoffAccumulator + _outcomeLogEstimate[mid]; else if (_outcomeSymbol[mid] < symbolID) low = (low == mid ? mid+1 : mid); else high = (high == mid ? mid-1 : mid); } backoffAccumulator += _nodeLogOneMinusLambda[currentNodeIndex]; } return (backoffAccumulator + uniformEstimate); } /** * Returns the index in the trie of the child node of the * specified parent node index having the specified symbol ID * or -1 if it does not exist. * * @param symbolID Identifier of symbol of the child sought. * @param parentNodeIndex Index of parent node from which to * search for a child with specified identifier. * @return Index of the child node of the specified parent * node with specified symbol identifier. */ public int lookupChild(int symbolID, int parentNodeIndex) { int low = _nodeFirstChild[parentNodeIndex]; int high = _nodeFirstChild[parentNodeIndex+1]-1; if (symbolID < 0) return -1; while (low <= high) { int mid = (high + low)/2; if (_nodeSymbol[mid] == symbolID) return mid; else if (_nodeSymbol[mid] < symbolID) low = (low == mid ? mid+1 : mid); else high = (high == mid ? mid-1 : mid); } return -1; } }




© 2015 - 2025 Weber Informatics LLC | Privacy Policy