The Waikato Environment for Knowledge Analysis (WEKA) is a machine
learning workbench. This is the developer version, the "bleeding edge"
of development: new functionality is added to this version first.
/*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
/*
* StringKernel.java
* Copyright (C) 2006-2012 University of Waikato, Hamilton, New Zealand
*/
package weka.classifiers.functions.supportVector;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Vector;
import weka.core.Attribute;
import weka.core.Capabilities;
import weka.core.Capabilities.Capability;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Option;
import weka.core.RevisionUtils;
import weka.core.SelectedTag;
import weka.core.Tag;
import weka.core.TechnicalInformation;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformation.Type;
import weka.core.TechnicalInformationHandler;
import weka.core.Utils;
/**
* Implementation of the subsequence kernel (SSK) as
* described in [1] and of the subsequence kernel with lambda pruning (SSK-LP)
* as described in [2].
*
* For more information, see
*
* Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Christopher
* J. C. H. Watkins (2002). Text Classification using String Kernels. Journal of
* Machine Learning Research. 2:419-444.
*
* F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for WEKA.
* Wien, Austria.
*
*
*
* BibTeX:
*
*
* @article{Lodhi2002,
* author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins},
* journal = {Journal of Machine Learning Research},
* pages = {419-444},
* title = {Text Classification using String Kernels},
* volume = {2},
* year = {2002},
* HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html}
* }
*
* @techreport{Kleedorfer2005,
* address = {Wien, Austria},
* author = {F. Kleedorfer and A. Seewald},
* institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence},
* number = {TR-2005-13},
* title = {Implementation of a String Kernel for WEKA},
* year = {2005}
* }
*
*
*
*
* Valid options are:
*
*
*
* -D
* Enables debugging output (if available) to be printed.
* (default: off)
*
*
*
* -P <0|1>
* The pruning method to use:
* 0 = No pruning
* 1 = Lambda pruning
* (default: 0)
*
*
*
* -C <num>
* The size of the cache (a prime number).
* (default: 250007)
*
*
*
* -IC <num>
* The size of the internal cache (a prime number).
* (default: 200003)
*
*
*
* -L <num>
* The lambda constant. Penalizes non-continuous subsequence
* matches. Must be in (0,1).
* (default: 0.5)
*
*
*
* -ssl <num>
* The length of the subsequence.
* (default: 3)
*
*
*
* -ssl-max <num>
* The maximum length of the subsequence.
* (default: 9)
*
*
*
* -N
* Use normalization.
* (default: no)
*
*
*
*
*
 * Theory
 *
 * Overview
* The algorithm computes a measure of similarity between two texts based on the
* number and form of their common subsequences, which need not be contiguous.
 * This method can be parametrized by specifying the subsequence length k, the
 * penalty factor lambda, which penalizes non-contiguous matches, and optional
 * 'lambda pruning', which takes the maximum lambda exponent, m, as a
 * parameter. Lambda pruning causes very 'stretched' substring matches not to
 * be counted, thus speeding up the computation. The functionality of SSK and
 * SSK-LP is explained in the following using simple examples.
*
*
 * Explanation & Examples
 * For all of the following examples, we assume these parameter values:
*
*
* k=2
* lambda=0.5
* m=8 (for SSK-LP examples)
*
*
*
 * SSK
 *
 *
 * Example 1
*
*
 * SSK(2,"ab","axb")=0.5^5 = 0.03125
*
*
 * There is one subsequence of length 2 that both strings have in common,
 * "ab". The result of SSK is computed by raising lambda to the power of L,
 * where L is the length of the subsequence match in the one string plus the
 * length of the subsequence match in the other; in our case:
*
*
* ab axb
* L= 2 + 3 = 5
*
*
 * Hence, the kernel yields 0.5^5 = 0.03125.
*
*
 * Example 2
*
*
 * SSK(2,"ab","abb")=0.5^5 + 0.5^4 = 0.09375
*
*
 * Here, we also have one subsequence of length 2 that both strings have
 * in common, "ab". The result of SSK is actually computed by summing over all
 * values computed for each occurrence of a common subsequence match. In this
 * example, there are two possible cases:
*
*
* ab abb
* -- -- L=4
* -- - - L=5
*
*
 * We have two matches, one of length 2+2=4 and one of length 2+3=5,
 * so we get the result 0.5^5 + 0.5^4 = 0.09375.
*
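The two SSK examples above can be checked with a brute-force formulation that enumerates index pairs directly, exactly as the L values are derived in the text. This is a self-contained sketch for illustration only (exponential in k); the WEKA class itself uses a much faster dynamic-programming implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSSK {
    // Brute-force SSK: sum lambda^(span_s + span_t) over all pairs of
    // strictly increasing index tuples that pick the same length-k
    // subsequence in s and t. span = last index - first index + 1.
    static double ssk(int k, String s, String t, double lambda) {
        double sum = 0.0;
        for (int[] is : tuples(s.length(), k)) {
            for (int[] it : tuples(t.length(), k)) {
                if (sameSubsequence(s, is, t, it)) {
                    int spanS = is[k - 1] - is[0] + 1;
                    int spanT = it[k - 1] - it[0] + 1;
                    sum += Math.pow(lambda, spanS + spanT);
                }
            }
        }
        return sum;
    }

    static boolean sameSubsequence(String s, int[] is, String t, int[] it) {
        for (int i = 0; i < is.length; i++)
            if (s.charAt(is[i]) != t.charAt(it[i])) return false;
        return true;
    }

    // all strictly increasing index tuples of length k drawn from 0..n-1
    static List<int[]> tuples(int n, int k) {
        List<int[]> out = new ArrayList<>();
        collect(out, new int[k], 0, 0, n);
        return out;
    }

    static void collect(List<int[]> out, int[] buf, int pos, int start, int n) {
        if (pos == buf.length) { out.add(buf.clone()); return; }
        for (int i = start; i < n; i++) {
            buf[pos] = i;
            collect(out, buf, pos + 1, i + 1, n);
        }
    }

    public static void main(String[] args) {
        System.out.println(ssk(2, "ab", "axb", 0.5)); // Example 1: 0.03125
        System.out.println(ssk(2, "ab", "abb", 0.5)); // Example 2: 0.09375
    }
}
```

Running this reproduces the values worked out by hand above: 0.5^5 = 0.03125 for Example 1, and 0.5^4 + 0.5^5 = 0.09375 for Example 2.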
*
 * SSK-LP
 * Without lambda pruning, the string kernel finds *all* common subsequences of
 * the given length, whereas with lambda pruning, common subsequence matches
 * that are stretched too much in both strings are not taken into account. It
 * is argued that the value yielded for such a common subsequence,
 * lambda^(length[match_in_s] + length[match_in_t]), is too low to be relevant.
 * Tests have shown that a tremendous speedup can be achieved using this
 * technique while suffering very little quality loss.
 * Lambda pruning is parametrized by the maximum lambda exponent. As a rule of
 * thumb, it is a good idea to choose that value to be about 3 or 4 times the
 * subsequence length. YMMV.
*
*
 * Example 3
 * Without lambda pruning, one common subsequence, "AB", would be found in the
 * following two strings (with k=2):
 *
 *
 * SSK(2,"AxxxxxxxxxB","AyB")=0.5^14 = 0.00006103515625
*
*
 * Lambda pruning allows for the control of the match length. So, if m (the
 * maximum lambda exponent) is e.g. 8, these two strings would yield a kernel
 * value of 0:
*
*
 * with lambda pruning: SSK-LP(2,8,"AxxxxxxxxxB","AyB")= 0
 * without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")= 0.5^14 = 0.00006103515625
*
*
 * This is because the exponent for lambda (= the length of the subsequence
 * match) would be 14, which is > 8. In contrast, a common subsequence whose
 * lambda exponent is exactly 8 (for example, "AB" matched in "AxxB" and
 * "AyyB", giving 4 + 4 = 8) is just accepted by lambda pruning, so its
 * kernel value is > 0.
*
*
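Lambda pruning can be illustrated with a brute-force sketch that simply skips matches whose total span, and hence whose lambda exponent, exceeds m. This is a self-contained illustration of Example 3 (exponential in k; not the dynamic-programming implementation this class uses):

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSSKLP {
    // Brute-force SSK with optional lambda pruning: a match whose exponent
    // span_s + span_t exceeds maxLambdaExponent contributes nothing.
    // Pass maxLambdaExponent <= 0 to disable pruning.
    static double ssk(int k, String s, String t, double lambda, int maxLambdaExponent) {
        double sum = 0.0;
        for (int[] is : tuples(s.length(), k))
            for (int[] it : tuples(t.length(), k)) {
                if (!same(s, is, t, it)) continue;
                int exponent = (is[k - 1] - is[0] + 1) + (it[k - 1] - it[0] + 1);
                if (maxLambdaExponent > 0 && exponent > maxLambdaExponent)
                    continue; // lambda pruning: match too stretched
                sum += Math.pow(lambda, exponent);
            }
        return sum;
    }

    static boolean same(String s, int[] is, String t, int[] it) {
        for (int i = 0; i < is.length; i++)
            if (s.charAt(is[i]) != t.charAt(it[i])) return false;
        return true;
    }

    static List<int[]> tuples(int n, int k) {
        List<int[]> out = new ArrayList<>();
        rec(out, new int[k], 0, 0, n);
        return out;
    }

    static void rec(List<int[]> out, int[] buf, int pos, int start, int n) {
        if (pos == buf.length) { out.add(buf.clone()); return; }
        for (int i = start; i < n; i++) {
            buf[pos] = i;
            rec(out, buf, pos + 1, i + 1, n);
        }
    }

    public static void main(String[] args) {
        // Example 3: the only common subsequence "AB" has exponent 11 + 3 = 14
        System.out.println(ssk(2, "AxxxxxxxxxB", "AyB", 0.5, 8)); // pruned (14 > 8): 0.0
        System.out.println(ssk(2, "AxxxxxxxxxB", "AyB", 0.5, 0)); // unpruned: 0.5^14
    }
}
```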
 * Normalization
* When the string kernel is used for its main purpose, as the kernel of a
* support vector machine, it is not normalized. The normalized kernel can be
* switched on by -F (feature space normalization) but is much slower. Like most
 * unnormalized kernels, K(x,x) is not a fixed value, as the following example
 * shows.
*
*
*
 * Suppose SSK is evaluated twice, each time for two identical strings. A good
 * measure of similarity should produce the same value in both cases,
 * indicating the same level of similarity, but the unnormalized SSK generally
 * does not. The value of the normalized SSK would be 1.0 in both cases. So
 * for the purpose of computing string similarity the normalized kernel should
 * be used. For SVM the unnormalized kernel is usually sufficient.
*
*
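Normalization divides each kernel value by the geometric mean of the self-similarities, K_norm(x,y) = K(x,y) / sqrt(K(x,x) * K(y,y)), which forces K_norm(x,x) = 1. A small sketch using the strings of Example 2; the three unnormalized values below are derived by hand with the sum-over-matches method explained above, not taken from this class:

```java
public class NormalizedSSK {
    public static void main(String[] args) {
        // Unnormalized SSK values for k=2, lambda=0.5, each a sum of
        // lambda^(span_s + span_t) over common subsequence matches:
        double kXY = 0.09375;  // K("ab","abb"), as in Example 2
        double kXX = 0.0625;   // K("ab","ab"): one match, 0.5^(2+2)
        double kYY = 0.203125; // K("abb","abb"): five index-pair matches
        // normalized kernel: K(x,y) / sqrt(K(x,x) * K(y,y))
        double norm = kXY / Math.sqrt(kXX * kYY);
        System.out.println(norm);                       // a similarity in (0,1)
        System.out.println(kXX / Math.sqrt(kXX * kXX)); // identical strings -> 1.0
    }
}
```

Note how K("ab","ab") and K("abb","abb") differ even though both are self-similarities; only the normalized kernel maps both to 1.0.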
 * Complexity of SSK and SSK-LP
* The time complexity of this method (without lambda pruning and with an
* infinitely large cache) is
*
*
* O(k*|s|*|t|)
*
*
 * Lambda pruning has a higher worst-case complexity when no caching is used;
 * see [2] for the exact bound.
 *
 * Keep in mind that execution time can increase quickly for long strings and
 * large values of k, especially if you don't use lambda pruning. With lambda
 * pruning, computation is usually so fast that switching on the cache leads to
 * slower computation because of setup costs. Therefore caching is switched off
 * for lambda pruning.
*
 * For details and qualitative experiments about SSK, see [1].
 * For details about lambda pruning and a performance comparison of SSK and
 * SSK-LP (SSK with lambda pruning), see [2]. Note that the complexity
 * estimation in [2] assumes no caching of intermediate results, which has been
 * implemented in the meantime and greatly improves the speed of the SSK
 * without lambda pruning.
*
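The O(k*|s|*|t|) bound comes from the dynamic-programming recursion of Lodhi et al. [1], which can be sketched as follows. This is a self-contained illustration; the variable names are mine and do not correspond to the internals of this class:

```java
import java.util.Arrays;

// Dynamic-programming SSK following the recursion of Lodhi et al. [1]:
//   K'_0(s,t) = 1
//   K'_l(s.x, t.u) uses a running K'' accumulator over prefixes of t
//   K_k(s,t) sums lambda^2 * K'_{k-1} over all matching character pairs.
// Three nested loops of sizes k, |s|, |t| give the O(k*|s|*|t|) bound.
public class DynProgSSK {
    static double ssk(int k, String s, String t, double lambda) {
        int n = s.length(), m = t.length();
        // kp[l][i][j] = K'_l evaluated on prefixes s[0..i) and t[0..j)
        double[][][] kp = new double[k][n + 1][m + 1];
        for (double[] row : kp[0]) Arrays.fill(row, 1.0); // K'_0 = 1 everywhere
        for (int l = 1; l < k; l++) {
            for (int i = 1; i <= n; i++) {
                double kpp = 0.0; // running K''_l for the current s-prefix
                for (int j = 1; j <= m; j++) {
                    kpp *= lambda;
                    if (s.charAt(i - 1) == t.charAt(j - 1))
                        kpp += lambda * lambda * kp[l - 1][i - 1][j - 1];
                    kp[l][i][j] = lambda * kp[l][i - 1][j] + kpp;
                }
            }
        }
        // K_k(s,t): sum lambda^2 * K'_{k-1} over all matching character pairs
        double result = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                if (s.charAt(i - 1) == t.charAt(j - 1))
                    result += lambda * lambda * kp[k - 1][i - 1][j - 1];
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ssk(2, "ab", "axb", 0.5)); // Example 1: 0.03125
        System.out.println(ssk(2, "ab", "abb", 0.5)); // Example 2: 0.09375
    }
}
```

The recursion reproduces the Example 1 and Example 2 values while touching each (l, i, j) cell once, in contrast to the exponential enumeration of all index tuples.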
*
 * Notes for usage within Weka
 * Only instances that contain a string attribute can be processed using
 * string kernels; the kernel evaluates a single string attribute of the
 * dataset.
 *
* @author Florian Kleedorfer ([email protected])
* @author Alexander K. Seewald ([email protected])
* @version $Revision: 14512 $
*/
public class StringKernel extends Kernel implements TechnicalInformationHandler {
/** for serialization */
private static final long serialVersionUID = -4902954211202690123L;
/** The size of the cache (a prime number) */
private int m_cacheSize = 250007;
/** The size of the internal cache for intermediate results (a prime number) */
private int m_internalCacheSize = 200003;
/** The attribute number of the string attribute */
private int m_strAttr;
/** Kernel cache (i.e., cache for kernel evaluations) */
private double[] m_storage;
/** The keys of the kernel cache */
private long[] m_keys;
/** Counts the number of kernel evaluations. */
private int m_kernelEvals;
/** The number of instances in the dataset */
private int m_numInsts;
/** Pruning method: No Pruning */
public final static int PRUNING_NONE = 0;
/** Pruning method: Lambda pruning. See [2] for details. */
public final static int PRUNING_LAMBDA = 1;
/** Pruning methods */
public static final Tag[] TAGS_PRUNING = {
new Tag(PRUNING_NONE, "No pruning"),
new Tag(PRUNING_LAMBDA, "Lambda pruning"), };
/** the pruning method */
protected int m_PruningMethod = PRUNING_NONE;
/**
* the decay factor that penalizes non-continuous substring matches. See [1]
* for details.
*/
protected double m_lambda = 0.5;
/** The substring length */
private int m_subsequenceLength = 3;
/** The maximum substring length for lambda pruning */
private int m_maxSubsequenceLength = 9;
/**
 * Powers of lambda are prepared prior to kernel evaluations. All powers
 * between 0 and this value are precalculated.
*/
protected static final int MAX_POWER_OF_LAMBDA = 10000;
/** the precalculated powers of lambda */
protected double[] m_powersOflambda = null;
/**
* flag for switching normalization on or off. This defaults to false and can
* be turned on by the switch for feature space normalization in SMO
*/
private boolean m_normalize = false;
/** private cache for intermediate results */
private int maxCache; // is set in unnormalizedKernel(s1,s2)
private double[] cachekh;
private int[] cachekhK;
private double[] cachekh2;
private int[] cachekh2K;
/** cached indexes for private cache */
private int m_multX;
private int m_multY;
private int m_multZ;
private int m_multZZ;
private boolean m_useRecursionCache = true;
/**
* default constructor
*/
public StringKernel() {
super();
}
/**
* creates a new StringKernel object. Initializes the kernel cache and the
* 'lambda cache', i.e. the precalculated powers of lambda from lambda^2 to
* lambda^MAX_POWER_OF_LAMBDA
*
* @param data the dataset to use
* @param cacheSize the size of the cache
* @param subsequenceLength the subsequence length
* @param lambda the lambda value
* @param debug whether to output debug information
* @throws Exception if something goes wrong
*/
public StringKernel(Instances data, int cacheSize, int subsequenceLength,
double lambda, boolean debug) throws Exception {
setDebug(debug);
setCacheSize(cacheSize);
setInternalCacheSize(200003);
setSubsequenceLength(subsequenceLength);
setMaxSubsequenceLength(-1);
setLambda(lambda);
buildKernel(data);
}
/**
* Returns a string describing the kernel
*
* @return a description suitable for displaying in the explorer/experimenter
* gui
*/
@Override
public String globalInfo() {
return "Implementation of the subsequence kernel (SSK) as described in [1] "
+ "and of the subsequence kernel with lambda pruning (SSK-LP) as "
+ "described in [2].\n\n"
+ "For more information, see\n\n"
+ getTechnicalInformation().toString();
}
/**
* Returns an instance of a TechnicalInformation object, containing detailed
* information about the technical background of this class, e.g., paper
* reference or book this class is based on.
*
* @return the technical information about this class
*/
@Override
public TechnicalInformation getTechnicalInformation() {
TechnicalInformation result;
TechnicalInformation additional;
result = new TechnicalInformation(Type.ARTICLE);
result
.setValue(
Field.AUTHOR,
"Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins");
result.setValue(Field.YEAR, "2002");
result.setValue(Field.TITLE, "Text Classification using String Kernels");
result.setValue(Field.JOURNAL, "Journal of Machine Learning Research");
result.setValue(Field.VOLUME, "2");
result.setValue(Field.PAGES, "419-444");
result.setValue(Field.HTTP, "http://www.jmlr.org/papers/v2/lodhi02a.html");
additional = result.add(Type.TECHREPORT);
additional.setValue(Field.AUTHOR, "F. Kleedorfer and A. Seewald");
additional.setValue(Field.YEAR, "2005");
additional.setValue(Field.TITLE,
"Implementation of a String Kernel for WEKA");
additional.setValue(Field.INSTITUTION,
"Oesterreichisches Forschungsinstitut fuer Artificial Intelligence");
additional.setValue(Field.ADDRESS, "Wien, Austria");
additional.setValue(Field.NUMBER, "TR-2005-13");
return result;
}
/**
* Returns an enumeration describing the available options.
*
* @return an enumeration of all the available options.
*/
@Override
public Enumeration