weka.classifiers.functions.supportVector.StringKernel

The Waikato Environment for Knowledge Analysis (WEKA), a machine learning workbench. This version represents the developer version, the "bleeding edge" of development, you could say. New functionality gets added to this version.

/*
 *   This program is free software: you can redistribute it and/or modify
 *   it under the terms of the GNU General Public License as published by
 *   the Free Software Foundation, either version 3 of the License, or
 *   (at your option) any later version.
 *
 *   This program is distributed in the hope that it will be useful,
 *   but WITHOUT ANY WARRANTY; without even the implied warranty of
 *   MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 *   GNU General Public License for more details.
 *
 *   You should have received a copy of the GNU General Public License
 *   along with this program.  If not, see <http://www.gnu.org/licenses/>.
 */

/*
 * StringKernel.java
 * Copyright (C) 2006-2012 University of Waikato, Hamilton, New Zealand
 */

package weka.classifiers.functions.supportVector;

import java.util.Collections;
import java.util.Enumeration;
import java.util.Vector;

import weka.core.Attribute;
import weka.core.Capabilities;
import weka.core.Capabilities.Capability;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Option;
import weka.core.RevisionUtils;
import weka.core.SelectedTag;
import weka.core.Tag;
import weka.core.TechnicalInformation;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformation.Type;
import weka.core.TechnicalInformationHandler;
import weka.core.Utils;

/**
 * Implementation of the subsequence kernel (SSK) as described in [1] and of
 * the subsequence kernel with lambda pruning (SSK-LP) as described in [2].
 *
 * For more information, see
 *
 * [1] Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini,
 * Christopher J. C. H. Watkins (2002). Text Classification using String
 * Kernels. Journal of Machine Learning Research. 2:419-444.
 *
 * [2] F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for
 * WEKA. Wien, Austria.
 *
 * BibTeX:
 *
 * @article{Lodhi2002,
 *    author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins},
 *    journal = {Journal of Machine Learning Research},
 *    pages = {419-444},
 *    title = {Text Classification using String Kernels},
 *    volume = {2},
 *    year = {2002},
 *    HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html}
 * }
 * 
 * @techreport{Kleedorfer2005,
 *    address = {Wien, Austria},
 *    author = {F. Kleedorfer and A. Seewald},
 *    institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence},
 *    number = {TR-2005-13},
 *    title = {Implementation of a String Kernel for WEKA},
 *    year = {2005}
 * }
 * 
 * Valid options are:
 *
 * -D
 *  Enables debugging output (if available) to be printed.
 *  (default: off)
 *
 * -P <0|1>
 *  The pruning method to use:
 *  0 = No pruning
 *  1 = Lambda pruning
 *  (default: 0)
 *
 * -C <num>
 *  The size of the cache (a prime number).
 *  (default: 250007)
 *
 * -IC <num>
 *  The size of the internal cache (a prime number).
 *  (default: 200003)
 *
 * -L <num>
 *  The lambda constant. Penalizes non-continuous subsequence
 *  matches. Must be in (0,1).
 *  (default: 0.5)
 *
 * -ssl <num>
 *  The length of the subsequence.
 *  (default: 3)
 *
 * -ssl-max <num>
 *  The maximum length of the subsequence.
 *  (default: 9)
 *
 * -N
 *  Use normalization.
 *  (default: no)
 *
 * Theory
 *
 * Overview
 *
 * The algorithm computes a measure of similarity between two texts based on the
 * number and form of their common subsequences, which need not be contiguous.
 * The method is parametrized by the subsequence length k, the penalty factor
 * lambda, which penalizes non-contiguous matches, and optional 'lambda
 * pruning', which takes the maximum lambda exponent (maxLambdaExponent, m) as
 * its parameter. Lambda pruning causes very 'stretched' subsequence matches
 * not to be counted, thus speeding up the computation. The functionality of
 * SSK and SSK-LP is explained below using simple examples.
 *
 * Explanation & Examples
 *
 * For all of the following examples, we assume these parameter values:
 *
 * k=2
 * lambda=0.5
 * m=8 (for SSK-LP examples)
 * 
 * SSK
 *
 * Example 1
 *
 * SSK(2,"ab","axb")=0.5^5 = 0.03125
 *
 * There is one subsequence of length 2 that both strings have in common, "ab".
 * The result of SSK is computed by raising lambda to the power of L, where L is
 * the length of the subsequence match in one string plus the length of the
 * subsequence match in the other. In our case:
 *
 *    ab    axb
 * L= 2  +   3 = 5
 * 
 * Hence, the kernel yields 0.5^5 = 0.03125.
 *
 * Example 2
 *
 * SSK(2,"ab","abb")=0.5^5 + 0.5^4 = 0.09375
 *
 * Here, too, the strings have one subsequence of length 2 in common, "ab".
 * SSK sums the values computed for every occurrence of a common subsequence
 * match. In this example, there are two such occurrences:
 *
 * ab    abb
 * --    --  L=4
 * --    - - L=5
 * 
 * We have two matches, one of length 2+2=4 and one of length 2+3=5, so the
 * result is 0.5^5 + 0.5^4 = 0.09375.
 *
 * SSK-LP
 *
 * Without lambda pruning, the string kernel finds *all* common subsequences of
 * the given length, whereas with lambda pruning, common subsequence matches
 * that are stretched too far across both strings are not taken into account.
 * The argument is that the value contributed by such a match,
 * lambda^(length[match_in_s] + length[match_in_t]), is too low to matter. Tests
 * have shown that a tremendous speedup can be achieved using this technique
 * with very little loss of quality.
 *
 * Lambda pruning is parametrized by the maximum lambda exponent. As a rule of
 * thumb, choose that value to be about 3 or 4 times the subsequence length.
 * YMMV.
 *
 * Example 3
 *
 * Without lambda pruning, one common subsequence, "AB", would be found in the
 * following two strings (with k=2):
 *
 * SSK(2,"AxxxxxxxxxB","AyB")=0.5^14 = 0.00006103515625
 *
 * Lambda pruning allows for control of the match length. So, if m (the maximum
 * lambda exponent) is, e.g., 8, these two strings yield a kernel value of 0:
 *
 * with lambda pruning:    SSK-LP(2,8,"AxxxxxxxxxB","AyB")= 0
 * without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")= 0.5^14 = 0.00006103515625
 *
 * This is because the exponent for lambda (= the combined length of the
 * subsequence match in both strings) would be 14, which is > 8. In contrast,
 * the next result is > 0
 *
 * m=8
 * SSK-LP(2,8,"AxxB","AyyB")=0.5^8 = 0.00390625
 *
 * because the lambda exponent would be 8, which is just accepted by lambda
 * pruning.
 *
 * Normalization
 *
 * When the string kernel is used for its main purpose, as the kernel of a
 * support vector machine, it is not normalized. The normalized kernel can be
 * switched on with the -N option listed above (normalization in feature
 * space) but is much slower. As with most unnormalized kernels, K(x,x) is not
 * a fixed value; see the next example.
 *
 * Example 4
 *
 * SSK(2,"ab","ab")=0.5^4 = 0.0625
 * SSK(2,"AxxxxxxxxxB","AxxxxxxxxxB") = 12.761724710464478
 * 
 * SSK is evaluated twice, each time for two identical strings. A good measure
 * of similarity would produce the same value in both cases, since both indicate
 * the same level of similarity. The value of the normalized SSK would be 1.0
 * in both cases. So for the purpose of computing string similarity, the
 * normalized kernel should be used. For SVMs, the unnormalized kernel is
 * usually sufficient.
 *
 * Complexity of SSK and SSK-LP
 *
 * The time complexity of this method (without lambda pruning and with an
 * infinitely large cache) is
 *
 * O(k*|s|*|t|)
 *
 * Lambda pruning has a complexity (without caching) of
 *
 * O(m*binom(m,k)^2*(|s|+n)*|t|)
 *
 * k...          subsequence length (ssl)
 * s,t...        strings
 * |s|...        length of string s
 * binom(x,y)... binomial coefficient (x!/[(x-y)!y!])
 * m...          maxLambdaExponent (ssl-max)
 * 
 * Keep in mind that execution time can increase quickly for long strings and
 * large values of k, especially if you don't use lambda pruning. With lambda
 * pruning, computation is usually so fast that switching on the cache leads to
 * slower computation because of setup costs. Therefore caching is switched off
 * for lambda pruning.
 *
 * For details and qualitative experiments about SSK, see [1].
 * For details about lambda pruning and a performance comparison of SSK and
 * SSK-LP (SSK with lambda pruning), see [2]. Note that the complexity estimate
 * in [2] assumes no caching of intermediate results; caching has been
 * implemented in the meantime and greatly improves the speed of the SSK
 * without lambda pruning.
 *
 * Notes for usage within Weka
 *
 * Only instances of the following form can be processed using string kernels:
 *
 * +----------+-------------+---------------+
 * |attribute#|     0       |       1       |
 * +----------+-------------+---------------+
 * | content  | [text data] | [class label] |
 * +----------------------------------------+
 *  ... or ...
 * +----------+---------------+-------------+
 * |attribute#|     0         |     1       |
 * +----------+---------------+-------------+
 * | content  | [class label] | [text data] |
 * +----------------------------------------+
 * 
 * @author Florian Kleedorfer ([email protected])
 * @author Alexander K. Seewald ([email protected])
 * @version $Revision: 14512 $
 */
public class StringKernel extends Kernel implements
  TechnicalInformationHandler {

  /** for serialization */
  private static final long serialVersionUID = -4902954211202690123L;

  /** The size of the cache (a prime number) */
  private int m_cacheSize = 250007;

  /** The size of the internal cache for intermediate results (a prime number) */
  private int m_internalCacheSize = 200003;

  /** The attribute number of the string attribute */
  private int m_strAttr;

  /** Kernel cache (i.e., cache for kernel evaluations) */
  private double[] m_storage;
  private long[] m_keys;

  /** Counts the number of kernel evaluations. */
  private int m_kernelEvals;

  /** The number of instance in the dataset */
  private int m_numInsts;

  /** Pruning method: No Pruning */
  public final static int PRUNING_NONE = 0;

  /** Pruning method: Lambda See [2] for details. */
  public final static int PRUNING_LAMBDA = 1;

  /** Pruning methods */
  public static final Tag[] TAGS_PRUNING = {
    new Tag(PRUNING_NONE, "No pruning"),
    new Tag(PRUNING_LAMBDA, "Lambda pruning"), };

  /** the pruning method */
  protected int m_PruningMethod = PRUNING_NONE;

  /**
   * the decay factor that penalizes non-continuous substring matches. See [1]
   * for details.
   */
  protected double m_lambda = 0.5;

  /** The substring length */
  private int m_subsequenceLength = 3;

  /** The maximum substring length for lambda pruning */
  private int m_maxSubsequenceLength = 9;

  /**
   * powers of lambda are prepared prior to kernel evaluations. all powers
   * between 0 and this value are precalculated
   */
  protected static final int MAX_POWER_OF_LAMBDA = 10000;

  /** the precalculated powers of lambda */
  protected double[] m_powersOflambda = null;

  /**
   * flag for switching normalization on or off. This defaults to false and can
   * be turned on by the switch for feature space normalization in SMO
   */
  private boolean m_normalize = false;

  /** private cache for intermediate results */
  private int maxCache; // is set in unnormalizedKernel(s1,s2)
  private double[] cachekh;
  private int[] cachekhK;
  private double[] cachekh2;
  private int[] cachekh2K;

  /** cached indexes for private cache */
  private int m_multX;
  private int m_multY;
  private int m_multZ;
  private int m_multZZ;

  private boolean m_useRecursionCache = true;

  /**
   * default constructor
   */
  public StringKernel() {
    super();
  }

  /**
   * creates a new StringKernel object. Initializes the kernel cache and the
   * 'lambda cache', i.e. the precalculated powers of lambda from lambda^2 to
   * lambda^MAX_POWER_OF_LAMBDA
   *
   * @param data the dataset to use
   * @param cacheSize the size of the cache
   * @param subsequenceLength the subsequence length
   * @param lambda the lambda value
   * @param debug whether to output debug information
   * @throws Exception if something goes wrong
   */
  public StringKernel(Instances data, int cacheSize, int subsequenceLength,
    double lambda, boolean debug) throws Exception {

    setDebug(debug);
    setCacheSize(cacheSize);
    setInternalCacheSize(200003);
    setSubsequenceLength(subsequenceLength);
    setMaxSubsequenceLength(-1);
    setLambda(lambda);

    buildKernel(data);
  }

  /**
   * Returns a string describing the kernel
   *
   * @return a description suitable for displaying in the explorer/experimenter
   *         gui
   */
  @Override
  public String globalInfo() {
    return "Implementation of the subsequence kernel (SSK) as described in [1] "
      + "and of the subsequence kernel with lambda pruning (SSK-LP) as "
      + "described in [2].\n\n" + "For more information, see\n\n"
      + getTechnicalInformation().toString();
  }

  /**
   * Returns an instance of a TechnicalInformation object, containing detailed
   * information about the technical background of this class, e.g., paper
   * reference or book this class is based on.
   *
   * @return the technical information about this class
   */
  @Override
  public TechnicalInformation getTechnicalInformation() {
    TechnicalInformation result;
    TechnicalInformation additional;

    result = new TechnicalInformation(Type.ARTICLE);
    result
      .setValue(
        Field.AUTHOR,
        "Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins");
    result.setValue(Field.YEAR, "2002");
    result.setValue(Field.TITLE, "Text Classification using String Kernels");
    result.setValue(Field.JOURNAL, "Journal of Machine Learning Research");
    result.setValue(Field.VOLUME, "2");
    result.setValue(Field.PAGES, "419-444");
    result.setValue(Field.HTTP, "http://www.jmlr.org/papers/v2/lodhi02a.html");

    additional = result.add(Type.TECHREPORT);
    additional.setValue(Field.AUTHOR, "F. Kleedorfer and A. Seewald");
    additional.setValue(Field.YEAR, "2005");
    additional.setValue(Field.TITLE,
      "Implementation of a String Kernel for WEKA");
    additional.setValue(Field.INSTITUTION,
      "Oesterreichisches Forschungsinstitut fuer Artificial Intelligence");
    additional.setValue(Field.ADDRESS, "Wien, Austria");
    additional.setValue(Field.NUMBER, "TR-2005-13");

    return result;
  }

  /**
   * Returns an enumeration describing the available options.
   *
   * @return an enumeration of all the available options.
   */
  @Override
  public Enumeration

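The SSK values quoted in Examples 1, 2 and 4 of the class comment above can be checked by brute force. The sketch below is not the algorithm StringKernel uses; it simply enumerates every index tuple of length k in both strings and sums lambda^L over all pairs that spell the same subsequence, which is only practical for short strings. The names `SskBruteForce`, `Occurrence`, `occurrences`, `collect` and `ssk` are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a brute-force subsequence kernel, not WEKA's implementation.
final class SskBruteForce {

  // One occurrence of a length-k subsequence: its characters and the length of
  // the stretch of the string it covers (last index - first index + 1).
  private static final class Occurrence {
    final String chars;
    final int span;

    Occurrence(String chars, int span) {
      this.chars = chars;
      this.span = span;
    }
  }

  // Enumerates all occurrences of length-k subsequences in s (assumes k >= 1).
  private static List<Occurrence> occurrences(String s, int k) {
    List<Occurrence> out = new ArrayList<>();
    collect(s, k, 0, -1, new StringBuilder(), out);
    return out;
  }

  private static void collect(String s, int k, int start, int first,
    StringBuilder picked, List<Occurrence> out) {
    if (picked.length() == k) {
      // 'start' is one past the last picked index, so the span is start - first.
      out.add(new Occurrence(picked.toString(), start - first));
      return;
    }
    for (int i = start; i < s.length(); i++) {
      picked.append(s.charAt(i));
      // Remember the index of the first picked character.
      collect(s, k, i + 1, picked.length() == 1 ? i : first, picked, out);
      picked.deleteCharAt(picked.length() - 1);
    }
  }

  // Unnormalized SSK: sum of lambda^(span in s + span in t) over all pairs of
  // occurrences that spell the same subsequence.
  static double ssk(int k, String s, String t, double lambda) {
    double sum = 0.0;
    List<Occurrence> inT = occurrences(t, k);
    for (Occurrence a : occurrences(s, k)) {
      for (Occurrence b : inT) {
        if (a.chars.equals(b.chars)) {
          sum += Math.pow(lambda, a.span + b.span);
        }
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(ssk(2, "ab", "axb", 0.5)); // Example 1
    System.out.println(ssk(2, "ab", "abb", 0.5)); // Example 2
    System.out.println(ssk(2, "ab", "ab", 0.5)); // Example 4, first line
  }
}
```

The argument order of `ssk` mirrors the SSK(k, s, t) notation used in the examples, with lambda passed last.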

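Lambda pruning, as described for SSK-LP in the class comment above, can be sketched the same way: a pair of matches contributes only if its lambda exponent does not exceed m. This is a minimal illustration of that rule as the examples state it, assuming the exponent is the combined span in both strings; it is not how StringKernel computes SSK-LP. The names `SskLambdaPruningSketch`, `spans` and `sskLp` are hypothetical, and index sets are enumerated with a bit mask, so the strings must be short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: brute-force SSK with a cap on the lambda exponent.
final class SskLambdaPruningSketch {

  // Maps each length-k subsequence of s to the spans of all its occurrences.
  // Positions are chosen via a bit mask, so s must be shorter than 31 chars.
  static Map<String, List<Integer>> spans(String s, int k) {
    Map<String, List<Integer>> out = new HashMap<>();
    for (int mask = 1; mask < (1 << s.length()); mask++) {
      if (Integer.bitCount(mask) != k) {
        continue;
      }
      StringBuilder u = new StringBuilder();
      for (int i = 0; i < s.length(); i++) {
        if ((mask & (1 << i)) != 0) {
          u.append(s.charAt(i));
        }
      }
      // span = highest set bit index - lowest set bit index + 1
      int span = 32 - Integer.numberOfLeadingZeros(mask)
        - Integer.numberOfTrailingZeros(mask);
      out.computeIfAbsent(u.toString(), key -> new ArrayList<>()).add(span);
    }
    return out;
  }

  // Sums lambda^exponent over all matching occurrence pairs whose exponent
  // (span in s + span in t) is at most m; longer matches are pruned.
  static double sskLp(int k, int m, String s, String t, double lambda) {
    Map<String, List<Integer>> inS = spans(s, k);
    Map<String, List<Integer>> inT = spans(t, k);
    double sum = 0.0;
    for (Map.Entry<String, List<Integer>> e : inS.entrySet()) {
      List<Integer> other = inT.get(e.getKey());
      if (other == null) {
        continue; // subsequence does not occur in t
      }
      for (int a : e.getValue()) {
        for (int b : other) {
          int exponent = a + b;
          if (exponent <= m) {
            sum += Math.pow(lambda, exponent);
          }
        }
      }
    }
    return sum;
  }

  public static void main(String[] args) {
    System.out.println(sskLp(2, 8, "AxxxxxxxxxB", "AyB", 0.5)); // pruned
    System.out.println(sskLp(2, 8, "AxxB", "AyyB", 0.5)); // just accepted
  }
}
```

Setting m at least as large as the combined length of both strings disables the pruning in this sketch, which makes it easy to compare pruned and unpruned values for the same pair.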

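The normalization discussed around Example 4 in the class comment above can be illustrated with the usual kernel normalization, K(s,t) / sqrt(K(s,s) * K(t,t)). That formula is an assumption about what the normalization option computes; it is at least consistent with the statement that identical strings get the value 1.0. The sketch hard-codes k=2 to stay short, and the names `NormalizedSskSketch`, `ssk2` and `normalized` are made up for this illustration.

```java
// Illustrative only: kernel normalization on top of a brute-force SSK for k = 2.
final class NormalizedSskSketch {

  // Unnormalized SSK for subsequence length 2, by direct enumeration of all
  // index pairs (i1 < i2) in s and (j1 < j2) in t.
  static double ssk2(String s, String t, double lambda) {
    double sum = 0.0;
    for (int i1 = 0; i1 < s.length(); i1++) {
      for (int i2 = i1 + 1; i2 < s.length(); i2++) {
        for (int j1 = 0; j1 < t.length(); j1++) {
          for (int j2 = j1 + 1; j2 < t.length(); j2++) {
            if (s.charAt(i1) == t.charAt(j1) && s.charAt(i2) == t.charAt(j2)) {
              // exponent = span of the match in s + span of the match in t
              sum += Math.pow(lambda, (i2 - i1 + 1) + (j2 - j1 + 1));
            }
          }
        }
      }
    }
    return sum;
  }

  // Assumed normalization: K(s,t) / sqrt(K(s,s) * K(t,t)).
  // Both strings need at least 2 characters, otherwise the divisor is 0.
  static double normalized(String s, String t, double lambda) {
    return ssk2(s, t, lambda)
      / Math.sqrt(ssk2(s, s, lambda) * ssk2(t, t, lambda));
  }

  public static void main(String[] args) {
    // Unnormalized self-similarity differs between the two strings ...
    System.out.println(ssk2("ab", "ab", 0.5));
    System.out.println(ssk2("AxxxxxxxxxB", "AxxxxxxxxxB", 0.5));
    // ... while the normalized value is the same for both.
    System.out.println(normalized("ab", "ab", 0.5));
    System.out.println(normalized("AxxxxxxxxxB", "AxxxxxxxxxB", 0.5));
  }
}
```

Note that this form of normalization needs three kernel evaluations per pair, which fits the remark in the class comment that the normalized kernel is much slower.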
