The Waikato Environment for Knowledge Analysis (WEKA) is a machine
learning workbench. This is the developer version, the "bleeding edge"
of development: new functionality is added to this version first.
/*
* This program is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
 * along with this program. If not, see <http://www.gnu.org/licenses/>.
*/
/*
* StringKernel.java
* Copyright (C) 2006-2012 University of Waikato, Hamilton, New Zealand
*/
package weka.classifiers.functions.supportVector;
import java.util.Collections;
import java.util.Enumeration;
import java.util.Vector;
import weka.core.Attribute;
import weka.core.Capabilities;
import weka.core.Capabilities.Capability;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.Option;
import weka.core.RevisionUtils;
import weka.core.SelectedTag;
import weka.core.Tag;
import weka.core.TechnicalInformation;
import weka.core.TechnicalInformation.Field;
import weka.core.TechnicalInformation.Type;
import weka.core.TechnicalInformationHandler;
import weka.core.Utils;
/**
* Implementation of the subsequence kernel (SSK) as
* described in [1] and of the subsequence kernel with lambda pruning (SSK-LP)
* as described in [2].
*
* For more information, see
*
* Huma Lodhi, Craig Saunders, John Shawe-Taylor, Nello Cristianini, Christopher
* J. C. H. Watkins (2002). Text Classification using String Kernels. Journal of
* Machine Learning Research. 2:419-444.
*
* F. Kleedorfer, A. Seewald (2005). Implementation of a String Kernel for WEKA.
* Wien, Austria.
*
*
*
* BibTeX:
*
*
* @article{Lodhi2002,
* author = {Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins},
* journal = {Journal of Machine Learning Research},
* pages = {419-444},
* title = {Text Classification using String Kernels},
* volume = {2},
* year = {2002},
* HTTP = {http://www.jmlr.org/papers/v2/lodhi02a.html}
* }
*
* @techreport{Kleedorfer2005,
* address = {Wien, Austria},
* author = {F. Kleedorfer and A. Seewald},
* institution = {Oesterreichisches Forschungsinstitut fuer Artificial Intelligence},
* number = {TR-2005-13},
* title = {Implementation of a String Kernel for WEKA},
* year = {2005}
* }
*
*
*
*
* Valid options are:
*
*
*
* -D
* Enables debugging output (if available) to be printed.
* (default: off)
*
*
*
* -P <0|1>
* The pruning method to use:
* 0 = No pruning
* 1 = Lambda pruning
* (default: 0)
*
*
*
* -C <num>
* The size of the cache (a prime number).
* (default: 250007)
*
*
*
* -IC <num>
* The size of the internal cache (a prime number).
* (default: 200003)
*
*
*
* -L <num>
* The lambda constant. Penalizes non-continuous subsequence
* matches. Must be in (0,1).
* (default: 0.5)
*
*
*
* -ssl <num>
* The length of the subsequence.
* (default: 3)
*
*
*
* -ssl-max <num>
* The maximum length of the subsequence.
* (default: 9)
*
*
*
* -N
* Use normalization.
* (default: no)
*
*
*
*
*
 * Theory
 *
 * Overview
* The algorithm computes a measure of similarity between two texts based on the
* number and form of their common subsequences, which need not be contiguous.
 * This method can be parametrized by specifying the subsequence length k, the
 * penalty factor lambda, which penalizes non-contiguous matches, and optional
 * 'lambda pruning', which takes the maximum lambda exponent, m, as a
 * parameter. Lambda pruning causes very 'stretched' substring matches not to
 * be counted, thus speeding up the computation. The functionality of SSK and
 * SSK-LP is explained in the following using simple examples.
*
*
 * Explanation & Examples
 * For all of the following examples, we assume these parameter values:
*
*
* k=2
* lambda=0.5
* m=8 (for SSK-LP examples)
*
*
*
 * SSK
 *
 *
 * Example 1
*
*
 * SSK(2,"ab","axb")=0.5^5 = 0.03125
*
*
 * There is one subsequence of length 2 that both strings have in common,
 * "ab". The result of SSK is computed by raising lambda to the power of L,
 * where L is the length of the subsequence match in the one string plus the
 * length of the subsequence match in the other; in our case:
*
*
* ab axb
* L= 2 + 3 = 5
*
*
 * Hence, the kernel yields 0.5^5 = 0.03125.
*
*
 * Example 2
*
*
 * SSK(2,"ab","abb")=0.5^5 + 0.5^4 = 0.09375
*
*
 * Here, we also have one subsequence of length 2 that both strings have
 * in common, "ab". The result of SSK is actually computed by summing over all
 * values computed for each occurrence of a common subsequence match. In this
 * example, there are two possible cases:
*
*
* ab abb
* -- -- L=4
* -- - - L=5
*
*
 * We have two matches, one of length 2+2=4 and one of length 2+3=5,
 * so we get the result 0.5^5 + 0.5^4 = 0.09375.
*
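The two SSK examples above can be checked with a brute-force formulation that enumerates index pairs directly, exactly as the L values are derived in the text. This is a self-contained sketch for illustration only (exponential in k); the WEKA class itself uses a much faster dynamic-programming implementation.

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSSK {
    // Brute-force SSK: sum lambda^(span_s + span_t) over all pairs of
    // strictly increasing index tuples that pick the same length-k
    // subsequence in s and t. span = last index - first index + 1.
    static double ssk(int k, String s, String t, double lambda) {
        double sum = 0.0;
        for (int[] is : tuples(s.length(), k)) {
            for (int[] it : tuples(t.length(), k)) {
                if (sameSubsequence(s, is, t, it)) {
                    int spanS = is[k - 1] - is[0] + 1;
                    int spanT = it[k - 1] - it[0] + 1;
                    sum += Math.pow(lambda, spanS + spanT);
                }
            }
        }
        return sum;
    }

    static boolean sameSubsequence(String s, int[] is, String t, int[] it) {
        for (int i = 0; i < is.length; i++)
            if (s.charAt(is[i]) != t.charAt(it[i])) return false;
        return true;
    }

    // all strictly increasing index tuples of length k drawn from 0..n-1
    static List<int[]> tuples(int n, int k) {
        List<int[]> out = new ArrayList<>();
        collect(out, new int[k], 0, 0, n);
        return out;
    }

    static void collect(List<int[]> out, int[] buf, int pos, int start, int n) {
        if (pos == buf.length) { out.add(buf.clone()); return; }
        for (int i = start; i < n; i++) {
            buf[pos] = i;
            collect(out, buf, pos + 1, i + 1, n);
        }
    }

    public static void main(String[] args) {
        System.out.println(ssk(2, "ab", "axb", 0.5)); // Example 1: 0.03125
        System.out.println(ssk(2, "ab", "abb", 0.5)); // Example 2: 0.09375
    }
}
```

Running this reproduces the values worked out by hand above: 0.5^5 = 0.03125 for Example 1, and 0.5^4 + 0.5^5 = 0.09375 for Example 2.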
*
 * SSK-LP
 * Without lambda pruning, the string kernel finds *all* common subsequences of
 * the given length, whereas with lambda pruning, common subsequence matches
 * that are stretched too much in both strings are not taken into account. It
 * is argued that the value yielded for such a common subsequence,
 * lambda^(length[match_in_s] + length[match_in_t]), is too low to be relevant.
 * Tests have shown that a tremendous speedup can be achieved using this
 * technique while suffering very little quality loss.
 * Lambda pruning is parametrized by the maximum lambda exponent. As a rule of
 * thumb, it is a good idea to choose that value to be about 3 or 4 times the
 * subsequence length. YMMV.
*
*
 * Example 3
 * Without lambda pruning, one common subsequence, "AB", would be found in the
 * following two strings (with k=2):
 *
 *
 * SSK(2,"AxxxxxxxxxB","AyB")=0.5^14 = 0.00006103515625
*
*
 * Lambda pruning allows for the control of the match length. So, if m (the
 * maximum lambda exponent) is e.g. 8, these two strings would yield a kernel
 * value of 0:
*
*
 * with lambda pruning: SSK-LP(2,8,"AxxxxxxxxxB","AyB")= 0
 * without lambda pruning: SSK(2,"AxxxxxxxxxB","AyB")= 0.5^14 = 0.00006103515625
*
*
 * This is because the exponent for lambda (= the length of the subsequence
 * match) would be 14, which is > 8. In contrast, a common subsequence whose
 * lambda exponent is exactly 8 (for example, "AB" matched in "AxxB" and
 * "AyyB", giving 4 + 4 = 8) is just accepted by lambda pruning, so its
 * kernel value is > 0.
*
*
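Lambda pruning can be illustrated with a brute-force sketch that simply skips matches whose total span, and hence whose lambda exponent, exceeds m. This is a self-contained illustration of Example 3 (exponential in k; not the dynamic-programming implementation this class uses):

```java
import java.util.ArrayList;
import java.util.List;

public class NaiveSSKLP {
    // Brute-force SSK with optional lambda pruning: a match whose exponent
    // span_s + span_t exceeds maxLambdaExponent contributes nothing.
    // Pass maxLambdaExponent <= 0 to disable pruning.
    static double ssk(int k, String s, String t, double lambda, int maxLambdaExponent) {
        double sum = 0.0;
        for (int[] is : tuples(s.length(), k))
            for (int[] it : tuples(t.length(), k)) {
                if (!same(s, is, t, it)) continue;
                int exponent = (is[k - 1] - is[0] + 1) + (it[k - 1] - it[0] + 1);
                if (maxLambdaExponent > 0 && exponent > maxLambdaExponent)
                    continue; // lambda pruning: match too stretched
                sum += Math.pow(lambda, exponent);
            }
        return sum;
    }

    static boolean same(String s, int[] is, String t, int[] it) {
        for (int i = 0; i < is.length; i++)
            if (s.charAt(is[i]) != t.charAt(it[i])) return false;
        return true;
    }

    static List<int[]> tuples(int n, int k) {
        List<int[]> out = new ArrayList<>();
        rec(out, new int[k], 0, 0, n);
        return out;
    }

    static void rec(List<int[]> out, int[] buf, int pos, int start, int n) {
        if (pos == buf.length) { out.add(buf.clone()); return; }
        for (int i = start; i < n; i++) {
            buf[pos] = i;
            rec(out, buf, pos + 1, i + 1, n);
        }
    }

    public static void main(String[] args) {
        // Example 3: the only common subsequence "AB" has exponent 11 + 3 = 14
        System.out.println(ssk(2, "AxxxxxxxxxB", "AyB", 0.5, 8)); // pruned (14 > 8): 0.0
        System.out.println(ssk(2, "AxxxxxxxxxB", "AyB", 0.5, 0)); // unpruned: 0.5^14
    }
}
```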
 * Normalization
* When the string kernel is used for its main purpose, as the kernel of a
* support vector machine, it is not normalized. The normalized kernel can be
* switched on by -F (feature space normalization) but is much slower. Like most
 * unnormalized kernels, K(x,x) is not a fixed value, as the following example
 * shows.
*
*
*
 * Suppose SSK is evaluated twice, each time for two identical strings. A good
 * measure of similarity should produce the same value in both cases,
 * indicating the same level of similarity, but the unnormalized SSK generally
 * does not. The value of the normalized SSK would be 1.0 in both cases. So
 * for the purpose of computing string similarity the normalized kernel should
 * be used. For SVM the unnormalized kernel is usually sufficient.
*
*
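Normalization divides each kernel value by the geometric mean of the self-similarities, K_norm(x,y) = K(x,y) / sqrt(K(x,x) * K(y,y)), which forces K_norm(x,x) = 1. A small sketch using the strings of Example 2; the three unnormalized values below are derived by hand with the sum-over-matches method explained above, not taken from this class:

```java
public class NormalizedSSK {
    public static void main(String[] args) {
        // Unnormalized SSK values for k=2, lambda=0.5, each a sum of
        // lambda^(span_s + span_t) over common subsequence matches:
        double kXY = 0.09375;  // K("ab","abb"), as in Example 2
        double kXX = 0.0625;   // K("ab","ab"): one match, 0.5^(2+2)
        double kYY = 0.203125; // K("abb","abb"): five index-pair matches
        // normalized kernel: K(x,y) / sqrt(K(x,x) * K(y,y))
        double norm = kXY / Math.sqrt(kXX * kYY);
        System.out.println(norm);                       // a similarity in (0,1)
        System.out.println(kXX / Math.sqrt(kXX * kXX)); // identical strings -> 1.0
    }
}
```

Note how K("ab","ab") and K("abb","abb") differ even though both are self-similarities; only the normalized kernel maps both to 1.0.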
 * Complexity of SSK and SSK-LP
* The time complexity of this method (without lambda pruning and with an
* infinitely large cache) is
*
*
* O(k*|s|*|t|)
*
*
 * Lambda pruning has a higher worst-case complexity when no caching is used;
 * see [2] for the exact bound.
 *
 * Keep in mind that execution time can increase quickly for long strings and
 * large values of k, especially if you don't use lambda pruning. With lambda
 * pruning, computation is usually so fast that switching on the cache leads to
 * slower computation because of setup costs. Therefore caching is switched off
 * for lambda pruning.
*
 * For details and qualitative experiments about SSK, see [1].
 * For details about lambda pruning and a performance comparison of SSK and
 * SSK-LP (SSK with lambda pruning), see [2]. Note that the complexity
 * estimation in [2] assumes no caching of intermediate results, which has been
 * implemented in the meantime and greatly improves the speed of the SSK
 * without lambda pruning.
*
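The O(k*|s|*|t|) bound comes from the dynamic-programming recursion of Lodhi et al. [1], which can be sketched as follows. This is a self-contained illustration; the variable names are mine and do not correspond to the internals of this class:

```java
import java.util.Arrays;

// Dynamic-programming SSK following the recursion of Lodhi et al. [1]:
//   K'_0(s,t) = 1
//   K'_l(s.x, t.u) uses a running K'' accumulator over prefixes of t
//   K_k(s,t) sums lambda^2 * K'_{k-1} over all matching character pairs.
// Three nested loops of sizes k, |s|, |t| give the O(k*|s|*|t|) bound.
public class DynProgSSK {
    static double ssk(int k, String s, String t, double lambda) {
        int n = s.length(), m = t.length();
        // kp[l][i][j] = K'_l evaluated on prefixes s[0..i) and t[0..j)
        double[][][] kp = new double[k][n + 1][m + 1];
        for (double[] row : kp[0]) Arrays.fill(row, 1.0); // K'_0 = 1 everywhere
        for (int l = 1; l < k; l++) {
            for (int i = 1; i <= n; i++) {
                double kpp = 0.0; // running K''_l for the current s-prefix
                for (int j = 1; j <= m; j++) {
                    kpp *= lambda;
                    if (s.charAt(i - 1) == t.charAt(j - 1))
                        kpp += lambda * lambda * kp[l - 1][i - 1][j - 1];
                    kp[l][i][j] = lambda * kp[l][i - 1][j] + kpp;
                }
            }
        }
        // K_k(s,t): sum lambda^2 * K'_{k-1} over all matching character pairs
        double result = 0.0;
        for (int i = 1; i <= n; i++)
            for (int j = 1; j <= m; j++)
                if (s.charAt(i - 1) == t.charAt(j - 1))
                    result += lambda * lambda * kp[k - 1][i - 1][j - 1];
        return result;
    }

    public static void main(String[] args) {
        System.out.println(ssk(2, "ab", "axb", 0.5)); // Example 1: 0.03125
        System.out.println(ssk(2, "ab", "abb", 0.5)); // Example 2: 0.09375
    }
}
```

The recursion reproduces the Example 1 and Example 2 values while touching each (l, i, j) cell once, in contrast to the exponential enumeration of all index tuples.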
*
 * Notes for usage within Weka
 * Only instances that contain a string attribute can be processed using
 * string kernels; the kernel evaluates a single string attribute of the
 * dataset.
 *
* @author Florian Kleedorfer ([email protected])
* @author Alexander K. Seewald ([email protected])
* @version $Revision: 14512 $
*/
public class StringKernel extends Kernel implements TechnicalInformationHandler {
/** for serialization */
private static final long serialVersionUID = -4902954211202690123L;
/** The size of the cache (a prime number) */
private int m_cacheSize = 250007;
/** The size of the internal cache for intermediate results (a prime number) */
private int m_internalCacheSize = 200003;
/** The attribute number of the string attribute */
private int m_strAttr;
/** Kernel cache (i.e., cache for kernel evaluations) */
private double[] m_storage;
/** The keys of the kernel cache */
private long[] m_keys;
/** Counts the number of kernel evaluations. */
private int m_kernelEvals;
/** The number of instances in the dataset */
private int m_numInsts;
/** Pruning method: No Pruning */
public final static int PRUNING_NONE = 0;
/** Pruning method: Lambda pruning. See [2] for details. */
public final static int PRUNING_LAMBDA = 1;
/** Pruning methods */
public static final Tag[] TAGS_PRUNING = {
new Tag(PRUNING_NONE, "No pruning"),
new Tag(PRUNING_LAMBDA, "Lambda pruning"), };
/** the pruning method */
protected int m_PruningMethod = PRUNING_NONE;
/**
* the decay factor that penalizes non-continuous substring matches. See [1]
* for details.
*/
protected double m_lambda = 0.5;
/** The substring length */
private int m_subsequenceLength = 3;
/** The maximum substring length for lambda pruning */
private int m_maxSubsequenceLength = 9;
/**
 * Powers of lambda are prepared prior to kernel evaluations. All powers
 * between 0 and this value are precalculated.
*/
protected static final int MAX_POWER_OF_LAMBDA = 10000;
/** the precalculated powers of lambda */
protected double[] m_powersOflambda = null;
/**
* flag for switching normalization on or off. This defaults to false and can
* be turned on by the switch for feature space normalization in SMO
*/
private boolean m_normalize = false;
/** private cache for intermediate results */
private int maxCache; // is set in unnormalizedKernel(s1,s2)
private double[] cachekh;
private int[] cachekhK;
private double[] cachekh2;
private int[] cachekh2K;
/** cached indexes for private cache */
private int m_multX;
private int m_multY;
private int m_multZ;
private int m_multZZ;
private boolean m_useRecursionCache = true;
/**
* default constructor
*/
public StringKernel() {
super();
}
/**
* creates a new StringKernel object. Initializes the kernel cache and the
* 'lambda cache', i.e. the precalculated powers of lambda from lambda^2 to
* lambda^MAX_POWER_OF_LAMBDA
*
* @param data the dataset to use
* @param cacheSize the size of the cache
* @param subsequenceLength the subsequence length
* @param lambda the lambda value
* @param debug whether to output debug information
* @throws Exception if something goes wrong
*/
public StringKernel(Instances data, int cacheSize, int subsequenceLength,
double lambda, boolean debug) throws Exception {
setDebug(debug);
setCacheSize(cacheSize);
setInternalCacheSize(200003);
setSubsequenceLength(subsequenceLength);
setMaxSubsequenceLength(-1);
setLambda(lambda);
buildKernel(data);
}
/**
* Returns a string describing the kernel
*
* @return a description suitable for displaying in the explorer/experimenter
* gui
*/
@Override
public String globalInfo() {
return "Implementation of the subsequence kernel (SSK) as described in [1] "
+ "and of the subsequence kernel with lambda pruning (SSK-LP) as "
+ "described in [2].\n\n"
+ "For more information, see\n\n"
+ getTechnicalInformation().toString();
}
/**
* Returns an instance of a TechnicalInformation object, containing detailed
* information about the technical background of this class, e.g., paper
* reference or book this class is based on.
*
* @return the technical information about this class
*/
@Override
public TechnicalInformation getTechnicalInformation() {
TechnicalInformation result;
TechnicalInformation additional;
result = new TechnicalInformation(Type.ARTICLE);
result
.setValue(
Field.AUTHOR,
"Huma Lodhi and Craig Saunders and John Shawe-Taylor and Nello Cristianini and Christopher J. C. H. Watkins");
result.setValue(Field.YEAR, "2002");
result.setValue(Field.TITLE, "Text Classification using String Kernels");
result.setValue(Field.JOURNAL, "Journal of Machine Learning Research");
result.setValue(Field.VOLUME, "2");
result.setValue(Field.PAGES, "419-444");
result.setValue(Field.HTTP, "http://www.jmlr.org/papers/v2/lodhi02a.html");
additional = result.add(Type.TECHREPORT);
additional.setValue(Field.AUTHOR, "F. Kleedorfer and A. Seewald");
additional.setValue(Field.YEAR, "2005");
additional.setValue(Field.TITLE,
"Implementation of a String Kernel for WEKA");
additional.setValue(Field.INSTITUTION,
"Oesterreichisches Forschungsinstitut fuer Artificial Intelligence");
additional.setValue(Field.ADDRESS, "Wien, Austria");
additional.setValue(Field.NUMBER, "TR-2005-13");
return result;
}
/**
* Returns an enumeration describing the available options.
*
* @return an enumeration of all the available options.
*/
@Override
public Enumeration