/*
* LingPipe v. 4.1.0
* Copyright (C) 2003-2011 Alias-i
*
* This program is licensed under the Alias-i Royalty Free License
* Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the Alias-i
* Royalty Free License Version 1 for more details.
*
* You should have received a copy of the Alias-i Royalty Free License
* Version 1 along with this program; if not, visit
* http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
* Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
* +1 (718) 290-9170.
*/
package com.aliasi.cluster;
import com.aliasi.io.LogLevel;
import com.aliasi.io.Reporter;
import com.aliasi.io.Reporters;
import com.aliasi.stats.Statistics;
import com.aliasi.symbol.MapSymbolTable;
// import com.aliasi.util.Arrays;
import com.aliasi.util.Distance;
import com.aliasi.util.FeatureExtractor;
import com.aliasi.util.ObjectToDoubleMap;
import com.aliasi.util.SmallSet;
import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;
/**
 * A {@code KMeansClusterer} provides an implementation of
 * k-means(++) clustering based on vectors constructed by feature
 * extractors.  An instance fixes a specific value of K, the number
 * of clusters returned.  Initialization may be either by the
 * traditional k-means random assignment, or by the k-means++
 * initialization strategy.
*
* This clustering class is defined so as to be able to cluster
* arbitrary objects. These objects are converted to (sparse) vectors
* using a feature extractor specified during construction.
*
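 * As an illustrative sketch only (the element type, feature
 * extractor, and parameter values below are placeholders, not
 * defaults of this class), a typical use looks like:
 *
 *     FeatureExtractor<CharSequence> extractor = ...;   // user-supplied
 *     KMeansClusterer<CharSequence> clusterer
 *         = new KMeansClusterer<CharSequence>(extractor,
 *                                             10,      // numClusters
 *                                             100,     // maxEpochs
 *                                             true,    // k-means++ init
 *                                             0.0001); // minImprovement
 *     Set<CharSequence> elements = ...;                 // items to cluster
 *     Set<Set<CharSequence>> clustering
 *         = clusterer.cluster(elements, new Random(42L), null);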
*
 * <b>Feature Parsing to Cluster Objects</b>
 *
 * The elements being clustered are first converted into
 * d-dimensional feature vectors using a feature extractor.  These
 * feature vectors are then evenly distributed among the clusters by
 * random assignment.  Feature extractors may normalize their input
 * in any number of ways.  Run time is dominated by the density of
 * the object vectors.
*
*
 * <b>Centroids</b>
*
*
In k-means, each cluster is modeled by the centroid of the
* feature vectors assigned to it. The centroid of a set of points is
* just the mean of the points, computed by dimension:
*
*
* centroid({v[0],...,v[n-1]}) = (v[0] + ... + v[n-1]) / n
*
 * Centroids are thus located in the same vector space as the
 * objects; namely, they are d-dimensional vectors.  The code
 * represents centroids as dense vectors, and objects as sparse
 * vectors.
*
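 * As a concrete sketch (illustrative only, not the internal
 * representation used by this class), the centroid of a list of
 * dense d-dimensional vectors may be computed as:
 *
 *     static double[] centroid(List<double[]> vectors, int numDimensions) {
 *         double[] mean = new double[numDimensions];
 *         for (double[] v : vectors)
 *             for (int i = 0; i < numDimensions; ++i)
 *                 mean[i] += v[i];
 *         for (int i = 0; i < numDimensions; ++i)
 *             mean[i] /= vectors.size();
 *         return mean;
 *     }
 *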
 * <b>Euclidean Distance</b>
 *
 * Feature vectors are always compared to cluster centroids using
 * squared Euclidean distance, defined by:
 *
 *     distance(x,y)^2 = (x - y) * (x - y)
 *                     = Σ_i (x[i] - y[i])^2
 *
 * The centroid of a set of points is the point that minimizes the
 * sum of squared distances from the points in that set to it.
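 *
 * For example, treating an object vector as a sparse map from
 * dimensions to values (dimensions absent from the map are zero),
 * squared Euclidean distance to a dense centroid may be computed as
 * in this simple, unoptimized sketch:
 *
 *     static double squaredDistance(Map<Integer,Double> x, double[] centroid) {
 *         double dist = 0.0;
 *         for (int i = 0; i < centroid.length; ++i) {
 *             double xi = x.containsKey(i) ? x.get(i) : 0.0;
 *             double diff = xi - centroid[i];
 *             dist += diff * diff;
 *         }
 *         return dist;
 *     }
 *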
 * <b>K-means</b>
 *
 * The k-means algorithm then iteratively improves cluster
 * assignments.  Each epoch consists of two stages, reassignment and
 * recomputing the means.
 *
 * Cluster Assignment: In each epoch, we assign each object to the
 * cluster represented by the closest centroid.  Ties go to the
 * lowest indexed element.
 *
 * Mean Recomputation: At the end of each epoch, the centroids are
 * recomputed as the means of the points assigned to each cluster.
 *
 * Convergence: If no objects change cluster during an iteration,
 * the algorithm has converged and the results are returned.  The
 * algorithm is also considered converged if the relative reduction
 * in error from epoch to epoch falls below a set threshold.
 * Finally, the algorithm returns if the maximum number of epochs
 * has been reached.
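 *
 * The following sketch shows a single epoch over dense vectors,
 * with the assignment step followed by the mean recomputation step
 * (illustrative only; the class itself operates over sparse feature
 * vectors and dense centroids):
 *
 *     static void epoch(double[][] points, double[][] centroids) {
 *         int dims = centroids[0].length;
 *         int[] assignment = new int[points.length];
 *         // assignment: each point goes to its closest centroid,
 *         // with ties going to the lowest-indexed centroid
 *         for (int n = 0; n < points.length; ++n) {
 *             double bestDist = Double.POSITIVE_INFINITY;
 *             for (int k = 0; k < centroids.length; ++k) {
 *                 double dist = 0.0;
 *                 for (int i = 0; i < dims; ++i) {
 *                     double diff = points[n][i] - centroids[k][i];
 *                     dist += diff * diff;
 *                 }
 *                 if (dist < bestDist) {
 *                     bestDist = dist;
 *                     assignment[n] = k;
 *                 }
 *             }
 *         }
 *         // mean recomputation: each centroid becomes the mean of
 *         // the points assigned to it
 *         int[] sizes = new int[centroids.length];
 *         for (double[] centroid : centroids)
 *             Arrays.fill(centroid, 0.0);
 *         for (int n = 0; n < points.length; ++n) {
 *             ++sizes[assignment[n]];
 *             for (int i = 0; i < dims; ++i)
 *                 centroids[assignment[n]][i] += points[n][i];
 *         }
 *         for (int k = 0; k < centroids.length; ++k)
 *             if (sizes[k] > 0)
 *                 for (int i = 0; i < dims; ++i)
 *                     centroids[k][i] /= sizes[k];
 *     }
 *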
 * <b>K-means as Minimization</b>
 *
 * K-means clustering may be viewed as an iterative approach to the
 * minimization of the average squared distance between items and
 * their cluster centers, which is:
 *
 *     Err(cs) = Σ_{c in cs} Σ_{x in c} distance(x,centroid(x))^2
 *
 * where cs is the set of clusters and centroid(x) is the centroid
 * (mean or average) of the cluster containing x.
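 *
 * In the dense-vector setting of the sketches above, this error may
 * be computed directly from the points, centroids, and assignment:
 *
 *     static double error(double[][] points, double[][] centroids,
 *                         int[] assignment) {
 *         double err = 0.0;
 *         for (int n = 0; n < points.length; ++n) {
 *             double[] centroid = centroids[assignment[n]];
 *             for (int i = 0; i < centroid.length; ++i) {
 *                 double diff = points[n][i] - centroid[i];
 *                 err += diff * diff;
 *             }
 *         }
 *         return err;
 *     }
 *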
 * <b>Convergence Guarantees</b>
 *
 * K-means clustering is guaranteed to converge to a local minimum
 * of error because both steps of k-means reduce error.  First,
 * assigning each object to its closest centroid minimizes error
 * given the centroids.  Second, recalculating centroids as the
 * means of the elements assigned to them minimizes error given the
 * clustering.  Given that error is bounded at zero and changes are
 * discrete, k-means must eventually converge.  While there are
 * exponentially many possible clusterings in theory, in practice
 * k-means converges very quickly.
 *
 * <b>Local Minima and Multiple Runs</b>
 *
 * Like the EM algorithm, k-means clustering is highly sensitive to
 * the initial assignment of elements.  In practice, it is often
 * helpful to apply k-means clustering repeatedly to the same input
 * data, returning the clustering with minimum error.
 *
 * At the start of each iteration, the error of the previous
 * assignment is reported (at convergence, this will be the final
 * error).
 *
 * Multiple runs may also be used to provide bootstrap estimates of
 * the relatedness of any two elements.  Bootstrap estimates work by
 * subsampling the elements to cluster with replacement and then
 * running k-means on them.  This is repeated multiple times, and
 * the percentage of runs in which two elements fall in the same
 * cluster forms the bootstrap estimate of the likelihood that they
 * belong in the same cluster.
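 *
 * A minimal sketch of such a bootstrap estimate, assuming a
 * clusterer and element list are already in hand (it keeps both
 * elements of interest in every resample, and duplicates from
 * resampling collapse because the clusterer operates on sets):
 *
 *     static <E> double sameClusterEstimate(KMeansClusterer<E> clusterer,
 *                                           List<E> elements,
 *                                           E e1, E e2,
 *                                           int numRuns, Random random) {
 *         int sameCount = 0;
 *         for (int run = 0; run < numRuns; ++run) {
 *             Set<E> sample = new HashSet<E>();
 *             sample.add(e1);
 *             sample.add(e2);
 *             for (int n = 0; n < elements.size(); ++n)
 *                 sample.add(elements.get(random.nextInt(elements.size())));
 *             Set<Set<E>> clustering = clusterer.cluster(sample, random, null);
 *             for (Set<E> cluster : clustering)
 *                 if (cluster.contains(e1) && cluster.contains(e2))
 *                     ++sameCount;
 *         }
 *         return sameCount / (double) numRuns;
 *     }
 *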
 * <b>Degenerate Solutions</b>
 *
 * In some cases, the iterative approach taken by k-means leads to a
 * solution in which not every cluster is populated.  This happens
 * during a step in which no feature vector is closest to a given
 * centroid.  It is most likely to happen with highly skewed data in
 * high-dimensional spaces.  Sometimes, rerunning the clusterer with
 * a different initialization will find a solution with k clusters.
 *
 * <b>Picking a good K</b>
 *
 * The number of clusters, k, may also be varied.  In this case, new
 * k-means clustering instances must be created, as each uses a
 * fixed number of clusters.
 *
 * By varying k, the maximum number of clusters, the within-cluster
 * scatter may be compared across different choices of k.
 * Typically, a value of k is chosen at a knee of the within-cluster
 * scatter scores.  There are automatic ways of performing this
 * selection, though they are heuristically rather than
 * theoretically motivated.
*
*
 * In practice, it is technically possible, though unlikely, for
 * clusters to wind up with no elements in them.  In this case, this
 * implementation will simply return fewer clusters than the maximum
 * specified.
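 *
 * For example, candidate values of k might be compared by
 * constructing a fresh clusterer per value; the scatter method here
 * is a hypothetical stand-in for whatever within-cluster scatter
 * measure is being compared:
 *
 *     for (int k = 1; k <= 20; ++k) {
 *         KMeansClusterer<CharSequence> clusterer
 *             = new KMeansClusterer<CharSequence>(extractor, k, 100,
 *                                                 true, 0.0001);
 *         Set<Set<CharSequence>> clustering
 *             = clusterer.cluster(elements, new Random(42L), null);
 *         System.out.println(k + " " + scatter(clustering)); // hypothetical
 *     }
 *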
 * <b>K-Means++ Initialization</b>
 *
 * K-means++ is a k-means algorithm that attempts to make a good
 * randomized choice of initial centers.  This requires choosing a
 * diverse set of centers, but not overpopulating the initial set of
 * centers with outliers.  K-means++ has reasonable expected
 * performance bounds in theory, and quite good performance in
 * practice.
 *
 * Suppose we have K clusters and a set X of size at least K.
 * K-means++ chooses a single point c[k] in X as each initial
 * centroid, using the following strategy:
 *
 *     1. Sample the first centroid c[1] randomly from X.
 *
 *     2. For k = 2 to K, sample the next centroid c[k] = x with
 *        probability proportional to D(x)^2,
 *
 * where D(x) is the minimum distance to an existing centroid:
 *
 *     D(x) = min_{k' < k} d(x,c[k'])
 *
 * After initialization, k-means++ proceeds just as traditional
 * k-means clustering.
 *
 * The good expected behavior of k-means++ arises from choosing the
 * next center in such a way that it is far away from existing
 * centers.  Many nearby points in some sense pool their behavior,
 * because the chance that one of them will be picked is the sum of
 * the chances that each will be picked.  Outliers are, by
 * definition, points x with high values of D(x) that are not near
 * other points.
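 *
 * The following sketch illustrates the seeding strategy over dense
 * vectors (illustrative only; the class itself seeds over sparse
 * feature vectors):
 *
 *     static double[][] kMeansPlusPlusInit(double[][] points,
 *                                          int numClusters,
 *                                          Random random) {
 *         double[][] centroids = new double[numClusters][];
 *         centroids[0] = points[random.nextInt(points.length)];
 *         double[] minDist = new double[points.length];  // D(x)^2 so far
 *         Arrays.fill(minDist, Double.POSITIVE_INFINITY);
 *         for (int k = 1; k < numClusters; ++k) {
 *             double sum = 0.0;
 *             for (int n = 0; n < points.length; ++n) {
 *                 double dist = 0.0;
 *                 for (int i = 0; i < points[n].length; ++i) {
 *                     double diff = points[n][i] - centroids[k-1][i];
 *                     dist += diff * diff;
 *                 }
 *                 if (dist < minDist[n]) minDist[n] = dist;
 *                 sum += minDist[n];
 *             }
 *             // sample the next centroid with probability proportional to D(x)^2
 *             double target = random.nextDouble() * sum;
 *             int choice = 0;
 *             double cumulative = minDist[0];
 *             while (cumulative < target && choice + 1 < points.length)
 *                 cumulative += minDist[++choice];
 *             centroids[k] = points[choice];
 *         }
 *         return centroids;
 *     }
 *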
 * <b>Relation to Gaussian Mixtures and Expectation/Maximization</b>
 *
 * The k-means clustering algorithm is implicitly based on a
 * multi-dimensional Gaussian with independent dimensions with
 * shared means (the centroids) and equal variances.  Estimates are
 * carried out by maximum likelihood; that is, no prior, or
 * equivalently the fully uninformative prior, is used.  Where
 * k-means differs from standard expectation/maximization (EM) is
 * that k-means reweights the expectations so that the closest
 * centroid gets an expectation of 1.0 and all other centroids get
 * an expectation of 0.0.  This approach has been called
 * "winner-take-all" EM in the EM literature.
 *
 * <b>Implementation Notes</b>
 *
 * The current implementation conserves some computations versus the
 * brute-force approach by (1) only computing vector products in
 * comparing two vectors, and (2) not recomputing distances if we
 * know the ...
 *
 * <b>References</b>
 *
 * - MacQueen, J. B. 1967. Some methods for classification and
 *   analysis of multivariate observations. Proceedings of the Fifth
 *   Berkeley Symposium on Mathematical Statistics and Probability.
 *   University of California Press.
 *
 * - Andrew Moore's k-means tutorial, including most of the
 *   mathematics.
 *
 * - Matteo Matteucci's k-means tutorial, including a very nice
 *   interactive servlet demo.
 *
 * - Hastie, T., R. Tibshirani, and J. H. Friedman. 2001. The
 *   Elements of Statistical Learning. Springer-Verlag.
 *
 * - Wikipedia: k-means algorithm.
 *
 * - Arthur, David and Sergei Vassilvitskii. 2007. k-means++: The
 *   advantages of careful seeding. SODA 2007.
 *
 */
public class KMeansClusterer<E> implements Clusterer<E> {

    private int mMaxNumClusters;
    private int mMaxEpochs;

/**
 * Construct a k-means clusterer with the specified feature
 * extractor, number of clusters, and maximum number of epochs.
 *
 * If the number of epochs is set to zero, the result
 * will be a random balanced clustering of the specified size.
*
* @param featureExtractor Feature extractor for this clusterer.
* @param numClusters Number of clusters to return.
* @param maxEpochs Maximum number of epochs during
* optimization.
* @throws IllegalArgumentException If the number of clusters is
* less than 1, or if the maximum number of epochs is less
* than 0.
*/
KMeansClusterer(FeatureExtractor<E> featureExtractor,
                int numClusters,
                int maxEpochs) {
    // ... (constructor body elided in this listing)
}

/**
 * Construct a k-means clusterer with the specified feature
 * extractor, number of clusters, maximum number of epochs,
 * initialization strategy, and minimum relative improvement per
 * epoch.
 *
 * If the kMeansPlusPlus flag is set to {@code true}, the
* k-means++ initialization strategy is used. If it is set to
 * false, initialization is handled by a random shuffling of all
 * elements into clusters.
*
* If the number of epochs is set to zero, the result will be a
 * random balanced clustering of the specified size.
*
* @param featureExtractor Feature extractor for this clusterer.
* @param numClusters Number of clusters to return.
 * @param maxEpochs Maximum number of epochs during optimization.
 * @param kMeansPlusPlus Set to {@code true} to use k-means++
 * initialization.
* @param minImprovement Minimum relative improvement in squared
* distance scatter to keep going to the next epoch.
* @throws IllegalArgumentException If the number of clusters is
* less than 1, if the maximum number of epochs is less than 0, or
* if the minimum improvement is not a finite, non-negative number.
*/
public KMeansClusterer(FeatureExtractor<E> featureExtractor,
                       int numClusters,
                       int maxEpochs,
                       boolean kMeansPlusPlus,
                       double minImprovement) {
    // ... (constructor body elided in this listing)
}

/**
 * Returns the number of clusters this clusterer is configured to
 * produce.  Clustering fewer elements will result in fewer clusters.
*
* @return The number of clusters this clusterer will return.
*/
public int numClusters() {
return mMaxNumClusters;
}
/**
* Returns the maximum number of epochs for this clusterer.
*
* @return The maximum number of epochs.
*/
public int maxEpochs() {
return mMaxEpochs;
}
/**
* Return a k-means clustering of the specified set of elements
* using a freshly generated random number generator without
* intermediate reporting. The feature extractor, maximum number of
 * epochs, number of clusters, minimum relative improvement,
* and whether to use k-means++ initialization are defined in the
* class.
*
* This is just a utility method implementing the
* {@link Clusterer} interface. A call to
* {@code cluster(elementSet)} produces the same result
* as {@code cluster(elementSet,new Random(),null)}.
*
* See the class documentation above for more information.
*
* @param elementSet Set of elements to cluster.
* @return Clustering of the specified elements.
*/
public Set<Set<E>> cluster(Set<? extends E> elementSet) {
    return cluster(elementSet, new Random(), null);
}

/**
 * Return a k-means clustering of the specified set of elements
 * using the specified random number generator, sending progress
 * reports to the specified reporter.
 *
 * The reason this is a separate method is that typical
* implementations of {@link Random} are not thread safe,
* and rarely should reports from different clustering runs
* be interleaved to a reporter.
*
* Using a fixed random number generator (e.g. by using the
 * same seed for {@link Random}) will produce the same clustering,
 * which can be useful for replicating tests.
*
* See the class documentation above for more information.
*
 * @param elementSet Set of elements to cluster.
 * @param random Random number generator.
* @param reporter Reporter to which progress reports are sent,
* or {@code null} if no reporting is required.
*/
public Set<Set<E>> cluster(Set<? extends E> elementSet,
                           Random random,
                           Reporter reporter) {
    // ... (method body elided in this listing)
}

/**
 * Return a new k-means clustering of the elements in the specified
 * clustering, starting from the clusters provided.
 *
 * This method allows users to specify their own
* initial clusterings, which are then reallocated using the
* standard k-means algorithm.
*
* The number of clusters produced will be the size of the
* initial clustering, which may not match the number of clusters
* defined in the constructor.
*
* @param clustering Clustering to recluster.
* @param maxEpochs Maximum number of reclustering epochs.
* @return New clustering of input elements.
* @throws IllegalArgumentException If there are empty clusters
* in the clustering or if an element belongs to more than
* one cluster.
*/
Set<Set<E>> recluster(Set<Set<E>> clustering,
                      int maxEpochs) {
    // ... (method body elided in this listing)
}

/**
 * ... the "k" in
 * "k-means".
 *
 *