/*
 * LingPipe v. 4.1.0
 * Copyright (C) 2003-2011 Alias-i
 *
 * This program is licensed under the Alias-i Royalty Free License
 * Version 1 WITHOUT ANY WARRANTY, without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the Alias-i
 * Royalty Free License Version 1 for more details.
 *
 * You should have received a copy of the Alias-i Royalty Free License
 * Version 1 along with this program; if not, visit
 * http://alias-i.com/lingpipe/licenses/lingpipe-license-1.txt or contact
 * Alias-i, Inc. at 181 North 11th Street, Suite 401, Brooklyn, NY 11211,
 * +1 (718) 290-9170.
 */

package com.aliasi.cluster;

import com.aliasi.io.LogLevel;
import com.aliasi.io.Reporter;
import com.aliasi.io.Reporters;

import com.aliasi.stats.Statistics;

import com.aliasi.symbol.MapSymbolTable;

// import com.aliasi.util.Arrays;
import com.aliasi.util.Distance;
import com.aliasi.util.FeatureExtractor;
import com.aliasi.util.ObjectToDoubleMap;
import com.aliasi.util.SmallSet;

import java.util.Arrays;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;


/**
 * A KMeansClusterer provides an implementation of
 * k-means(++) clustering based on vectors constructed by feature
 * extractors.  An instance fixes a specific value of K, the
 * number of clusters returned.  Initialization may be either by
 * the traditional k-means random assignment, or by the k-means++
 * initialization strategy. 
 *
 * <p>This clustering class is defined so as to be able to cluster
 * arbitrary objects.  These objects are converted to (sparse) vectors
 * using a feature extractor specified during construction.
 *
 * <h3>Feature Parsing to Cluster Objects</h3>
 *
 * <p>The elements being clustered are first converted into
 * d-dimensional feature vectors using a feature extractor.  These
 * feature vectors are then evenly distributed among the clusters
 * using random assignment.  Feature extractors may normalize their
 * input in any number of ways.  Run time is dominated by the density
 * of the object vectors.
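 *
 * <p>As an illustration (a minimal sketch; the character-count
 * feature extractor, the inputs, and all parameter values here are
 * hypothetical, not part of the library), strings may be clustered
 * by their character counts:
 *
 * <blockquote><pre>{@code
 * FeatureExtractor<String> extractor = new FeatureExtractor<String>() {
 *     public Map<String,? extends Number> features(String in) {
 *         Map<String,Integer> feats = new HashMap<String,Integer>();
 *         for (char c : in.toCharArray()) {
 *             String key = Character.toString(c);
 *             Integer count = feats.get(key);
 *             feats.put(key, count == null ? 1 : count + 1);
 *         }
 *         return feats;
 *     }
 * };
 * KMeansClusterer<String> clusterer
 *     = new KMeansClusterer<String>(extractor, 2, 100, true, 0.0001);
 * Set<String> inputs = new HashSet<String>(
 *     Arrays.asList("aaab", "aab", "zzzy", "zzy"));
 * Set<Set<String>> clustering = clusterer.cluster(inputs);
 * }</pre></blockquote>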

 * <h3>Centroids</h3>
 *
 * <p>In k-means, each cluster is modeled by the centroid of the
 * feature vectors assigned to it.  The centroid of a set of points
 * is just the mean of the points, computed by dimension:
 *
 * <blockquote><pre>
 * centroid({v[0],...,v[n-1]}) = (v[0] + ... + v[n-1]) / n
 * </pre></blockquote>
 *
 * <p>Centroids are thus located in the same vector space as the
 * objects, namely they are d-dimensional vectors.  The code
 * represents centroids as dense vectors, and objects as sparse
 * vectors.

 * <h3>Euclidean Distance</h3>
 *
 * <p>Feature vectors are always compared to cluster centroids
 * using squared Euclidean distance, defined by:
 *
 * <blockquote><pre>
 * distance(x,y)<sup>2</sup> = (x - y) * (x - y)
 *               = &Sigma;<sub>i</sub> (x[i] - y[i])<sup>2</sup>
 * </pre></blockquote>
 *
 * <p>The centroid of a set of points is the point that minimizes
 * the sum of squared distances to the points in that set.
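 *
 * <p>Because object vectors are sparse, the implementation expands
 * the square and computes x*x + y*y - 2*(x*y), so only the non-zero
 * dimensions of an object vector are ever touched.  A minimal
 * standalone sketch of this computation (plain Java; the actual
 * class uses equivalent package-private helpers):
 *
 * <blockquote><pre>{@code
 * // squared distance from sparse x = (features,values) to dense centroid y
 * static double sqDistance(int[] features, double[] values, double[] centroid) {
 *     double xx = 0.0;
 *     for (double v : values) xx += v * v;          // x * x
 *     double yy = 0.0;
 *     for (double c : centroid) yy += c * c;        // y * y
 *     double xy = 0.0;
 *     for (int i = 0; i < features.length; ++i)
 *         xy += values[i] * centroid[features[i]];  // x * y
 *     return xx + yy - 2.0 * xy;
 * }
 * }</pre></blockquote>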

 * <h3>K-means</h3>
 *
 * <p>The k-means algorithm then iteratively improves cluster
 * assignments.  Each epoch consists of two stages, reassignment
 * and recomputing the means.
 *
 * <p><b>Cluster Assignment:</b> In each epoch, we assign each
 * object to the cluster represented by the closest centroid.  Ties
 * go to the cluster with the lowest index.
 *
 * <p><b>Mean Recomputation:</b> At the end of each epoch, the
 * centroids are recomputed as the means of the points assigned to
 * each cluster.
 *
 * <p><b>Convergence:</b> If no objects change cluster during an
 * epoch, the algorithm has converged and the results are returned.
 * We also consider the algorithm converged if the relative
 * reduction in error from epoch to epoch falls below a set
 * threshold.  The algorithm will also return if the maximum number
 * of epochs has been reached.
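 *
 * <p>One epoch may be sketched as follows over dense vectors (an
 * illustration of the two stages only; the implementation below
 * works over sparse vectors and caches squared centroid lengths):
 *
 * <blockquote><pre>{@code
 * // one k-means epoch: reassign points, then recompute the means;
 * // returns true if any assignment changed
 * static boolean epoch(double[][] points, double[][] centroids, int[] assignment) {
 *     boolean changed = false;
 *     for (int i = 0; i < points.length; ++i) {
 *         int best = 0;
 *         double bestDist = Double.POSITIVE_INFINITY;
 *         for (int k = 0; k < centroids.length; ++k) {
 *             double dist = 0.0;
 *             for (int j = 0; j < points[i].length; ++j) {
 *                 double diff = points[i][j] - centroids[k][j];
 *                 dist += diff * diff;
 *             }
 *             if (dist < bestDist) { bestDist = dist; best = k; }
 *         }
 *         if (best != assignment[i]) { assignment[i] = best; changed = true; }
 *     }
 *     int[] counts = new int[centroids.length];
 *     for (double[] c : centroids) java.util.Arrays.fill(c, 0.0);
 *     for (int i = 0; i < points.length; ++i) {
 *         ++counts[assignment[i]];
 *         for (int j = 0; j < points[i].length; ++j)
 *             centroids[assignment[i]][j] += points[i][j];
 *     }
 *     for (int k = 0; k < centroids.length; ++k)
 *         if (counts[k] > 0)
 *             for (int j = 0; j < centroids[k].length; ++j)
 *                 centroids[k][j] /= counts[k];
 *     return changed;
 * }
 * }</pre></blockquote>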

 * <h3>K-means as Minimization</h3>
 *
 * <p>K-means clustering may be viewed as an iterative approach to
 * the minimization of the average squared distance between items
 * and their cluster centers, which is:
 *
 * <blockquote><pre>
 * Err(cs) = &Sigma;<sub>c in cs</sub> &Sigma;<sub>x in c</sub> distance(x,centroid(x))<sup>2</sup>
 * </pre></blockquote>
 *
 * <p>where {@code cs} is the set of clusters and {@code
 * centroid(x)} is the centroid (mean or average) of the cluster
 * containing {@code x}.
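 *
 * <p>A direct transcription of this objective, over the dense
 * representation used in the epoch sketch above (a hypothetical
 * helper, not part of the class):
 *
 * <blockquote><pre>{@code
 * // total squared distance from points to their assigned centroids
 * static double error(double[][] points, double[][] centroids, int[] assignment) {
 *     double err = 0.0;
 *     for (int i = 0; i < points.length; ++i) {
 *         double[] c = centroids[assignment[i]];
 *         for (int j = 0; j < points[i].length; ++j) {
 *             double diff = points[i][j] - c[j];
 *             err += diff * diff;
 *         }
 *     }
 *     return err;
 * }
 * }</pre></blockquote>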

 * <h3>Convergence Guarantees</h3>
 *
 * <p>K-means clustering is guaranteed to converge to a local
 * minimum of error because both steps of k-means reduce error.
 * First, assigning each object to its closest centroid minimizes
 * error given the centroids.  Second, recalculating centroids as
 * the means of the elements assigned to them minimizes error given
 * the clustering.  Given that error is bounded at zero, and changes
 * in assignment are discrete, k-means must eventually converge.
 * While there are exponentially many possible clusterings in
 * theory, in practice k-means converges very quickly.

 * <h3>Local Minima and Multiple Runs</h3>
 *
 * <p>Like the EM algorithm, k-means clustering is highly sensitive
 * to the initial assignment of elements.  In practice, it is often
 * helpful to apply k-means clustering repeatedly to the same input
 * data, returning the clustering with minimum error.
 *
 * <p>At the start of each iteration, the error of the previous
 * assignment is reported (at convergence, this will be the final
 * error).
 *
 * <p>Multiple runs may be used to provide bootstrap estimates of
 * the relatedness of any two elements.  Bootstrap estimates work by
 * subsampling the elements to cluster with replacement and then
 * running k-means on them.  This is repeated multiple times, and
 * the percentage of runs in which two elements fall in the same
 * cluster forms the bootstrap estimate of the likelihood that they
 * are in the same cluster.
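 *
 * <p>A minimal sketch of such a bootstrap estimate (illustrative
 * only; this helper is not part of the library, and note that
 * because {@code cluster()} takes a set, duplicates drawn by the
 * resampling collapse):
 *
 * <blockquote><pre>{@code
 * // estimate the chance that a and b land in the same cluster
 * static <E> double bootstrapSameCluster(KMeansClusterer<E> clusterer,
 *                                        List<E> elements, E a, E b,
 *                                        int numRuns, Random random) {
 *     int together = 0;
 *     for (int run = 0; run < numRuns; ++run) {
 *         Set<E> sample = new HashSet<E>();
 *         for (int i = 0; i < elements.size(); ++i)
 *             sample.add(elements.get(random.nextInt(elements.size())));
 *         sample.add(a);
 *         sample.add(b); // ensure the pair is present in every run
 *         for (Set<E> cluster : clusterer.cluster(sample, random, null))
 *             if (cluster.contains(a) && cluster.contains(b))
 *                 ++together;
 *     }
 *     return together / (double) numRuns;
 * }
 * }</pre></blockquote>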

 * <h3>Degenerate Solutions</h3>
 *
 * <p>In some cases, the iterative approach taken by k-means leads
 * to a solution in which not every cluster is populated.  This
 * happens during a step in which no feature vector is closest to a
 * given centroid.  This is most likely to happen in highly skewed
 * data in high dimensional spaces.  Sometimes, rerunning the
 * clusterer with a different initialization will find a solution
 * with k clusters.

 * <h3>Picking a good K</h3>
 *
 * <p>The number of clusters, k, may also be varied.  In this case,
 * new k-means clustering instances must be created, as each uses a
 * fixed number of clusters.
 *
 * <p>By varying k, the maximum number of clusters, the
 * within-cluster scatter may be compared across different choices
 * of k.  Typically, a value of k is chosen at a knee of the
 * within-cluster scatter scores.  There are automatic ways of
 * performing this selection, though they are heuristically rather
 * than theoretically motivated.
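 *
 * <p>For instance (a sketch; {@code extractor} and {@code inputs}
 * are as in the class example above, and {@code scatter()} is a
 * hypothetical user-supplied helper computing the within-cluster
 * scatter, as in Err(cs) above), one might scan candidate values
 * of k and look for the knee by eye:
 *
 * <blockquote><pre>{@code
 * // scatter(...) is hypothetical: total within-cluster squared distance
 * for (int k = 1; k <= 10; ++k) {
 *     KMeansClusterer<String> clusterer
 *         = new KMeansClusterer<String>(extractor, k, 100, true, 0.0001);
 *     Set<Set<String>> clustering = clusterer.cluster(inputs);
 *     System.out.println("k=" + k + " scatter=" + scatter(clustering));
 * }
 * }</pre></blockquote>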

 * <p>In practice, it is technically possible, though unlikely, for
 * clusters to wind up with no elements in them.  This
 * implementation will simply return fewer clusters than the
 * maximum specified in this case.

 * <h3>K-Means++ Initialization</h3>
 *
 * <p>K-means++ is a k-means algorithm that attempts to make a good
 * randomized choice of initial centers.  This requires choosing a
 * diverse set of centers, but not overpopulating the initial set of
 * centers with outliers.  K-means++ has reasonable expected
 * performance bounds in theory, and quite good performance in
 * practice.
 *
 * <p>Suppose we have K clusters and a set X of size at least K.
 * K-means++ chooses a single point c[k] in X as each initial
 * centroid, using the following strategy:

 * <blockquote><pre>
 * 1.  Sample the first centroid c[1] randomly from X.
 *
 * 2.  For k = 2 to K
 *
 *        Sample the next centroid c[k] = x
 *        with probability proportional to D(x)<sup>2</sup>
 * </pre></blockquote>
 *
 * <p>where D(x) is the minimum distance to an existing centroid:
 *
 * <blockquote><pre>
 * D(x) = min<sub>k' &lt; k</sub> d(x,c[k'])
 * </pre></blockquote>
 *
 * <p>After initialization, k-means++ proceeds just as traditional
 * k-means clustering.
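 *
 * <p>The D(x)<sup>2</sup>-weighted draw amounts to sampling from a
 * discrete distribution whose unnormalized weights are the squared
 * distances.  A standalone sketch (the private implementation
 * below follows the same pattern):
 *
 * <blockquote><pre>{@code
 * // sample an index with probability proportional to weights[i]
 * static int sampleProportional(double[] weights, Random random) {
 *     double total = 0.0;
 *     for (double w : weights) total += w;
 *     double samplePoint = random.nextDouble() * total;
 *     double sum = 0.0;
 *     for (int i = 0; i < weights.length; ++i) {
 *         sum += weights[i];
 *         if (sum >= samplePoint) return i;
 *     }
 *     return weights.length - 1; // guard against rounding overrun
 * }
 * }</pre></blockquote>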

 * <p>The good expected behavior of k-means++ arises from choosing
 * the next center in such a way that it is far away from existing
 * centers.  Many nearby points in some sense pool their behavior,
 * because the chance that one of them will be picked is the sum of
 * the chances that each will be picked.  Outliers are, by
 * definition, points x with high values of D(x) which are not near
 * other points.

 * <h3>Relation to Gaussian Mixtures and Expectation/Maximization</h3>
 *
 * <p>The k-means clustering algorithm is implicitly based on a
 * multi-dimensional Gaussian with independent dimensions with
 * shared means (the centroids) and equal variances.  Estimates are
 * carried out by maximum likelihood; that is, no prior, or
 * equivalently the fully uninformative prior, is used.  Where
 * k-means differs from standard expectation/maximization (EM) is
 * that k-means reweights the expectations so that the closest
 * centroid gets an expectation of 1.0 and all other centroids get
 * an expectation of 0.0.  This approach has been called
 * "winner-take-all" EM in the EM literature.

 * <h3>Implementation Notes</h3>
 *
 * <p>The current implementation conserves some computations versus
 * the brute-force approach by (1) only computing vector products in
 * comparing two vectors, and (2) not recomputing an element's
 * distance to a centroid when we know that the centroid did not
 * change in the previous epoch.

 * <h3>References</h3>
 *
 * <ul>
 *
 * <li>MacQueen, J. B. 1967. Some methods for classification and
 * analysis of multivariate observations. <i>Proceedings of the Fifth
 * Berkeley Symposium on Mathematical Statistics and
 * Probability</i>. University of California Press.</li>
 *
 * <li>Andrew Moore's k-means tutorial, which includes most of the
 * mathematics.</li>
 *
 * <li>Matteo Matteucci's k-means tutorial, which includes a very
 * nice interactive servlet demo.</li>
 *
 * <li>Hastie, T., R. Tibshirani and J. H. Friedman. 2001. <i>The
 * Elements of Statistical Learning</i>. Springer-Verlag.</li>
 *
 * <li>Wikipedia: K-means algorithm.</li>
 *
 * <li>Arthur, David and Sergei Vassilvitskii. 2007. k-means++: the
 * advantages of careful seeding. <i>SODA 2007</i>.</li>
 *
 * </ul>
 *
 * @author  Bob Carpenter
 * @version 4.0.0
 * @since   LingPipe2.0
 * @param <E> the type of objects being clustered
 */
public class KMeansClusterer<E> implements Clusterer<E> {

    final FeatureExtractor<? super E> mFeatureExtractor;
    final int mMaxNumClusters;
    final int mMaxEpochs;
    final boolean mKMeansPlusPlus;
    final double mMinRelativeImprovement;

    /**
     * Construct a k-means clusterer with the specified feature
     * extractor, number of clusters, and limit on the number of
     * epochs to run optimization.  Initialization of each cluster
     * is by random shuffling of all elements into a cluster.
     *
     * <p>If the number of epochs is set to zero, the result will be
     * a random balanced clustering of the specified size.
     *
     * @param featureExtractor Feature extractor for this clusterer.
     * @param numClusters Number of clusters to return.
     * @param maxEpochs Maximum number of epochs during optimization.
     * @throws IllegalArgumentException If the number of clusters is
     * less than 1, or if the maximum number of epochs is less than
     * 0.
     */
    KMeansClusterer(FeatureExtractor<? super E> featureExtractor,
                    int numClusters,
                    int maxEpochs) {
        this(featureExtractor,numClusters,maxEpochs,false,0.0);
    }

    /**
     * Construct a k-means clusterer with the specified feature
     * extractor, number of clusters, and minimum improvement per
     * epoch, using either traditional or k-means++ initialization.
     *
     * <p>If the {@code kMeansPlusPlus} flag is set to {@code true},
     * the k-means++ initialization strategy is used.  If it is set
     * to {@code false}, initialization of each cluster is handled
     * by a random shuffling of all elements into a cluster.
     *
     * <p>If the number of epochs is set to zero, the result will be
     * a random balanced clustering of the specified size.
     *
     * @param featureExtractor Feature extractor for this clusterer.
     * @param numClusters Number of clusters to return.
     * @param maxEpochs Maximum number of epochs during optimization.
     * @param kMeansPlusPlus Set to {@code true} to use k-means++
     * initialization.
     * @param minImprovement Minimum relative improvement in squared
     * distance scatter required to keep going to the next epoch.
     * @throws IllegalArgumentException If the number of clusters is
     * less than 1, if the maximum number of epochs is less than 0,
     * or if the minimum improvement is not a finite, non-negative
     * number.
     */
    public KMeansClusterer(FeatureExtractor<? super E> featureExtractor,
                           int numClusters,
                           int maxEpochs,
                           boolean kMeansPlusPlus,
                           double minImprovement) {
        if (numClusters < 1) {
            String msg = "Number of clusters must be positive."
                + " Found numClusters=" + numClusters;
            throw new IllegalArgumentException(msg);
        }
        if (maxEpochs < 0) {
            String msg = "Number of epochs must be non-negative."
                + " Found maxEpochs=" + maxEpochs;
            throw new IllegalArgumentException(msg);
        }
        if (minImprovement < 0.0 || Double.isNaN(minImprovement)) {
            String msg = "Minimum improvement must be non-negative."
                + " Found minImprovement=" + minImprovement;
            throw new IllegalArgumentException(msg);
        }
        mFeatureExtractor = featureExtractor;
        mMaxNumClusters = numClusters;
        mMaxEpochs = maxEpochs;
        mKMeansPlusPlus = kMeansPlusPlus;
        mMinRelativeImprovement = minImprovement;
    }

    /**
     * Returns the feature extractor for this clusterer.
     *
     * @return The feature extractor for this clusterer.
     */
    public FeatureExtractor<? super E> featureExtractor() {
        return mFeatureExtractor;
    }

    /**
     * Returns the maximum number of clusters this clusterer will
     * return.  This is the "k" in "k-means".
     *
     * <p>Clustering fewer elements will result in fewer clusters.
     *
     * @return The number of clusters this clusterer will return.
     */
    public int numClusters() {
        return mMaxNumClusters;
    }

    /**
     * Returns the maximum number of epochs for this clusterer.
     *
     * @return The maximum number of epochs.
     */
    public int maxEpochs() {
        return mMaxEpochs;
    }

    /**
     * Return a k-means clustering of the specified set of elements
     * using a freshly generated random number generator without
     * intermediate reporting.  The feature extractor, maximum
     * number of epochs, number of clusters, minimum relative
     * improvement, and whether to use k-means++ initialization are
     * defined in the class.
     *
     * <p>This is just a utility method implementing the {@link
     * Clusterer} interface.  A call to {@code cluster(elementSet)}
     * produces the same result as {@code cluster(elementSet,new
     * Random(),null)}.
     *
     * <p>See the class documentation above for more information.
     *
     * @param elementSet Set of elements to cluster.
     * @return Clustering of the specified elements.
     */
    public Set<Set<E>> cluster(Set<? extends E> elementSet) {
        return cluster(elementSet,new Random(),null);
    }

    /**
     * Return the k-means clustering of the specified set of
     * elements, using the specified random number generator and
     * sending progress reports to the specified reporter.  The
     * feature extractor, maximum number of epochs, number of
     * clusters, minimum relative improvement, and whether to use
     * k-means++ initialization are defined in the class.
     *
     * <p>The reason this is a separate method is that typical
     * implementations of {@link Random} are not thread safe, and
     * reports from different clustering runs should rarely be
     * interleaved to a single reporter.
     *
     * <p>Using a fixed random number generator (e.g. by using the
     * same seed for {@link Random}) will result in the same
     * resulting clustering, which can be useful for replicating
     * tests.
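     *
     * <p>For instance (a sketch; {@code extractor} and {@code
     * inputs} are the hypothetical feature extractor and input set
     * from the class documentation above), two runs seeded
     * identically produce identical clusterings:
     *
     * <blockquote><pre>{@code
     * KMeansClusterer<String> clusterer
     *     = new KMeansClusterer<String>(extractor, 2, 100, true, 0.0001);
     * Set<Set<String>> run1 = clusterer.cluster(inputs, new Random(42L), null);
     * Set<Set<String>> run2 = clusterer.cluster(inputs, new Random(42L), null);
     * // run1.equals(run2): the same seed drives the same initialization
     * }</pre></blockquote>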

     *
     * <p>See the class documentation above for more information.
     *
     * @param elementSet Set of elements to cluster.
     * @param random Random number generator.
     * @param reporter Reporter to which progress reports are sent,
     * or {@code null} if no reporting is required.
     * @return Clustering of the specified elements.
     */
    public Set<Set<E>> cluster(Set<? extends E> elementSet,
                               Random random,
                               Reporter reporter) {
        if (reporter == null)
            reporter = Reporters.silent();
        final int numElements = elementSet.size();
        final int numClusters = mMaxNumClusters;
        reporter.report(LogLevel.INFO, "#Elements=" + numElements);
        reporter.report(LogLevel.INFO, "#Clusters=" + numClusters);
        if (numElements <= numClusters) {
            reporter.report(LogLevel.INFO,
                            "Returning trivial clustering due to #elements <= #clusters");
            return trivialClustering(elementSet);
        }
        @SuppressWarnings("unchecked")
        final E[] elements = (E[]) elementSet.toArray(new Object[0]);
        reporter.report(LogLevel.DEBUG,"Converting inputs to sparse vectors");
        final int[][] featuress = new int[numElements][];
        final double[][] valss = new double[numElements][];
        final double[] eltSqLengths = new double[numElements];
        MapSymbolTable symTab = toVectors(elements,featuress,valss,eltSqLengths);
        int numDims = symTab.numSymbols();
        reporter.report(LogLevel.INFO,"#Dimensions=" + numDims);
        final double[][] centroidss = new double[numClusters][numDims];
        final int[] closestCenters = new int[numElements];
        final double[] sqDistToCenters = new double[numElements];
        if (mKMeansPlusPlus) {
            reporter.report(LogLevel.INFO,"K-Means++ Initialization");
            kmeansPlusPlusInit(featuress,valss,eltSqLengths,
                               closestCenters,
                               centroidss,
                               random);
        } else {
            reporter.report(LogLevel.INFO,"K-Means Random Initialization");
            randomInit(featuress,valss,
                       closestCenters,
                       centroidss,
                       random);
        }
        return kMeansEpochs(elements,eltSqLengths,
                            centroidss,
                            featuress,valss,
                            sqDistToCenters,closestCenters,
                            mMaxEpochs,reporter);
    }

    /**
     * Returns the minimum reduction in relative error required to
     * continue to the next epoch during clustering.
     *
     * @return The minimum improvement per epoch.
     */
    public double minRelativeImprovement() {
        return mMinRelativeImprovement;
    }

    /**
     * Recluster the specified initial clustering, adding in the
     * unclustered elements, reporting progress to the specified
     * reporter.  The number of clusters is set by the size of the
     * initial clustering as measured by the number of non-empty
     * clusters it contains.  This clusterer's maximum number of
     * epochs and minimum improvement will be used.
     *
     * @param initialClustering Initialization of clustering.
     * @param unclusteredElements Elements that have not been clustered.
     * @param reporter Reporter to which to send progress reports,
     * or {@code null} for no reporting.
     * @throws IllegalArgumentException If there are empty clusters
     * in the clustering or if an element belongs to more than one
     * cluster.
     * @return The reclustering.
     */
    public Set<Set<E>> recluster(Set<Set<E>> initialClustering,
                                 Set<E> unclusteredElements,
                                 Reporter reporter) {
        return recluster(initialClustering,unclusteredElements,
                         mMaxEpochs,reporter);
    }

    /**
     * Recluster the specified clustering using up to the specified
     * number of k-means epochs with no reporting.
     *
     * <p>This method allows users to specify their own initial
     * clusterings, which are then reallocated using the standard
     * k-means algorithm.
     *
     * <p>The number of clusters produced will be the size of the
     * initial clustering, which may not match the number of
     * clusters defined in the constructor.
     *
     * @param clustering Clustering to recluster.
     * @param maxEpochs Maximum number of reclustering epochs.
     * @return New clustering of input elements.
     * @throws IllegalArgumentException If there are empty clusters
     * in the clustering or if an element belongs to more than one
     * cluster.
     */
    Set<Set<E>> recluster(Set<Set<E>> clustering,
                          int maxEpochs) {
        return recluster(clustering,
                         SmallSet.<E>create(),
                         maxEpochs,
                         null);
    }

    private Set<Set<E>> recluster(Set<Set<E>> clustering,
                                  Set<E> unclusteredElements,
                                  int maxEpochs,
                                  Reporter reporter) {
        if (reporter == null)
            reporter = Reporters.silent();
        reporter.report(LogLevel.INFO, "Reclustering");
        int numClusters = clustering.size();
        reporter.report(LogLevel.INFO, "# Clusters=" + numClusters);
        Set<E> elementSet = new HashSet<E>();
        for (Set<E> cluster : clustering) {
            for (E e : cluster) {
                if (!elementSet.add(e)) {
                    String msg = "An element must not be in two clusters."
                        + " Found an element in two clusters."
                        + " Element=" + e;
                    throw new IllegalArgumentException(msg);
                }
            }
        }
        int numClusteredElements = elementSet.size();
        for (E e : unclusteredElements) {
            if (!elementSet.add(e)) {
                String msg = "An element may not be in a cluster and unclustered."
                    + " Found unclustered element in a cluster."
                    + " Element=" + e;
                throw new IllegalArgumentException(msg);
            }
        }
        int numElements = elementSet.size();
        reporter.report(LogLevel.INFO, "# Clustered Elements=" + numClusteredElements);
        reporter.report(LogLevel.INFO, "# Unclustered Elements=" + unclusteredElements.size());
        reporter.report(LogLevel.INFO, "# Elements Total=" + numElements);
        @SuppressWarnings("unchecked")
        E[] elements = (E[]) new Object[numElements];
        int i = 0;
        for (Set<E> cluster : clustering)
            for (E e : cluster)
                elements[i++] = e;
        for (E e : unclusteredElements)
            elements[i++] = e;
        reporter.report(LogLevel.DEBUG, "Converting to vectors");
        // cut and paste from main clustering
        final int[][] featuress = new int[numElements][];
        final double[][] valss = new double[numElements][];
        final double[] eltSqLengths = new double[numElements];
        MapSymbolTable symTab = toVectors(elements,featuress,valss,eltSqLengths);
        int numDims = symTab.numSymbols();
        reporter.report(LogLevel.INFO,"#Dimensions=" + numDims);
        double[][] centroidss = new double[numClusters][numDims];
        int[] closestCenters = new int[numElements];
        i = 0;
        int k = 0;
        for (Set<E> cluster : clustering) {
            double[] centroidK = centroidss[k];
            for (E e : cluster) {
                closestCenters[i] = k;
                increment(centroidK,featuress[i],valss[i]);
                ++i;
            }
            ++k;
        }
        double[] sqDistToCenters = new double[numElements];
        Arrays.fill(sqDistToCenters,Double.POSITIVE_INFINITY);
        // reassign everyone
        for (k = 0; k < numClusters; ++k) {
            double[] centroidK = centroidss[k];
            double centroidSqLength = selfProduct(centroidss[k]);
            for (i = 0; i < numElements; ++i) {
                double sqDistToCenter
                    = centroidSqLength
                    + eltSqLengths[i]
                    - 2.0 * product(centroidK,featuress[i],valss[i]);
                if (sqDistToCenter < sqDistToCenters[i]) {
                    sqDistToCenters[i] = sqDistToCenter;
                    closestCenters[i] = k;
                }
            }
        }
        for (double[] centroid : centroidss)
            Arrays.fill(centroid,0.0);
        setCentroids(centroidss,featuress,valss,closestCenters);
        return kMeansEpochs(elements,eltSqLengths,
                            centroidss,
                            featuress,valss,
                            sqDistToCenters,closestCenters,
                            maxEpochs,reporter);
    }

    private Set<Set<E>> kMeansEpochs(E[] elements,
                                     double[] eltSqLengths,
                                     double[][] centroidss,
                                     int[][] featuress,
                                     double[][] valss,
                                     double[] sqDistToCenters,
                                     int[] closestCenters,
                                     int maxEpochs,
                                     Reporter reporter) {
        int numClusters = centroidss.length;
        int numDims = centroidss[0].length;
        int numElements = elements.length;
        final double[] centroidSqLengths = centroidSqLengths(centroidss);
        boolean[] lastCentroidChanges = createBooleanArray(numClusters,true);
        final int[] changedClusters = new int[numClusters];
        final int[] counts = new int[numClusters];
        double lastError = Double.POSITIVE_INFINITY;
        for (int epoch = 0; epoch < maxEpochs; ++epoch) {
            reporter.report(LogLevel.DEBUG,"Epoch=" + epoch);
            boolean atLeastOneClusterChanged = false;
            int numChangedClusters
                = setChangedClusters(changedClusters,lastCentroidChanges);
            reporter.report(LogLevel.DEBUG,"     #changed clusters=" + numChangedClusters);
            final boolean[] centroidChanges = createBooleanArray(numClusters,false);
            // *** multi-thread all of this loop; optimistic updates sets to true ***
            for (int i = 0; i < numElements; ++i) {
                final int[] featuresI = featuress[i];
                final double[] valsI = valss[i];
                final double eltSqLengthI = eltSqLengths[i];
                double closestSqDistToCenter
                    = lastCentroidChanges[closestCenters[i]]
                    ? Double.POSITIVE_INFINITY
                    : sqDistToCenters[i];
                int bestCenter = -1; // set if beat unchanged ctr, or if ctr changed
                for (int kk = 0; kk < numChangedClusters; ++kk) {
                    // cut and paste (below)
                    int k = changedClusters[kk];
                    double sqDistToCenter
                        = centroidSqLengths[k]
                        + eltSqLengthI
                        - 2.0 * product(centroidss[k],featuresI,valsI);
                    if (sqDistToCenter < closestSqDistToCenter) {
                        closestSqDistToCenter = sqDistToCenter;
                        bestCenter = k;
                    }
                }
                // leave unchanged if can't beat old (or infty if old's expired)
                if (bestCenter == -1) continue;
                // have to be worse than previous best, or skip unchanged clusts
                if (closestSqDistToCenter > sqDistToCenters[i]) {
                    for (int kk = numChangedClusters; kk < numClusters; ++kk) {
                        // cut and paste (above)
                        int k = changedClusters[kk];
                        double sqDistToCenter
                            = centroidSqLengths[k]
                            + eltSqLengthI
                            - 2.0 * product(centroidss[k],featuresI,valsI);
                        if (sqDistToCenter < closestSqDistToCenter) {
                            closestSqDistToCenter = sqDistToCenter;
                            bestCenter = k;
                        }
                    }
                }
                // could change even if center id doesn't
                sqDistToCenters[i] = closestSqDistToCenter;
                if (bestCenter == closestCenters[i]) continue;
                atLeastOneClusterChanged = true;
                centroidChanges[bestCenter] = true;        // to
                centroidChanges[closestCenters[i]] = true; // from
                closestCenters[i] = bestCenter;
            }
            double error = sum(sqDistToCenters)/numElements;
            reporter.report(LogLevel.DEBUG,"     avg dist to center=" + error);
            if (!atLeastOneClusterChanged) {
                reporter.report(LogLevel.INFO,"Converged by no elements changing cluster.");
                break;
            }
            double relImprovement = relativeImprovement(lastError,error);
            if (relImprovement < mMinRelativeImprovement) {
                reporter.report(LogLevel.INFO,"Converged by relative improvement < threshold");
                break;
            }
            lastError = error;
            Arrays.fill(counts,0);
            int numChangedElts = 0;
            for (int k = 0; k < numClusters; ++k)
                if (centroidChanges[k])
                    Arrays.fill(centroidss[k],0.0);
            for (int i = 0; i < numElements; ++i) {
                int closestCenterI = closestCenters[i];
                if (centroidChanges[closestCenterI]) {
                    increment(centroidss[closestCenterI],
                              featuress[i],valss[i]);
                    ++counts[closestCenterI];
                    ++numChangedElts;
                }
            }
            reporter.report(LogLevel.DEBUG,"     #changed elts=" + numChangedElts);
            for (int k = 0; k < numClusters; ++k) {
                if (counts[k] > 0) {
                    final double[] centroidK = centroidss[k];
                    double countD = (double) counts[k];
                    double sqLength = 0.0;
                    for (int d = 0; d < numDims; ++d) {
                        centroidK[d] /= countD;
                        sqLength += centroidK[d] * centroidK[d];
                    }
                    centroidSqLengths[k] = sqLength;
                }
            }
            lastCentroidChanges = centroidChanges;
            if (epoch == (maxEpochs-1)) {
                reporter.report(LogLevel.INFO,
                                "Reached max epochs. Breaking without convergence.");
            }
        }
        reporter.report(LogLevel.DEBUG,"Constructing Result");
        List<ObjectToDoubleMap<E>> scoreMapList
            = new ArrayList<ObjectToDoubleMap<E>>(numClusters);
        double[] totalScores = new double[numClusters];
        for (int k = 0; k < numClusters; ++k)
            scoreMapList.add(new ObjectToDoubleMap<E>());
        for (int i = 0; i < numElements; ++i) {
            scoreMapList.get(closestCenters[i]).set(elements[i],
                                                    sqDistToCenters[i] == 0.0
                                                    ? -Double.MIN_VALUE
                                                    : -sqDistToCenters[i]);
            totalScores[closestCenters[i]] -= sqDistToCenters[i];
        }
        ObjectToDoubleMap<Set<E>> clusterScores
            = new ObjectToDoubleMap<Set<E>>();
        for (int k = 0; k < numClusters; ++k) {
            ObjectToDoubleMap<E> clusterDistances = scoreMapList.get(k);
            if (clusterDistances.isEmpty()) continue;
            Set<E> cluster
                = new LinkedHashSet<E>(clusterDistances.keysOrderedByValueList());
            clusterScores.set(cluster,
                              totalScores[k] == 0.0
                              ? -Double.MIN_VALUE
                              : totalScores[k]/cluster.size());
        }
        Set<Set<E>> result
            = new LinkedHashSet<Set<E>>(clusterScores.keysOrderedByValueList());
        return result;
    }

    static double relativeImprovement(double x, double y) {
        return Math.abs(2.0 * (x - y) / (Math.abs(x) + Math.abs(y)));
    }

    static int setChangedClusters(int[] clusterIndexes, boolean[] changed) {
        int numChanged = 0;
        int numNotChanged = clusterIndexes.length - 1;
        // really fancy: numNotChanged = changed.length-i-numChanged-1
        for (int i = 0; i < changed.length; ++i)
            clusterIndexes[changed[i] ? numChanged++ : numNotChanged--] = i;
        return numChanged;
    }

    static boolean[] createBooleanArray(int length, boolean fillValue) {
        boolean[] result = new boolean[length];
        if (fillValue)
            Arrays.fill(result,true);
        return result;
    }

    private MapSymbolTable toVectors(E[] elements,
                                     int[][] featuress,
                                     double[][] valss,
                                     double[] eltSqLengths) {
        MapSymbolTable symTab = new MapSymbolTable();
        for (int i = 0; i < elements.length; ++i) {
            E e = elements[i];
            Map<String,? extends Number> featureMap
                = mFeatureExtractor.features(e);
            featuress[i] = new int[featureMap.size()];
            valss[i] = new double[featureMap.size()];
            int j = 0;
            for (Map.Entry<String,? extends Number> entry : featureMap.entrySet()) {
                featuress[i][j] = symTab.getOrAddSymbol(entry.getKey());
                valss[i][j] = entry.getValue().doubleValue();
                ++j;
            }
            eltSqLengths[i] = selfProduct(valss[i]);
        }
        return symTab;
    }

    private Set<Set<E>> trivialClustering(Set<? extends E> elementSet) {
        Set<Set<E>> clustering
            = new HashSet<Set<E>>((3 * elementSet.size()) / 2);
        for (E elt : elementSet) {
            Set<E> cluster = SmallSet.<E>create(elt);
            clustering.add(cluster);
        }
        return clustering;
    }

    private void randomInit(int[][] featuress,
                            double[][] valss,
                            int[] closestCenters,
                            double[][] centroidss,
                            Random random) {
        int numClusters = centroidss.length;
        int numElements = featuress.length;
        // randomly shuffle the elements into balanced clusters
        int[] permutation = Statistics.permutation(numElements,random);
        for (int i = 0; i < numElements; ++i)
            closestCenters[permutation[i]] = i % numClusters;
        setCentroids(centroidss,featuress,valss,closestCenters);
    }

    private void kmeansPlusPlusInit(int[][] featuress,
                                    double[][] valss,
                                    double[] eltSqLengths,
                                    int[] closestCenters,
                                    double[][] centroidss,
                                    Random random) {
        int numClusters = centroidss.length;
        int numElements = featuress.length;
        double[] sqDistToCenters = new double[numElements];
        Arrays.fill(sqDistToCenters,Double.POSITIVE_INFINITY);
        for (int k = 0; k < numClusters; ++k) {
            final double[] centroidK = centroidss[k];
            int centroidIndex
                = (k == 0)
                ? random.nextInt(numElements)
                : sampleNextCenter(sqDistToCenters,random);
            setCentroid(centroidK,
                        featuress[centroidIndex],valss[centroidIndex]);
            double centroidSqLength = selfProduct(valss[centroidIndex]);
            for (int i = 0; i < numElements; ++i) {
                double sqDistToCenter
                    = centroidSqLength
                    + eltSqLengths[i]
                    - 2.0 * product(centroidK,featuress[i],valss[i]);
                if (sqDistToCenter < sqDistToCenters[i]) {
                    sqDistToCenters[i] = sqDistToCenter;
                    closestCenters[i] = k;
                }
            }
        }
        for (double[] centroid : centroidss)
            Arrays.fill(centroid,0.0); // reset after previous use
        setCentroids(centroidss,featuress,valss,closestCenters);
    }

    private void setCentroids(double[][] centroidss,
                              int[][] featuress,
                              double[][] valss,
                              int[] closestCenters) {
        int numClusters = centroidss.length;
        int numElements = featuress.length;
        final int[] count = new int[numClusters];
        for (int i = 0; i < numElements; ++i) {
            increment(centroidss[closestCenters[i]],
                      featuress[i],valss[i]);
            ++count[closestCenters[i]];
        }
        for (int k = 0; k < numClusters; ++k) {
            double countK = count[k];
            final double[] centroid = centroidss[k];
            for (int d = 0; d < centroid.length; ++d)
                centroid[d] = centroid[d] / countK;
        }
    }

    private static int sampleNextCenter(double[] probRatios, Random random) {
        double samplePoint = random.nextDouble() * sum(probRatios);
        double total = 0.0;
        for (int i = 0; i < probRatios.length; ++i) {
            total += probRatios[i];
            if (total >= samplePoint)
                return i;
        }
        return probRatios.length-1; // arithmetic overrun
    }

    private static double[] centroidSqLengths(double[][] centroidss) {
        double[] result = new double[centroidss.length];
        for (int i = 0; i < result.length; ++i)
            result[i] = selfProduct(centroidss[i]);
        return result;
    }

    private static double selfProduct(double[] xs) {
        double sum = 0.0;
        for (int i = 0; i < xs.length; ++i)
            sum += xs[i] * xs[i];
        return sum;
    }

    private static double sum(double[] xs) {
        double sum = 0.0;
        for (int i = 0; i < xs.length; ++i)
            sum += xs[i];
        return sum;
    }

    // x = (indexes,values); c = centroid; returns x*c
    private static double product(double[] centroid,
                                  int[] features,
                                  double[] values) {
        double sum = 0.0;
        for (int i = 0; i < features.length; ++i)
            sum += values[i] * centroid[features[i]];
        return sum;
    }

    private static void setCentroid(double[] centroid,
                                    int[] indexes,
                                    double[] values) {
        for (int i = 0; i < indexes.length; ++i)
            centroid[indexes[i]] = values[i];
    }

    private static void increment(double[] centroid,
                                  int[] indexes,
                                  double[] values) {
        for (int i = 0; i < indexes.length; ++i)
            centroid[indexes[i]] += values[i];
    }

}


