/*
 * Copyright (c) 2010-2021 Haifeng Li. All rights reserved.
 *
 * Smile is free software: you can redistribute it and/or modify
 * it under the terms of the GNU General Public License as published by
 * the Free Software Foundation, either version 3 of the License, or
 * (at your option) any later version.
 *
 * Smile is distributed in the hope that it will be useful,
 * but WITHOUT ANY WARRANTY; without even the implied warranty of
 * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
 * GNU General Public License for more details.
 *
 * You should have received a copy of the GNU General Public License
 * along with Smile.  If not, see <https://www.gnu.org/licenses/>.
 */

package smile

import smile.util.time

/** Manifold learning finds a low-dimensional basis for describing
  * high-dimensional data. Manifold learning is a popular approach to nonlinear
  * dimensionality reduction. Algorithms for this task are based on the idea
  * that the dimensionality of many data sets is only artificially high; though
  * each data point consists of perhaps thousands of features, it may be
  * described as a function of only a few underlying parameters. That is, the
  * data points are actually samples from a low-dimensional manifold that is
  * embedded in a high-dimensional space. Manifold learning algorithms attempt
  * to uncover these parameters in order to find a low-dimensional representation
  * of the data.
  *
  * Some prominent approaches are locally linear embedding
  * (LLE), Hessian LLE, Laplacian eigenmaps, and LTSA. These techniques
  * construct a low-dimensional data representation using a cost function
  * that retains local properties of the data, and can be viewed as defining
  * a graph-based kernel for Kernel PCA. More recently, techniques have been
  * proposed that, instead of defining a fixed kernel, try to learn the kernel
  * using semidefinite programming. The most prominent example of such a
  * technique is maximum variance unfolding (MVU). The central idea of MVU
  * is to exactly preserve all pairwise distances between nearest neighbors
  * (in the inner product space), while maximizing the distances between points
  * that are not nearest neighbors.
  *
  * An alternative approach to neighborhood preservation is through the
  * minimization of a cost function that measures differences between
  * distances in the input and output spaces. Important examples of such
  * techniques include classical multidimensional scaling (which is identical
  * to PCA), Isomap (which uses geodesic distances in the data space), diffusion
  * maps (which uses diffusion distances in the data space), t-SNE (which
  * minimizes the divergence between distributions over pairs of points),
  * and curvilinear component analysis.
  *
  * @author Haifeng Li
  */
package object manifold {
  /** Isometric feature mapping. Isomap is a widely used low-dimensional embedding method,
    * where geodesic distances on a weighted graph are incorporated with the
    * classical multidimensional scaling. Isomap is used for computing a
    * quasi-isometric, low-dimensional embedding of a set of high-dimensional
    * data points. Isomap is highly efficient and generally applicable to a broad
    * range of data sources and dimensionalities.
    *
    * To be specific, the classical MDS performs low-dimensional embedding based
    * on the pairwise distance between data points, which is generally measured
    * using straight-line Euclidean distance. Isomap is distinguished by
    * its use of the geodesic distance induced by a neighborhood graph
    * embedded in the classical scaling. This is done to incorporate manifold
    * structure in the resulting embedding. Isomap defines the geodesic distance
    * to be the sum of edge weights along the shortest path between two nodes.
    * The top n eigenvectors of the geodesic distance matrix represent the
    * coordinates in the new n-dimensional Euclidean space.
    *
    * The connectivity of each data point in the neighborhood graph is defined
    * as its nearest k Euclidean neighbors in the high-dimensional space. This
    * step is vulnerable to "short-circuit errors" if k is too large with
    * respect to the manifold structure or if noise in the data moves the
    * points slightly off the manifold. Even a single short-circuit error
    * can alter many entries in the geodesic distance matrix, which in turn
    * can lead to a drastically different (and incorrect) low-dimensional
    * embedding. Conversely, if k is too small, the neighborhood graph may
    * become too sparse to approximate geodesic paths accurately.
    *
    * This function also supports C-Isomap, which magnifies regions of high
    * density and shrinks regions of low density of data points in the manifold.
    * Edge weights that are maximized in multidimensional scaling (MDS) are
    * modified, with everything else remaining unaffected.
    *
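    * A minimal usage sketch on synthetic data (the random data set is purely
    * illustrative, and reading the embedding from the result's `coordinates`
    * field is an assumption about the returned `IsoMap` object):
    * {{{
    *   import smile.manifold._
    *
    *   // 200 random points in 5-dimensional space
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(200)(Array.fill(5)(rng.nextDouble()))
    *
    *   // 2-dimensional embedding over a 10-nearest-neighbor graph
    *   val map = isomap(x, k = 10, d = 2)
    *   val y = map.coordinates  // assumed field holding the d-dimensional coordinates
    * }}}
    *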
    * ====References:====
    *  - J. B. Tenenbaum, V. de Silva and J. C. Langford. A Global Geometric Framework for Nonlinear Dimensionality Reduction. Science 290(5500):2319-2323, 2000.
    *
    * @param data the data set.
    * @param d the dimension of the manifold.
    * @param k k-nearest neighbor.
    * @param CIsomap C-Isomap algorithm if true, otherwise standard algorithm.
    */
  def isomap(data: Array[Array[Double]], k: Int, d: Int = 2, CIsomap: Boolean = true): IsoMap = time("IsoMap") {
    IsoMap.of(data, k, d, CIsomap)
  }

  /** Locally Linear Embedding. It has several advantages over Isomap, including
    * faster optimization when implemented to take advantage of sparse matrix
    * algorithms, and better results with many problems. LLE also begins by
    * finding a set of the nearest neighbors of each point. It then computes
    * a set of weights for each point that best describe the point as a linear
    * combination of its neighbors. Finally, it uses an eigenvector-based
    * optimization technique to find the low-dimensional embedding of points,
    * such that each point is still described with the same linear combination
    * of its neighbors. LLE tends to handle non-uniform sample densities poorly
    * because there is no fixed unit to prevent the weights from drifting as
    * various regions differ in sample densities.
    *
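    * A minimal usage sketch (synthetic data; the `coordinates` field access is
    * an assumption about the returned `LLE` object):
    * {{{
    *   import smile.manifold._
    *
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(200)(Array.fill(5)(rng.nextDouble()))
    *
    *   // 2-dimensional LLE embedding with 12 neighbors per point
    *   val embedding = lle(x, k = 12, d = 2)
    *   val y = embedding.coordinates  // assumed field holding the embedding
    * }}}
    *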
    * ====References:====
    *  - Sam T. Roweis and Lawrence K. Saul. Nonlinear Dimensionality Reduction by Locally Linear Embedding. Science 290(5500):2323-2326, 2000.
    *
    * @param data the data set.
    * @param d the dimension of the manifold.
    * @param k k-nearest neighbor.
    */
  def lle(data: Array[Array[Double]], k: Int, d: Int = 2): LLE = time("LLE") {
    LLE.of(data, k, d)
  }

  /** Laplacian Eigenmap. Using the notion of the Laplacian of the nearest
    * neighbor adjacency graph, Laplacian Eigenmap computes a low-dimensional
    * representation of the dataset that optimally preserves local neighborhood
    * information in a certain sense. The representation map generated by the
    * algorithm may be viewed as a discrete approximation to a continuous map
    * that naturally arises from the geometry of the manifold.
    *
    * The locality preserving character of the Laplacian Eigenmap algorithm makes
    * it relatively insensitive to outliers and noise. It is also not prone to
    * "short circuiting" as only the local distances are used.
    *
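    * A minimal usage sketch (synthetic data; `t = 5.0` selects a heat-kernel
    * width, while the default non-positive t gives discrete weights; the
    * `coordinates` field access is an assumption about the returned object):
    * {{{
    *   import smile.manifold._
    *
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(200)(Array.fill(5)(rng.nextDouble()))
    *
    *   // 2-dimensional Laplacian Eigenmap with heat-kernel edge weights
    *   val embedding = laplacian(x, k = 10, d = 2, t = 5.0)
    *   val y = embedding.coordinates  // assumed field holding the embedding
    * }}}
    *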
    * ====References:====
    *  - Mikhail Belkin and Partha Niyogi. Laplacian Eigenmaps and Spectral Techniques for Embedding and Clustering. NIPS, 2001.
    *
    * @param data the data set.
    * @param d the dimension of the manifold.
    * @param k k-nearest neighbor.
    * @param t the smooth/width parameter of the heat kernel e^(-||x-y||^2 / t).
    *          Non-positive value means discrete weights.
    */
  def laplacian(data: Array[Array[Double]], k: Int, d: Int = 2, t: Double = -1): LaplacianEigenmap = time("Laplacian Eigen Map") {
    LaplacianEigenmap.of(data, k, d, t)
  }

  /** t-distributed stochastic neighbor embedding. t-SNE is a nonlinear
    * dimensionality reduction technique that is particularly well suited
    * for embedding high-dimensional data into a space of two or three
    * dimensions, which can then be visualized in a scatter plot. Specifically,
    * it models each high-dimensional object by a two- or three-dimensional
    * point in such a way that similar objects are modeled by nearby points
    * and dissimilar objects are modeled by distant points.
    *
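    * A minimal usage sketch (synthetic data; the perplexity should be well below
    * the number of samples, and the `coordinates` field access is an assumption
    * about the returned `TSNE` object):
    * {{{
    *   import smile.manifold._
    *
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(300)(Array.fill(10)(rng.nextDouble()))
    *
    *   // 2-dimensional t-SNE embedding
    *   val embedding = tsne(x, d = 2, perplexity = 20.0, iterations = 1000)
    *   val y = embedding.coordinates  // assumed field holding the embedding
    * }}}
    *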
    * ====References:====
    *  - L.J.P. van der Maaten. Accelerating t-SNE using Tree-Based Algorithms. Journal of Machine Learning Research 15(Oct):3221-3245, 2014.
    *  - L.J.P. van der Maaten and G.E. Hinton. Visualizing Non-Metric Similarities in Multiple Maps. Machine Learning 87(1):33-55, 2012.
    *  - L.J.P. van der Maaten. Learning a Parametric Embedding by Preserving Local Structure. In Proceedings of the Twelfth International Conference on Artificial Intelligence & Statistics (AI-STATS), JMLR W&CP 5:384-391, 2009.
    *  - L.J.P. van der Maaten and G.E. Hinton. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 9(Nov):2579-2605, 2008.
    *
    * @param X input data. If X is a square matrix, it is assumed to be the squared distance/dissimilarity matrix.
    * @param d the dimension of the manifold.
    * @param perplexity the perplexity of the conditional distribution.
    * @param eta        the learning rate.
    * @param iterations the number of iterations.
    */
  def tsne(X: Array[Array[Double]], d: Int = 2, perplexity: Double = 20.0, eta: Double = 200.0, iterations: Int = 1000): TSNE = time("t-SNE") {
    new TSNE(X, d, perplexity, eta, iterations)
  }

  /**
    * Uniform Manifold Approximation and Projection.
    *
    * UMAP is a dimension reduction technique that can be used for visualization
    * similarly to t-SNE, but also for general non-linear dimension reduction.
    * The algorithm is founded on three assumptions about the data:
    *
    *  - The data is uniformly distributed on a Riemannian manifold;
    *  - The Riemannian metric is locally constant (or can be approximated as such);
    *  - The manifold is locally connected.
    *
    * From these assumptions it is possible to model the manifold with a fuzzy
    * topological structure. The embedding is found by searching for a low
    * dimensional projection of the data that has the closest possible equivalent
    * fuzzy topological structure.
    *
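    * A minimal usage sketch (synthetic data; leaving iterations at 0 lets the
    * function pick a size-dependent default, and the `coordinates` field access
    * is an assumption about the returned `UMAP` object):
    * {{{
    *   import smile.manifold._
    *
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(500)(Array.fill(10)(rng.nextDouble()))
    *
    *   // 2-dimensional UMAP embedding with 15 nearest neighbors
    *   val embedding = umap(x, k = 15, d = 2, minDist = 0.1, spread = 1.0)
    *   val y = embedding.coordinates  // assumed field holding the embedding
    * }}}
    *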
    * @param data               the input data.
    * @param k                  k-nearest neighbors. Larger values result in more global views
    *                           of the manifold, while smaller values result in more local data
    *                           being preserved. Generally in the range 2 to 100.
    * @param d                  The target embedding dimension. Defaults to 2 to provide easy
    *                           visualization, but can reasonably be set to any integer value
    *                           in the range 2 to 100.
    * @param iterations         The number of iterations to optimize the
    *                           low-dimensional representation. Larger values result in a more
    *                           accurate embedding. Must be at least 10; smaller values
    *                           (including the default 0) are replaced by a default chosen from
    *                           the size of the input data, e.g., 200 for large data
    *                           (10,000+ samples) and 500 for smaller data sets.
    * @param learningRate       The initial learning rate for the embedding optimization,
    *                           default 1.
    * @param minDist            The desired separation between close points in the embedding
    *                           space. Smaller values will result in a more clustered/clumped
    *                           embedding where nearby points on the manifold are drawn closer
    *                           together, while larger values will result in a more even
    *                           dispersal of points. The value should be set no greater than,
    *                           and relative to, the spread value, which determines the scale
    *                           at which embedded points will be spread out. default 0.1.
    * @param spread             The effective scale of embedded points. In combination with
    *                           minDist, this determines how clustered/clumped the embedded
    *                           points are. default 1.0.
    * @param negativeSamples    The number of negative samples to select per positive sample
    *                           in the optimization process. Increasing this value will result
    *                           in greater repulsive force being applied, greater optimization
    *                           cost, but slightly better accuracy, default 5.
    * @param repulsionStrength  Weighting applied to negative samples in low dimensional
    *                           embedding optimization. Values higher than one will result in
    *                           greater weight being given to negative samples, default 1.0.
    */
  def umap(data: Array[Array[Double]], k: Int = 15, d: Int = 2, iterations: Int = 0, learningRate: Double = 1.0, minDist: Double = 0.1, spread: Double = 1.0, negativeSamples: Int = 5, repulsionStrength: Double = 1.0): UMAP = time("UMAP") {
    UMAP.of(data, k, d,
      if (iterations >= 10) iterations else if (data.length > 10000) 200 else 500,
      learningRate, minDist, spread, negativeSamples, repulsionStrength)
  }

  /** Classical multidimensional scaling, also known as principal coordinates
    * analysis. Given a matrix of dissimilarities (e.g. pairwise distances), MDS
    * finds a set of points in a low-dimensional space that well approximates the
    * given dissimilarities. We are not restricted to using a Euclidean
    * distance metric. However, when Euclidean distances are used, MDS is
    * equivalent to PCA.
    *
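    * A minimal usage sketch (the plain Euclidean distance matrix is built inline
    * from synthetic data; the `coordinates` field access is an assumption about
    * the returned `MDS` object):
    * {{{
    *   import smile.manifold._
    *
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(100)(Array.fill(5)(rng.nextDouble()))
    *
    *   // plain (not squared) pairwise Euclidean distances
    *   val dist = Array.tabulate(x.length, x.length) { (i, j) =>
    *     math.sqrt(x(i).zip(x(j)).map { case (a, b) => (a - b) * (a - b) }.sum)
    *   }
    *
    *   // project onto the first 2 principal coordinates
    *   val pcoa = mds(dist, k = 2)
    *   val y = pcoa.coordinates  // assumed field holding the principal coordinates
    * }}}
    *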
    * @param proximity the non-negative proximity matrix of dissimilarities. The
    *                  diagonal should be zero and all other elements should be positive and
    *                  symmetric. For a pairwise distance matrix, the entries should be the
    *                  plain distances, not squared distances.
    * @param k the dimension of the projection.
    * @param positive if true, estimate an appropriate constant to be added
    *            to all the dissimilarities, apart from the self-dissimilarities, that
    *            makes the learning matrix positive semi-definite. The other formulation of
    *            the additive constant problem is as follows. If the proximity is
    *            measured on an interval scale, where there is no natural origin, then there
    *            is no direct correspondence between the dissimilarities and the distances in
    *            the Euclidean space used to represent the objects. In this case, we can
    *            estimate a constant c such that proximity + c may be taken as ratio data,
    *            and also possibly minimize the dimensionality of the Euclidean space
    *            required for representing the objects.
    */
  def mds(proximity: Array[Array[Double]], k: Int, positive: Boolean = false): MDS = time("MDS") {
    MDS.of(proximity, k, positive)
  }

  /** Kruskal's nonmetric MDS. In non-metric MDS, only the rank order of entries
    * in the proximity matrix (not the actual dissimilarities) is assumed to
    * contain the significant information. Hence, the distances of the final
    * configuration should as far as possible be in the same rank order as the
    * original data. Note that a perfect ordinal re-scaling of the data into
    * distances is usually not possible. The relationship is typically found
    * using isotonic regression.
    *
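    * A minimal usage sketch (reusing a plain pairwise distance matrix built from
    * synthetic data as the proximity input; the `coordinates` field access is an
    * assumption about the returned `IsotonicMDS` object):
    * {{{
    *   import smile.manifold._
    *
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(100)(Array.fill(5)(rng.nextDouble()))
    *   val dist = Array.tabulate(x.length, x.length) { (i, j) =>
    *     math.sqrt(x(i).zip(x(j)).map { case (a, b) => (a - b) * (a - b) }.sum)
    *   }
    *
    *   // 2-dimensional nonmetric MDS configuration
    *   val nmds = isomds(dist, k = 2, tol = 0.0001, maxIter = 200)
    *   val y = nmds.coordinates  // assumed field holding the configuration
    * }}}
    *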
    * @param proximity the non-negative proximity matrix of dissimilarities. The
    *                  diagonal should be zero and all other elements should be positive and symmetric.
    * @param k the dimension of the projection.
    * @param tol tolerance for stopping iterations.
    * @param maxIter maximum number of iterations.
    */
  def isomds(proximity: Array[Array[Double]], k: Int, tol: Double = 0.0001, maxIter: Int = 200): IsotonicMDS = time("Kruskal's nonmetric MDS") {
    IsotonicMDS.of(proximity, k, tol, maxIter)
  }

  /** Sammon's mapping is an iterative technique for making interpoint
    * distances in the low-dimensional projection as close as possible to the
    * interpoint distances in the high-dimensional object. Two points close
    * together in the high-dimensional space should appear close together in the
    * projection, while two points far apart in the high-dimensional space should
    * appear far apart in the projection. Sammon's mapping is a special case of
    * metric least-squares multidimensional scaling.
    *
    * Ideally, when we project from a high-dimensional space to a low-dimensional
    * space, the image would be geometrically congruent to the original figure;
    * this is called an isometric projection. Unfortunately, it is rarely possible
    * to isometrically project objects down into lower-dimensional spaces. Instead
    * of trying to achieve equality between corresponding interpoint distances, we
    * can minimize the difference between corresponding interpoint distances.
    * This is one goal of Sammon's mapping. A second goal is to preserve the
    * topology as well as possible by giving greater emphasis to smaller
    * interpoint distances. The algorithm has the advantage that whenever it is
    * possible to isometrically project an object into a lower-dimensional space,
    * it will be isometrically projected. Whenever an object cannot be projected
    * down isometrically, Sammon's mapping projects it so as to reduce the
    * distortion in interpoint distances and to limit the change in the topology
    * of the object.
    *
    * The projection cannot be solved in a closed form and may be found by an
    * iterative algorithm such as gradient descent suggested by Sammon. Kohonen
    * also provides a heuristic that is simple and works reasonably well.
    *
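    * A minimal usage sketch (again starting from a plain pairwise distance matrix
    * of synthetic data; the `coordinates` field access is an assumption about the
    * returned `SammonMapping` object):
    * {{{
    *   import smile.manifold._
    *
    *   val rng = new scala.util.Random(1)
    *   val x = Array.fill(100)(Array.fill(5)(rng.nextDouble()))
    *   val dist = Array.tabulate(x.length, x.length) { (i, j) =>
    *     math.sqrt(x(i).zip(x(j)).map { case (a, b) => (a - b) * (a - b) }.sum)
    *   }
    *
    *   // 2-dimensional Sammon's mapping with the default step size and tolerances
    *   val mapping = sammon(dist, k = 2)
    *   val y = mapping.coordinates  // assumed field holding the projection
    * }}}
    *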
    * @param proximity the non-negative proximity matrix of dissimilarities. The
    *                  diagonal should be zero and all other elements should be positive and symmetric.
    * @param k         the dimension of the projection.
    * @param lambda    initial value of the step size constant in diagonal Newton method.
    * @param tol       tolerance for stopping iterations.
    * @param stepTol   tolerance on step size.
    * @param maxIter   maximum number of iterations.
    */
  def sammon(proximity: Array[Array[Double]], k: Int, lambda: Double = 0.2, tol: Double = 0.0001, stepTol: Double = 0.001, maxIter: Int = 100): SammonMapping = time("Sammon's Mapping") {
    SammonMapping.of(proximity, k, lambda, tol, stepTol, maxIter)
  }

  /** Hacking scaladoc [[https://github.com/scala/bug/issues/8124 issue-8124]].
    * The user should ignore this object. */
  object $dummy
}
