smile.regression.kt Maven / Gradle / Ivy
/*
* Copyright (c) 2010-2021 Haifeng Li. All rights reserved.
*
* Smile is free software: you can redistribute it and/or modify
* it under the terms of the GNU General Public License as published by
* the Free Software Foundation, either version 3 of the License, or
* (at your option) any later version.
*
* Smile is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with Smile. If not, see .
*/
package smile.regression
import smile.base.cart.Loss
import smile.base.rbf.RBF
import smile.data.DataFrame
import smile.data.formula.Formula
import smile.math.kernel.MercerKernel
/**
* Fitting linear models (ordinary least squares). In linear regression,
* the model specification is that the dependent variable is a linear
* combination of the parameters (but need not be linear in the independent
* variables). The residual is the difference between the value of the
* dependent variable predicted by the model, and the true value of the
* dependent variable. Ordinary least squares obtains parameter estimates
* that minimize the sum of squared residuals, SSE (also denoted RSS).
*
* The OLS estimator is consistent when the independent variables are
* exogenous and there is no multicollinearity, and optimal in the class
* of linear unbiased estimators when the errors are homoscedastic and
* serially uncorrelated. Under these conditions, the method of OLS provides
* minimum-variance mean-unbiased estimation when the errors have finite
* variances.
*
* There are several different frameworks in which the linear regression
* model can be cast in order to make the OLS technique applicable. Each
* of these settings produces the same formulas and same results, the only
* difference is the interpretation and the assumptions which have to be
* imposed in order for the method to give meaningful results. The choice
* of the applicable framework depends mostly on the nature of data at hand,
* and on the inference task which has to be performed.
*
* Least squares corresponds to the maximum likelihood criterion if the
* experimental errors have a normal distribution and can also be derived
* as a method of moments estimator.
*
* Once a regression model has been constructed, it may be important to
* confirm the goodness of fit of the model and the statistical significance
* of the estimated parameters. Commonly used checks of goodness of fit
* include the R-squared, analysis of the pattern of residuals and hypothesis
* testing. Statistical significance can be checked by an F-test of the overall
* fit, followed by t-tests of individual parameters.
*
* Interpretations of these diagnostic tests rest heavily on the model
* assumptions. Although examination of the residuals can be used to
* invalidate a model, the results of a t-test or F-test are sometimes more
* difficult to interpret if the model's assumptions are violated.
* For example, if the error term does not have a normal distribution,
* in small samples the estimated parameters will not follow normal
* distributions and complicate inference. With relatively large samples,
* however, a central limit theorem can be invoked such that hypothesis
* testing may proceed using asymptotic approximations.
*
* @param formula a symbolic description of the model to be fitted.
* @param data the data frame of the explanatory and response variables.
* @param method the fitting method ("svd" or "qr").
* @param recursive if true, the return model supports recursive least squares.
*/
fun lm(formula: Formula, data: DataFrame, method: String = "qr", stderr: Boolean = true, recursive: Boolean = true): LinearModel {
return OLS.fit(formula, data, method, stderr, recursive)
}
/**
* Ridge Regression. When the predictor variables are highly correlated amongst
* themselves, the coefficients of the resulting least squares fit may be very
* imprecise. By allowing a small amount of bias in the estimates, more
* reasonable coefficients may often be obtained. Ridge regression is one
* method to address these issues. Often, small amounts of bias lead to
* dramatic reductions in the variance of the estimated model coefficients.
* Ridge regression is such a technique which shrinks the regression
* coefficients by imposing a penalty on their size. Ridge regression was
* originally developed to overcome the singularity of the X'X matrix.
* This matrix is perturbed so as to make its determinant appreciably
* different from 0.
*
* Ridge regression is a kind of Tikhonov regularization, which is the most
* commonly used method of regularization of ill-posed problems. Another
* interpretation of ridge regression is available through Bayesian estimation.
* In this setting the belief that weight should be small is coded into a prior
* distribution.
*
* @param formula a symbolic description of the model to be fitted.
* @param data the data frame of the explanatory and response variables.
* @param lambda the shrinkage/regularization parameter.
*/
fun ridge(formula: Formula, data: DataFrame, lambda: Double): LinearModel {
return RidgeRegression.fit(formula, data, lambda)
}
/**
* Least absolute shrinkage and selection operator.
* The Lasso is a shrinkage and selection method for linear regression.
* It minimizes the usual sum of squared errors, with a bound on the sum
* of the absolute values of the coefficients (i.e. L1-regularized).
* It has connections to soft-thresholding of wavelet coefficients, forward
* stage-wise regression, and boosting methods.
*
* The Lasso typically yields a sparse solution, of which the parameter
* vector β has relatively few nonzero coefficients. In contrast, the
* solution of L2-regularized least squares (i.e. ridge regression)
* typically has all coefficients nonzero. Because it effectively
* reduces the number of variables, the Lasso is useful in some contexts.
*
* For over-determined systems (more instances than variables, commonly in
* machine learning), we normalize variables with mean 0 and standard deviation
* 1. For under-determined systems (less instances than variables, e.g.
* compressed sensing), we assume white noise (i.e. no intercept in the linear
* model) and do not perform normalization. Note that the solution
* is not unique in this case.
*
* There is no analytic formula or expression for the optimal solution to the
* L1-regularized least squares problems. Therefore, its solution
* must be computed numerically. The objective function in the
* L1-regularized least squares is convex but not differentiable,
* so solving it is more of a computational challenge than solving the
* L2-regularized least squares. The Lasso may be solved using
* quadratic programming or more general convex optimization methods, as well
* as by specific algorithms such as the least angle regression algorithm.
*
* ====References:====
* - R. Tibshirani. Regression shrinkage and selection via the lasso. J. Royal. Statist. Soc B., 58(1):267-288, 1996.
* - B. Efron, I. Johnstone, T. Hastie, and R. Tibshirani. Least angle regression. Annals of Statistics, 2003
* - Seung-Jean Kim, K. Koh, M. Lustig, Stephen Boyd, and Dimitry Gorinevsky. An Interior-Point Method for Large-Scale L1-Regularized Least Squares. IEEE JOURNAL OF SELECTED TOPICS IN SIGNAL PROCESSING, VOL. 1, NO. 4, 2007.
*
* @param formula a symbolic description of the model to be fitted.
* @param data the data frame of the explanatory and response variables.
* @param lambda the shrinkage/regularization parameter.
* @param tol the tolerance for stopping iterations (relative target duality gap).
* @param maxIter the maximum number of iterations.
*/
fun lasso(formula: Formula, data: DataFrame, lambda: Double, tol: Double = 1E-3, maxIter: Int = 5000): LinearModel {
return LASSO.fit(formula, data, lambda, tol, maxIter)
}
/**
* Support vector regression. Like SVM for classification, the model produced
* by SVR depends only on a subset of the training data, because the cost
* function ignores any training data close to the model prediction (within
* a threshold).
*
* @param x training data.
* @param y response variable.
* @param kernel the kernel function.
* @param eps the loss function error threshold.
* @param C the soft margin penalty parameter.
* @param tol the tolerance of convergence test.
* @tparam T the data type
*
* @return SVR model.
*/
fun svr(x: Array, y: DoubleArray, kernel: MercerKernel, eps: Double, C: Double, tol: Double = 1E-3): KernelMachine {
return SVM.fit(x, y, kernel, eps, C, tol)
}
/**
* Regression tree. A classification/regression tree can be learned by
* splitting the training set into subsets based on an attribute value
* test. This process is repeated on each derived subset in a recursive
* manner called recursive partitioning. The recursion is completed when
* the subset at a node all has the same value of the target variable,
* or when splitting no longer adds value to the predictions.
*
* The algorithms that are used for constructing decision trees usually
* work top-down by choosing a variable at each step that is the next best
* variable to use in splitting the set of items. "Best" is defined by how
* well the variable splits the set into homogeneous subsets that have
* the same value of the target variable. Different algorithms use different
* formulae for measuring "best". Used by the CART algorithm, Gini impurity
* is a measure of how often a randomly chosen element from the set would
* be incorrectly labeled if it were randomly labeled according to the
* distribution of labels in the subset. Gini impurity can be computed by
* summing the probability of each item being chosen times the probability
* of a mistake in categorizing that item. It reaches its minimum (zero) when
* all cases in the node fall into a single target category. Information gain
* is another popular measure, used by the ID3, C4.5 and C5.0 algorithms.
* Information gain is based on the concept of entropy used in information
* theory. For categorical variables with different number of levels, however,
* information gain are biased in favor of those attributes with more levels.
* Instead, one may employ the information gain ratio, which solves the drawback
* of information gain.
*
* Classification and Regression Tree techniques have a number of advantages
* over many of those alternative techniques.
*
* - '''Simple to understand and interpret:'''
* In most cases, the interpretation of results summarized in a tree is
* very simple. This simplicity is useful not only for purposes of rapid
* classification of new observations, but can also often yield a much simpler
* "model" for explaining why observations are classified or predicted in a
* particular manner.
* - '''Able to handle both numerical and categorical data:'''
* Other techniques are usually specialized in analyzing datasets that
* have only one type of variable.
* - '''Nonparametric and nonlinear:'''
* The final results of using tree methods for classification or regression
* can be summarized in a series of (usually few) logical if-then conditions
* (tree nodes). Therefore, there is no implicit assumption that the underlying
* relationships between the predictor variables and the dependent variable
* are linear, follow some specific non-linear link function, or that they
* are even monotonic in nature. Thus, tree methods are particularly well
* suited for data mining tasks, where there is often little a priori
* knowledge nor any coherent set of theories or predictions regarding which
* variables are related and how. In those types of data analytics, tree
* methods can often reveal simple relationships between just a few variables
* that could have easily gone unnoticed using other analytic techniques.
*
* One major problem with classification and regression trees is their high
* variance. Often a small change in the data can result in a very different
* series of splits, making interpretation somewhat precarious. Besides,
* decision-tree learners can create over-complex trees that cause over-fitting.
* Mechanisms such as pruning are necessary to avoid this problem.
* Another limitation of trees is the lack of smoothness of the prediction
* surface.
*
* Some techniques such as bagging, boosting, and random forest use more than
* one decision tree for their analysis.
*
* @param formula a symbolic description of the model to be fitted.
* @param data the data frame of the explanatory and response variables.
* @param maxDepth the maximum depth of the tree.
* @param maxNodes the maximum number of leaf nodes in the tree.
* @param nodeSize the minimum size of leaf nodes.
* @return Regression tree model.
*/
fun cart(formula: Formula, data: DataFrame, maxDepth: Int = 20, maxNodes: Int = 0, nodeSize: Int = 5): RegressionTree {
return RegressionTree.fit(formula, data, maxDepth, if (maxNodes > 0) maxNodes else data.size() / nodeSize, nodeSize)
}
/**
* Random forest for regression. Random forest is an ensemble classifier
* that consists of many decision trees and outputs the majority vote of
* individual trees. The method combines bagging idea and the random
* selection of features.
*
* Each tree is constructed using the following algorithm:
*
* 1. If the number of cases in the training set is N, randomly sample N cases
* with replacement from the original data. This sample will
* be the training set for growing the tree.
* 2. If there are M input variables, a number m << M is specified such
* that at each node, m variables are selected at random out of the M and
* the best split on these m is used to split the node. The value of m is
* held constant during the forest growing.
* 3. Each tree is grown to the largest extent possible. There is no pruning.
*
* The advantages of random forest are:
*
* - For many data sets, it produces a highly accurate classifier.
* - It runs efficiently on large data sets.
* - It can handle thousands of input variables without variable deletion.
* - It gives estimates of what variables are important in the classification.
* - It generates an internal unbiased estimate of the generalization error
* as the forest building progresses.
* - It has an effective method for estimating missing data and maintains
* accuracy when a large proportion of the data are missing.
*
* The disadvantages are
*
* - Random forests are prone to over-fitting for some datasets. This is
* even more pronounced on noisy data.
* - For data including categorical variables with different number of
* levels, random forests are biased in favor of those attributes with more
* levels. Therefore, the variable importance scores from random forest are
* not reliable for this type of data.
*
* @param formula a symbolic description of the model to be fitted.
* @param data the data frame of the explanatory and response variables.
* @param ntrees the number of trees.
* @param mtry the number of input variables to be used to determine the decision
* at a node of the tree. dim/3 seems to give generally good performance,
* where dim is the number of variables.
* @param maxDepth the maximum depth of the tree.
* @param maxNodes the maximum number of leaf nodes in the tree.
* @param nodeSize the minimum size of leaf nodes.
* @param subsample the sampling rate for training tree. 1.0 means sampling with replacement.
* < 1.0 means sampling without replacement.
*
* @return Random forest regression model.
*/
fun randomForest(formula: Formula, data: DataFrame, ntrees: Int = 500, mtry: Int = 0, maxDepth: Int = 20,
maxNodes: Int = 500, nodeSize: Int = 5, subsample: Double = 1.0): RandomForest {
return RandomForest.fit(formula, data, ntrees, mtry, maxDepth, maxNodes, nodeSize, subsample)
}
/**
* Gradient boosted regression trees.
*
* Generic gradient boosting at the t-th step would fit a regression tree to
* pseudo-residuals. Let J be the number of its leaves. The tree partitions
* the input space into J disjoint regions and predicts a constant value in
* each region. The parameter J controls the maximum allowed
* level of interaction between variables in the model. With J = 2 (decision
* stumps), no interaction between variables is allowed. With J = 3 the model
* may include effects of the interaction between up to two variables, and
* so on. Hastie et al. comment that typically 4 ≤ J ≤ 8 work well
* for boosting and results are fairly insensitive to the choice of in
* this range, J = 2 is insufficient for many applications, and J > 10 is
* unlikely to be required.
*
* Fitting the training set too closely can lead to degradation of the model's
* generalization ability. Several so-called regularization techniques reduce
* this over-fitting effect by constraining the fitting procedure.
* One natural regularization parameter is the number of gradient boosting
* iterations T (i.e. the number of trees in the model when the base learner
* is a decision tree). Increasing T reduces the error on training set,
* but setting it too high may lead to over-fitting. An optimal value of T
* is often selected by monitoring prediction error on a separate validation
* data set.
*
* Another regularization approach is the shrinkage which times a parameter
* η (called the "learning rate") to update term.
* Empirically it has been found that using small learning rates (such as
* η < 0.1) yields dramatic improvements in model's generalization ability
* over gradient boosting without shrinking (η = 1). However, it comes at
* the price of increasing computational time both during training and
* prediction: lower learning rate requires more iterations.
*
* Soon after the introduction of gradient boosting Friedman proposed a
* minor modification to the algorithm, motivated by Breiman's bagging method.
* Specifically, he proposed that at each iteration of the algorithm, a base
* learner should be fit on a subsample of the training set drawn at random
* without replacement. Friedman observed a substantial improvement in
* gradient boosting's accuracy with this modification.
*
* Subsample size is some constant fraction f of the size of the training set.
* When f = 1, the algorithm is deterministic and identical to the one
* described above. Smaller values of f introduce randomness into the
* algorithm and help prevent over-fitting, acting as a kind of regularization.
* The algorithm also becomes faster, because regression trees have to be fit
* to smaller datasets at each iteration. Typically, f is set to 0.5, meaning
* that one half of the training set is used to build each base learner.
*
* Also, like in bagging, sub-sampling allows one to define an out-of-bag
* estimate of the prediction performance improvement by evaluating predictions
* on those observations which were not used in the building of the next
* base learner. Out-of-bag estimates help avoid the need for an independent
* validation dataset, but often underestimate actual performance improvement
* and the optimal number of iterations.
*
* Gradient tree boosting implementations often also use regularization by
* limiting the minimum number of observations in trees' terminal nodes.
* It's used in the tree building process by ignoring any splits that lead
* to nodes containing fewer than this number of training set instances.
* Imposing this limit helps to reduce variance in predictions at leaves.
*
* ====References:====
* - J. H. Friedman. Greedy Function Approximation: A Gradient Boosting Machine, 1999.
* - J. H. Friedman. Stochastic Gradient Boosting, 1999.
*
* @param formula a symbolic description of the model to be fitted.
* @param data the data frame of the explanatory and response variables.
* @param loss loss function for regression. By default, least absolute
* deviation is employed for robust regression.
* @param ntrees the number of iterations (trees).
* @param maxDepth the maximum depth of the tree.
* @param maxNodes the maximum number of leaf nodes in the tree.
* @param nodeSize the minimum size of leaf nodes.
* @param shrinkage the shrinkage parameter in (0, 1] controls the learning rate of procedure.
* @param subsample the sampling fraction for stochastic tree boosting.
*
* @return Gradient boosted trees.
*/
fun gbm(formula: Formula, data: DataFrame, loss: Loss = Loss.lad(), ntrees: Int = 500,
maxDepth: Int = 20, maxNodes: Int = 6, nodeSize: Int = 5, shrinkage: Double = 0.05,
subsample: Double = 0.7): GradientTreeBoost {
return GradientTreeBoost.fit(formula, data, loss, ntrees, maxDepth, maxNodes, nodeSize, shrinkage, subsample)
}
/**
* Gaussian Process for Regression. A Gaussian process is a stochastic process
* whose realizations consist of random values associated with every point in
* a range of times (or of space) such that each such random variable has
* a normal distribution. Moreover, every finite collection of those random
* variables has a multivariate normal distribution.
*
* A Gaussian process can be used as a prior probability distribution over
* functions in Bayesian inference. Given any set of N points in the desired
* domain of your functions, take a multivariate Gaussian whose covariance
* matrix parameter is the Gram matrix of N points with some desired kernel,
* and sample from that Gaussian. Inference of continuous values with a
* Gaussian process prior is known as Gaussian process regression.
*
* The fitting is performed in the reproducing kernel Hilbert space with
* the "kernel trick". The loss function is squared-error. This also arises
* as the kriging estimate of a Gaussian random field in spatial statistics.
*
* A significant problem with Gaussian process prediction is that it typically
* scales as O(n3). For large problems (e.g. n > 10,000) both
* storing the Gram matrix and solving the associated linear systems are
* prohibitive on modern workstations. An extensive range of proposals have
* been suggested to deal with this problem. A popular approach is the
* reduced-rank Approximations of the Gram Matrix, known as Nystrom approximation.
* Greedy approximation is another popular approach that uses an active set of
* training points of size m selected from the training set of size n > m.
* We assume that it is impossible to search for the optimal subset of size m
* due to combinatorics. The points in the active set could be selected
* randomly, but in general we might expect better performance if the points
* are selected greedily w.r.t. some criterion. Recently, researchers had
* proposed relaxing the constraint that the inducing variables must be a
* subset of training/test cases, turning the discrete selection problem
* into one of continuous optimization.
*
* This method fits a regular Gaussian process model.
*
* ====References:====
* - Carl Edward Rasmussen and Chris Williams. Gaussian Processes for Machine Learning, 2006.
* - Joaquin Quinonero-candela, Carl Edward Ramussen, Christopher K. I. Williams. Approximation Methods for Gaussian Process Regression. 2007.
* - T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78(9):1484-1487, 1990.
* - Kai Zhang and James T. Kwok. Clustered Nystrom Method for Large Scale Manifold Learning and Dimension Reduction. IEEE Transactions on Neural Networks, 2010.
*
* @param x the training dataset.
* @param y the response variable.
* @param kernel the Mercer kernel.
* @param noise the noise variance, which also works as a regularization parameter.
* @param normalize the option to normalize the response variable.
* @param tol the stopping tolerance for HPO.
* @param maxIter the maximum number of iterations for HPO. No HPO if maxIter <= 0.
*/
fun gpr(x: Array, y: DoubleArray, kernel: MercerKernel, noise: Double, normalize: Boolean = true, tol: Double = 1E-5, maxIter: Int = 0): GaussianProcessRegression {
return GaussianProcessRegression.fit(x, y, kernel, noise, normalize, tol, maxIter)
}
/** Gaussian Process for Regression. */
object gpr {
/**
* Fits an approximate Gaussian process model with a subset of regressors.
*
* @param x the training dataset.
* @param y the response variable.
* @param t the inducing input, which are pre-selected or inducing samples
* acting as active set of regressors. In simple case, these can be chosen
* randomly from the training set or as the centers of k-means clustering.
* @param kernel the Mercer kernel.
* @param noise the noise variance, which also works as a regularization parameter.
* @param normalize the option to normalize the response variable.
*/
fun approx(x: Array, y: DoubleArray, t: Array, kernel: MercerKernel, noise: Double, normalize: Boolean = true): GaussianProcessRegression {
return GaussianProcessRegression.fit(x, y, t, kernel, noise, normalize)
}
/**
* Fits an approximate Gaussian process model with Nystrom approximation of kernel matrix.
*
* @param x the training dataset.
* @param y the response variable.
* @param t the inducing input, which are pre-selected or inducing samples
* acting as active set of regressors. In simple case, these can be chosen
* randomly from the training set or as the centers of k-means clustering.
* @param kernel the Mercer kernel.
* @param noise the noise variance, which also works as a regularization parameter.
* @param normalize the option to normalize the response variable.
*/
fun nystrom(x: Array, y: DoubleArray, t: Array, kernel: MercerKernel, noise: Double, normalize: Boolean = true): GaussianProcessRegression {
return GaussianProcessRegression.nystrom(x, y, t, kernel, noise, normalize)
}
}
/**
* Radial basis function networks. A radial basis function network is an
* artificial neural network that uses radial basis functions as activation
* functions. It is a linear combination of radial basis functions. They are
* used in function approximation, time series prediction, and control.
*
* In its basic form, radial basis function network is in the form
*
* y(x) = Σ wi φ(||x-ci||)
*
* where the approximating function y(x) is represented as a sum of N radial
* basis functions φ, each associated with a different center ci,
* and weighted by an appropriate coefficient wi. For distance,
* one usually chooses Euclidean distance. The weights wi can
* be estimated using the matrix methods of linear least squares, because
* the approximating function is linear in the weights.
*
* The centers ci can be randomly selected from training data,
* or learned by some clustering method (e.g. k-means), or learned together
* with weight parameters undergo a supervised learning processing
* (e.g. error-correction learning).
*
* The popular choices for φ comprise the Gaussian function and the so
* called thin plate splines. The advantage of the thin plate splines is that
* their conditioning is invariant under scalings. Gaussian, multi-quadric
* and inverse multi-quadric are infinitely smooth and and involve a scale
* or shape parameter, r0 > 0. Decreasing
* r0 tends to flatten the basis function. For a
* given function, the quality of approximation may strongly depend on this
* parameter. In particular, increasing r0 has the
* effect of better conditioning (the separation distance of the scaled points
* increases).
*
* A variant on RBF networks is normalized radial basis function (NRBF)
* networks, in which we require the sum of the basis functions to be unity.
* NRBF arises more naturally from a Bayesian statistical perspective. However,
* there is no evidence that either the NRBF method is consistently superior
* to the RBF method, or vice versa.
*
* SVMs with Gaussian kernel have similar structure as RBF networks with
* Gaussian radial basis functions. However, the SVM approach "automatically"
* solves the network complexity problem since the size of the hidden layer
* is obtained as the result of the QP procedure. Hidden neurons and
* support vectors correspond to each other, so the center problems of
* the RBF network is also solved, as the support vectors serve as the
* basis function centers. It was reported that with similar number of support
* vectors/centers, SVM shows better generalization performance than RBF
* network when the training data size is relatively small. On the other hand,
* RBF network gives better generalization performance than SVM on large
* training data.
*
* ====References:====
* - Simon Haykin. Neural Networks: A Comprehensive Foundation (2nd edition). 1999.
* - T. Poggio and F. Girosi. Networks for approximation and learning. Proc. IEEE 78(9):1484-1487, 1990.
* - Nabil Benoudjit and Michel Verleysen. On the kernel widths in radial-basis function networks. Neural Process, 2003.
*
* @param x training samples.
* @param y response variable.
* @param neurons the radial basis functions.
* @param normalized train a normalized RBF network or not.
*/
fun rbfnet(x: Array, y: DoubleArray, neurons: Array>, normalized: Boolean = false): RBFNetwork {
return RBFNetwork.fit(x, y, neurons, normalized)
}
/** Trains a Gaussian RBF network with k-means. */
fun rbfnet(x: Array, y: DoubleArray, k: Int, normalized: Boolean = false): RBFNetwork {
val neurons = RBF.fit(x, k)
return RBFNetwork.fit(x, y, neurons, normalized)
}
© 2015 - 2025 Weber Informatics LLC | Privacy Policy