Stanford statistical learning software
This is a collection of R packages written by current and former members
of the labs of Trevor Hastie, Jonathan Taylor, and Rob Tibshirani. All of these packages
are actively supported by their authors.
Lasso, elastic net and regularized modelling
Our most popular package, and the most actively updated and maintained.
Extremely efficient procedures for fitting the entire lasso or elastic-net regularization
path for linear regression, logistic and multinomial regression models, Poisson regression,
and the Cox model. Two recent additions are the multiresponse
Gaussian and the grouped multinomial families. The algorithm uses
cyclical coordinate descent in a pathwise fashion, as described in the paper listed below.
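The coordinate-descent update at the core of this approach is easy to sketch. Below is a minimal NumPy illustration of cyclical coordinate descent for a single lasso problem (an illustrative sketch, not the glmnet implementation, which adds warm starts, screening rules, and the other model families):

```python
import numpy as np

def soft_threshold(z, gamma):
    """Soft-thresholding operator: the closed-form solution of a
    one-dimensional lasso problem."""
    return np.sign(z) * np.maximum(np.abs(z) - gamma, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Cyclical coordinate descent for
    (1/(2n)) ||y - X beta||^2 + lam * ||beta||_1,
    assuming the columns of X are centered and scaled."""
    n, p = X.shape
    beta = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    for _ in range(n_iter):
        for j in range(p):
            # Partial residual: leave coordinate j out of the current fit.
            r = y - X @ beta + X[:, j] * beta[j]
            z = X[:, j] @ r / n
            beta[j] = soft_threshold(z, lam) / col_sq[j]
    return beta
```

Pathwise fitting runs such a solver over a decreasing grid of lam values, warm-starting each fit from the previous solution.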
Least angle regression. Efficient procedures for fitting an entire lasso sequence
at the cost of a single least-squares fit.
Stepwise regression and infinitesimal forward stagewise regression are options as well.
Less efficient than glmnet, but returns the entire continuous path of solutions,
including the knots. The latter are important for inference; see covTest.
A path-following algorithm for L1-regularized generalized
linear models and the Cox proportional hazards model.
Like LARS, less efficient than glmnet, but returns the entire continuous path of solutions,
including the knots.
Maintained by Mee Young Park.
Fit sparse linear regression models via nonconvex optimization.
Sparsenet uses the MC+ penalty of Zhang. It computes the
regularization surface over both the family parameter and the
tuning parameter by coordinate descent.
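For intuition, the univariate thresholding operator induced by the MC+ penalty interpolates between soft and hard thresholding as the family parameter gamma moves from infinity toward 1. A small NumPy sketch of this operator (illustrative only, not the sparsenet code):

```python
import numpy as np

def mcplus_threshold(z, lam, gamma):
    """Univariate MC+ thresholding operator (gamma > 1), the building
    block of each coordinate update. gamma -> infinity recovers the
    lasso's soft threshold; gamma -> 1+ approaches hard thresholding."""
    z = np.asarray(z, dtype=float)
    soft = np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)
    return np.where(np.abs(z) <= gamma * lam,
                    soft / (1.0 - 1.0 / gamma),  # shrink, then re-inflate
                    z)                           # beyond gamma*lam: untouched
```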
Group lasso and sparse group lasso.
Computes the covariance test for significance testing in adaptive linear modelling. Can be used with
LARS (lasso) for linear models, and with the elastic net, binomial, and Cox survival models. This package should
be considered EXPERIMENTAL: the background paper (Lockhart et al. 2013) is not yet published, and rigorous theory
does not yet exist for the logistic and Cox models.
Fused lasso, trend filtering, generalized lasso
This package implements a path algorithm for the Fused
Lasso Signal Approximator. It includes functions for 1D data (signals) and 2D data (images).
Path algorithms for generalized lasso problems, including trend filtering, 2D filtering,
and the fused lasso. Maintained by
A Lasso for Hierarchical Interactions.
Fits sparse interaction models for continuous and binary
responses subject to the strong (or weak) hierarchy restriction
that an interaction between two variables only be included if
both (or at least one of) the variables is included as a main
effect. For more details, see Bien, J., Taylor, J., and Tibshirani, R. (2012), "A Lasso for Hierarchical Interactions", Annals of Statistics.
This package searches for marginal interactions in a
binary response model. Interact uses permutation methods to
estimate false discovery rates for these marginal interactions
and has some limited visualization capabilities.
Graphical lasso: estimation of the edges in an undirected graphical
model (inverse covariance model) using an L1 penalty.
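As a quick illustration of the idea (using scikit-learn's GraphicalLasso estimator as a stand-in for the R package), the L1 penalty zeroes out entries of the estimated precision matrix, and the surviving off-diagonal entries define the graph's edges:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
# True precision matrix with a single off-diagonal edge between
# variables 0 and 1; variable 2 is independent of the others.
prec = np.array([[2.0, 0.9, 0.0],
                 [0.9, 2.0, 0.0],
                 [0.0, 0.0, 2.0]])
X = rng.multivariate_normal(np.zeros(3), np.linalg.inv(prec), size=2000)

est = GraphicalLasso(alpha=0.05).fit(X)
P = est.precision_
# Nonzero off-diagonal entries of P are the estimated edges:
# P[0, 1] stays away from zero, while P[0, 2] and P[1, 2] shrink to ~0.
```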
Sparse SVD, principal components, canonical correlation analysis
Penalized Multivariate Analysis: a penalized
matrix decomposition, sparse principal components analysis, and
sparse canonical correlation analysis.
Implements the sparse clustering methods of Witten and Tibshirani (2010): "A framework for feature selection in clustering"; published in Journal of the American Statistical Association 105(490): 713-726.
Performs minimax linkage hierarchical clustering. Every cluster has an associated prototype element that represents that cluster as described in Bien, J., and Tibshirani, R. (2011), "Hierarchical Clustering with Prototypes via Minimax Linkage," The Journal of the American Statistical Association.
Support vector machines
Path algorithm for Support Vector Machines.
Computes the entire regularization path for the two-class
SVM classifier with essentially the same cost as a single SVM fit.
High-dimensional hypothesis testing and classification, especially for genomics.
Significance analysis for microarrays. This package does significance testing
and estimates FDRs for high-dimensional problems. Can handle a wide variety
of outcome types: two-class and multiclass, quantitative, survival, time course, etc.
This package is the underlying "engine" for the popular SAM Excel add-in.
Prediction analysis for microarrays. Some functions for sample classification in microarrays and other high-dimensional classification problems, using
the nearest shrunken centroid method.
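The nearest shrunken centroid idea is easy to demonstrate; scikit-learn's NearestCentroid implements the same shrinkage device (a toy illustration, not the pamr interface):

```python
import numpy as np
from sklearn.neighbors import NearestCentroid

rng = np.random.default_rng(0)
# Simulated "microarray": 100 samples x 500 genes, where only the
# first 10 genes actually differ between the two classes.
X = rng.standard_normal((100, 500))
y = np.repeat([0, 1], 50)
X[y == 1, :10] += 2.0

# shrink_threshold moves each class centroid toward the overall
# centroid, zeroing out the uninformative genes.
clf = NearestCentroid(shrink_threshold=0.5).fit(X, y)
acc = clf.score(X, y)
```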
Gene set analysis: an alternative approach to gene set enrichment analysis,
due to Efron and Tibshirani (2007), AOAS.
Generalized additive models
Fits generalized additive models. Maintained by
Independent components analysis
Product Density Independent Components Analysis. Estimate ICA components
using the Product Density Maximum Likelihood method due to Hastie and Tibshirani.
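ProDenICA estimates the source densities themselves; as a generic illustration of what any ICA method does, here is a small demo using scikit-learn's FastICA (a different estimator, used here only because it is widely available):

```python
import numpy as np
from sklearn.decomposition import FastICA

rng = np.random.default_rng(0)
t = np.linspace(0, 8, 2000)
# Two non-Gaussian sources: a square wave and a sinusoid.
S = np.c_[np.sign(np.sin(3 * t)), np.sin(5 * t)]
A = np.array([[1.0, 0.5],
              [0.4, 1.0]])          # mixing matrix
X = S @ A.T                         # observed mixtures

ica = FastICA(n_components=2, random_state=0)
S_hat = ica.fit_transform(X)        # recovered sources (up to order/scale)
```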
SoftImpute is a package for matrix completion - i.e. for imputing missing values in matrices.
It uses squared-error loss with nuclear norm regularization - one can think of it as
the "lasso" for matrix approximation - to find a low-rank approximation to the observed entries in the matrix.
This low-rank approximation is then used to impute the missing entries.
softImpute works in a kind of "EM" fashion. Given a current guess, it fills in the missing entries.
Then it computes a soft-thresholded SVD of this complete matrix, which yields the next guess.
These steps are iterated till convergence to the solution of the convex optimization problem.
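The iteration just described can be sketched in a few lines of NumPy (a dense toy version for illustration; the package itself never forms the full matrix, relying instead on the sparse-plus-low-rank machinery described below):

```python
import numpy as np

def soft_impute(X, observed, lam, n_iter=300):
    """Toy dense version of the softImpute iteration.
    observed: boolean mask, True where X is observed.
    lam: nuclear-norm regularization parameter."""
    Z = np.where(observed, X, 0.0)         # initial guess: zeros for missing
    for _ in range(n_iter):
        filled = np.where(observed, X, Z)  # fill missing entries with guess
        U, d, Vt = np.linalg.svd(filled, full_matrices=False)
        d = np.maximum(d - lam, 0.0)       # soft-threshold singular values
        Z = (U * d) @ Vt                   # next (low-rank) guess
    return Z
```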
The algorithm can work with large matrices, such as the "Netflix" matrix (400K x 20K), by making heavy use
of sparse-matrix methods in the Matrix package. It creates new S4 classes such as "Incomplete" for storing the large
data matrix, and "SparseplusLowRank" for representing the completed matrix. SVD computations are done using
a specially built block-alternating algorithm, svd.als, that exploits these structures and uses warm starts.
Some of the methods used are described in
Rahul Mazumder, Trevor Hastie and Rob Tibshirani:
Spectral Regularization Algorithms for Learning Large Incomplete Matrices.
JMLR 2010 11 2287-2322.
Other newer and more efficient methods that inter-weave the alternating block algorithm steps with imputation steps will
be described in a forthcoming article.
Imputation for microarray data and other high-dimensional datasets. Maintained by
Other packages that we like and use
Gradient boosting machines
Support vector machines