Ensemble learning is the term we use for a family of techniques that generate many predictive models and combine them into an aggregate model whose predictive power is better than that of the individual models used alone. These techniques are computationally intensive and tend to require large amounts of data. The growth in available computing power has made ensemble learning, first introduced in the 1980s, accessible to mainstream users. We describe the principal techniques in this category below.
Boosting
Boosting is a class of iterative techniques that seek to minimize overall error by adding models that are trained on the errors from previous iterations. Among the many boosting methods, the most popular are AdaBoost, gradient boosting, and stochastic gradient boosting.
Introduced by Freund and Schapire in 1995, AdaBoost (Adaptive Boosting) is one of the most popular methods for ensemble learning. The AdaBoost meta-algorithm operates iteratively, using information about incorrectly classified cases to develop a strong aggregate model. With each pass, AdaBoost fits a simple classification rule, gives that rule a weight that reflects its accuracy, and increases the weight of the cases it misclassified so that the next pass concentrates on them.
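The iteration described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: it assumes labels coded as +1 and -1, a single numeric predictor, and one-split threshold rules ("stumps") as the weak learners. The names `fit_stump`, `adaboost`, and `predict` are hypothetical helpers introduced for this example.

```python
import math

def fit_stump(xs, ys, weights):
    """Return (weighted_error, threshold, polarity) for the best one-split rule.

    The rule predicts `polarity` when x >= threshold and `-polarity` otherwise.
    """
    best = None
    for threshold in sorted(set(xs)):
        for polarity in (1, -1):
            preds = [polarity if x >= threshold else -polarity for x in xs]
            err = sum(w for w, p, y in zip(weights, preds, ys) if p != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds=3):
    """Toy AdaBoost: returns a list of (rule_weight, threshold, polarity)."""
    n = len(xs)
    weights = [1.0 / n] * n          # start with equal case weights
    ensemble = []
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-10), 1 - 1e-10)       # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)     # more accurate rules get more say
        ensemble.append((alpha, threshold, polarity))
        # Raise the weight of misclassified cases, lower the rest, then renormalize.
        preds = [polarity if x >= threshold else -polarity for x in xs]
        weights = [w * math.exp(-alpha * y * p) for w, y, p in zip(weights, ys, preds)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return ensemble

def predict(ensemble, x):
    """Weighted vote of all rules in the ensemble."""
    score = sum(a * (pol if x >= thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Toy data: no single threshold separates the classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, 1]
ensemble = adaboost(xs, ys, rounds=3)
training_errors = sum(predict(ensemble, x) != y for x, y in zip(xs, ys))
```

On this toy data a single stump misclassifies three cases, while the weighted vote of three stumps classifies every training case correctly, which is the sense in which weak rules combine into a strong aggregate model.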
MathWorks offers a commercial implementation of AdaBoost (part of the Statistics Toolbox). Many open source versions are also available, including implementations in C++, C#, Java, Python, and R.
Jerome H. Friedman introduced gradient boosting and a variant, stochastic gradient boosting, in 1999. Like other boosting techniques, gradient boosting works with any base algorithm; however, it works best with relatively simple base models and is most widely used with decision tree learning. Gradient boosting works in a manner similar to AdaBoost, but instead of reweighting cases it fits each new model to the negative gradient of a chosen loss function, so the analyst can use a different measure to determine the cost of errors.
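As a rough sketch of the idea, the following plain-Python example boosts one-split regression trees under squared-error loss, where the negative gradient is simply the residual (actual minus predicted). The loss function, the learning rate, the toy data, and the helper names `fit_stump`, `gradient_boost`, `predict`, and `mse` are all assumptions made for illustration.

```python
def fit_stump(xs, targets):
    """Least-squares one-split regression tree: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:      # skip the minimum so both sides are non-empty
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((t - left_mean) ** 2 for t in left)
               + sum((t - right_mean) ** 2 for t in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def gradient_boost(xs, ys, rounds=20, learning_rate=0.5):
    """Return (base_prediction, learning_rate, stumps) fitted by gradient boosting."""
    base = sum(ys) / len(ys)                   # initial model: the mean
    preds = [base] * len(xs)
    stumps = []
    for _ in range(rounds):
        # For squared-error loss the negative gradient is the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        threshold, left_mean, right_mean = fit_stump(xs, residuals)
        stumps.append((threshold, left_mean, right_mean))
        preds = [p + learning_rate * (left_mean if x < threshold else right_mean)
                 for p, x in zip(preds, xs)]
    return base, learning_rate, stumps

def predict(model, x):
    """Sum the base prediction and the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    return base + sum(learning_rate * (lm if x < thr else rm) for thr, lm, rm in stumps)

def mse(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

xs = list(range(10))
ys = [x * x for x in xs]
model = gradient_boost(xs, ys)
baseline_mse = mse(ys, [sum(ys) / len(ys)] * len(ys))
boosted_mse = mse(ys, [predict(model, x) for x in xs])
```

Swapping in a different loss function changes only the line that computes `residuals`; the rest of the loop stays the same, which is what makes the approach general.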
Stochastic gradient boosting combines gradient boosting with random subsampling (similar to bagging): each iteration fits its model to a random subsample of the training data. In addition to improving model accuracy, this enhancement lets the analyst use the cases left out of each iteration to estimate model performance outside the training sample. Stochastic gradient boosting is similar to random forests because both methods train a large number of decision tree models. The difference between the two is that the stochastic gradient boosting algorithm uses information about prediction errors to guide the creation of each additional tree, whereas the random forests algorithm grows each tree independently from a random sample of cases and predictors.
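The subsampling step can be illustrated with a small change to an ordinary gradient boosting loop. The sketch below is an assumption-laden toy, not Friedman's published procedure: it uses squared-error loss, one-split regression trees, a subsample fraction of one half drawn without replacement, and a fixed random seed. The names `fit_stump` and `stochastic_gradient_boost` are hypothetical.

```python
import random

def fit_stump(xs, targets):
    """Least-squares one-split regression tree: (threshold, left_mean, right_mean)."""
    best = None
    for threshold in sorted(set(xs))[1:]:
        left = [t for x, t in zip(xs, targets) if x < threshold]
        right = [t for x, t in zip(xs, targets) if x >= threshold]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((t - lm) ** 2 for t in left) + sum((t - rm) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, lm, rm)
    return best[1:]

def stochastic_gradient_boost(xs, ys, rounds=50, learning_rate=0.3,
                              fraction=0.5, seed=0):
    """Return (training predictions, per-round error on the held-out cases)."""
    rng = random.Random(seed)
    n = len(xs)
    preds = [sum(ys) / n] * n
    held_out_mse = []
    for _ in range(rounds):
        # Draw this round's random subsample (without replacement).
        chosen = sorted(rng.sample(range(n), int(fraction * n)))
        residuals = [y - p for y, p in zip(ys, preds)]
        thr, lm, rm = fit_stump([xs[i] for i in chosen],
                                [residuals[i] for i in chosen])
        preds = [p + learning_rate * (lm if x < thr else rm)
                 for p, x in zip(preds, xs)]
        # Cases not used to fit this round's tree give an out-of-sample check.
        rest = [i for i in range(n) if i not in chosen]
        held_out_mse.append(sum((ys[i] - preds[i]) ** 2 for i in rest) / len(rest))
    return preds, held_out_mse

xs = list(range(20))
ys = [x * x for x in xs]
preds, held_out_mse = stochastic_gradient_boost(xs, ys)
baseline_mse = sum((y - sum(ys) / len(ys)) ** 2 for y in ys) / len(ys)
final_mse = sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)
```

The `held_out_mse` list is the point of the exercise: because each tree sees only part of the data, the error on the remaining cases gives a running indication of how the model behaves on data it was not fitted to in that round.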
Salford Systems offers a commercial version of stochastic gradient boosting branded as TreeNet; StatSoft Data Miner supports a similar capability. Open source versions include implementations in C++ and Weka, as well as multiple packages in R.
Bootstrap Aggregation (Bagging)
Bagging is a meta-algorithm proposed by Breiman in 1996. The bagging algorithm draws multiple bootstrap samples (random samples taken with replacement) from an original training data set, builds a model for each sample, and then combines the models through averaging (for regression) or through a voting procedure (for classification).
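A minimal sketch of the procedure for classification follows, using only the Python standard library. The base model (a single threshold rule), the toy data with one deliberately mislabeled case, the number of models, and the helper names `fit_threshold`, `bagging`, and `bagged_predict` are assumptions chosen for illustration.

```python
import random
from collections import Counter

def fit_threshold(sample):
    """Base model: the rule 'predict 1 if x >= t else 0' with the fewest errors."""
    best = None
    for t in sorted({x for x, _ in sample}):
        errors = sum((x >= t) != bool(y) for x, y in sample)
        if best is None or errors < best[0]:
            best = (errors, t)
    return best[1]

def bagging(data, n_models=25, seed=0):
    """Fit one base model per bootstrap sample (drawn with replacement)."""
    rng = random.Random(seed)
    return [fit_threshold(rng.choices(data, k=len(data))) for _ in range(n_models)]

def bagged_predict(thresholds, x):
    """Classification: majority vote. (For regression, average the outputs instead.)"""
    votes = [int(x >= t) for t in thresholds]
    return Counter(votes).most_common(1)[0][0]

# Toy data: class 1 when x >= 10, plus one mislabeled case at x = 3.
data = [(x, int(x >= 10)) for x in range(20)]
data[3] = (3, 1)
thresholds = bagging(data)
low, high = bagged_predict(thresholds, 0), bagged_predict(thresholds, 19)
```

An odd number of models is used here so that a two-class vote cannot tie.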
The principal advantage of bagging is its ability to build more stable models; its main disadvantages are its computational cost and its requirement for larger data sets. The growth of high-performance computing mitigates these disadvantages.
Random Forests
Random forests is an ensemble learning method for classification and regression based on articles published by Ho [5] and by Amit and Geman [6], further developed by Breiman and Cutler [7], and trademarked by Breiman and Cutler as “Random Forests.” The random forests algorithm combines bagging (random selection of subsets from the training data) with a random selection of features, or predictors. The algorithm trains a large number of decision trees from randomly selected subsamples of the training data set and then outputs the class that is the mode of the classes output by the individual trees (for classification) or the mean of their predictions (for regression).
The principal advantage of random forests compared to other ensemble techniques is that its models generalize well outside the training sample. Moreover, random forests produces variable importance measures that are useful for feature selection.
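The combination of bagging and random feature selection can be sketched as follows. This is a deliberately simplified toy: the trees are one-split stumps rather than full decision trees, the data are synthetic, and counting how often each predictor is chosen for a split is only a crude stand-in for the variable importance measures of a real random forests implementation. The names `fit_stump`, `random_forest`, and `forest_predict` are hypothetical.

```python
import random
from collections import Counter

def fit_stump(rows, labels, features):
    """Best rule 'predict 1 if row[f] >= t else 0' over the allowed features."""
    best = None
    for f in features:
        for t in sorted({row[f] for row in rows}):
            errors = sum((row[f] >= t) != bool(y) for row, y in zip(rows, labels))
            if best is None or errors < best[0]:
                best = (errors, f, t)
    return best[1], best[2]

def random_forest(rows, labels, n_trees=25, n_features=2, seed=0):
    """Each tree sees a bootstrap sample of cases and a random subset of predictors."""
    rng = random.Random(seed)
    n = len(rows)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]               # bagging step
        features = rng.sample(range(len(rows[0])), n_features)   # random predictors
        forest.append(fit_stump([rows[i] for i in idx],
                                [labels[i] for i in idx], features))
    return forest

def forest_predict(forest, row):
    """Output the mode of the classes predicted by the individual trees."""
    votes = [int(row[f] >= t) for f, t in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy data: predictors 0 and 1 track the class; predictor 2 is scrambled noise.
rows = [[i, 2 * i + 1, (7 * i) % 20] for i in range(20)]
labels = [int(i >= 10) for i in range(20)]
forest = random_forest(rows, labels)
split_counts = Counter(f for f, _ in forest)    # crude importance proxy
low = forest_predict(forest, [0, 1, 0])
high = forest_predict(forest, [19, 39, 13])
```

In this toy setting the noise predictor should rarely or never win a split, so `split_counts` points to the informative predictors, which is the kind of signal an analyst would use for feature selection.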
Salford Systems currently offers software based on the Breiman and Cutler article branded as “Random Forests” (under license from Breiman and Cutler). Open source versions are available in Apache Mahout, C#, Python, and R.