- Overview
- Statistics and Machine Learning
- The Impact of Big Data
- Supervised and Unsupervised Learning
- Linear Models and Linear Regression
- Generalized Linear Models
- Generalized Additive Models
- Logistic Regression
- Enhanced Regression
- Survival Analysis
- Decision Tree Learning
- Bayesian Methods
- Neural Networks and Deep Learning
- Support Vector Machines
- Ensemble Learning
- Automated Learning
- Summary
Automated Learning
Can you automate predictive modeling? The answer depends on the context. Consider two variations of the question, each worded more precisely:
- Can you eliminate the need for expertise in predictive modeling—so that an “ordinary business user” can do it?
- Can you make expert analysts more productive by automating certain repetitive tasks?
The first form of the question, the search for “business user” analytics, is a common vision among software marketers and industry analysts; it rests on the premise that expert analysts are the key bottleneck limiting enterprise adoption of predictive analytics. That premise is largely false, as anyone with even a cursory understanding of how most organizations practice predictive analytics can attest. The answer to the first question is no: you cannot eliminate the need for human expertise in predictive modeling, for the same reason that robotic surgery does not eliminate the need for surgeons.
However, if you focus on the second form of the question and concentrate on making expert analysts more productive, the situation is much more promising. Many data preparation tasks are easy to automate, such as detecting and eliminating zero-variance columns, imputing missing values, and handling outliers. The most promising area for automation, though, is model testing and assessment.
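For illustration, here is a minimal Python sketch of this kind of automated cleanup, assuming a pandas DataFrame as input; the three-sigma outlier cutoff and the imputation rules are arbitrary choices for the sketch, not a standard:

```python
import numpy as np
import pandas as pd

def auto_prepare(df: pd.DataFrame, z_cutoff: float = 3.0) -> pd.DataFrame:
    """Automate three routine data preparation tasks."""
    df = df.copy()

    # 1. Drop zero-variance columns -- they carry no predictive signal.
    df = df.loc[:, df.nunique(dropna=False) > 1]

    # 2. Impute missing values: median for numeric, mode for categorical.
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())
        else:
            df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 3. Winsorize numeric outliers beyond z_cutoff standard deviations.
    for col in df.select_dtypes(include=np.number).columns:
        mu, sigma = df[col].mean(), df[col].std()
        if sigma > 0:
            df[col] = df[col].clip(mu - z_cutoff * sigma, mu + z_cutoff * sigma)

    return df

raw = pd.DataFrame({"a": [1.0, 2.0, None, 100.0],
                    "b": ["x", None, "x", "y"],
                    "c": [7, 7, 7, 7]})  # zero variance: dropped
print(auto_prepare(raw))
```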
Optimizing a predictive model requires experimentation and tuning. For any given problem, there are many available modeling techniques, and for each technique, there are many ways to specify and parameterize a model. For the most part, trial and error is the only way to identify the best model for a given problem and data set. (The No Free Lunch Theorem[8] formalizes this concept.)
Because the best predictive model depends on the problem and the data, the analyst must search a very large set of feasible options to find the best model. In applied predictive analytics, however, the analyst’s time is strictly limited: one customer in the marketing services industry, for example, reports an SLA that gives the analytics team 30 minutes or less to build a predictive model. Constraints like these leave little time for experimentation.
Analysts tend to deal with this problem by settling for suboptimal models, arguing that models need only be “good enough,” or defending the use of one technique above all others. As clients grow more sophisticated, however, these tactics become ineffective. In high-stakes hard-money analytics—such as trading algorithms, catastrophic risk analysis, and fraud detection—small improvements in model accuracy have a bottom-line impact, and clients demand the best possible predictions.
Automated modeling techniques are not new. Before Unica launched its successful suite of marketing automation software, the company’s primary business was analytic software, with a particular focus on neural networks. In 1995, Unica introduced Pattern Recognition Workbench (PRW), a software package that used automated trial and error to optimize a predictive model. Three years later, Unica partnered with Group 1 Software (now owned by Pitney Bowes) to market Model 1, a tool that automated model selection across four types of predictive models. After several rebrandings, the original PRW product survives as IBM PredictiveInsight, a set of wizards sold as part of IBM’s Enterprise Marketing Management suite.
Two other commercial attempts at automated predictive modeling date from the late 1990s. The first, MarketSwitch, was less than successful. MarketSwitch developed and sold a solution for marketing offer optimization, which included an embedded “automated” predictive modeling capability (“developed by Russian rocket scientists”); in sales presentations, MarketSwitch promised customers its software would allow them to “fire their SAS programmers.” Experian acquired MarketSwitch in 2004, repositioned the product as a decision engine, and replaced the “automated modeling” capability with outsourced analytic services.
KXEN, a company founded in France in 1998, built its analytics engine around an automated model selection technique called structural risk minimization. The original product had a rudimentary user interface, depending instead on API calls from partner applications; more recently, KXEN repositioned itself as an easy-to-use solution for marketing analytics, which it attempted to sell directly to C-level executives. This effort was modestly successful, leading to the sale of the company in 2013 to SAP for an estimated $40 million.
In the past several years, the leading analytic software vendors (SAS and IBM SPSS) have added automated modeling features to their high-end products. In 2010, SAS introduced Rapid Predictive Modeler, an add-in that runs on top of SAS Enterprise Miner. Rapid Predictive Modeler is a set of macros implementing heuristics that handle tasks such as outlier identification, missing value treatment, variable selection, and model selection. The user specifies a data set and response measure; the tool determines whether the response is continuous or categorical and uses this information, together with other diagnostics, to test a range of modeling techniques. The user can control the scope of techniques tested by selecting basic, intermediate, or advanced methods.
IBM SPSS Modeler includes a set of automated data preparation features as well as Auto Classifier, Auto Cluster, and Auto Numeric nodes. The automated data preparation features perform such tasks as missing value imputation, outlier handling, date and time preparation, basic value screening, binning, and variable recasting. The three modeling nodes enable the user to specify techniques to be included in the test plan, specify model selection rules, and set limits on model training.
All of the software discussed so far is commercially licensed. Two open source projects are worth noting: the caret package for R and the MLBase project. The caret package includes a suite of productivity tools designed to accelerate model specification and tuning for a wide range of techniques. The package includes preprocessing tools to support tasks such as dummy coding, detecting zero-variance predictors, and identifying correlated predictors, as well as tools to support model training and tuning. The training function in caret currently supports 149 different modeling techniques; it optimizes parameters within a selected technique but does not optimize across techniques. To implement a test plan with multiple modeling techniques, the user must write an R script to run the required training tasks and capture the results.
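caret itself is an R package, but the same test-plan pattern is easy to sketch in Python with scikit-learn: loop over candidate techniques, tune each with cross-validated grid search, and pick the winner. The candidate list and tuning grids below are illustrative, not a recommendation:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Candidate techniques, each with its own (illustrative) tuning grid.
candidates = {
    "logistic": (LogisticRegression(max_iter=1000), {"C": [0.01, 0.1, 1, 10]}),
    "forest": (RandomForestClassifier(random_state=0),
               {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}),
}

results = {}
for name, (model, grid) in candidates.items():
    # Tune within each technique via cross-validated grid search...
    search = GridSearchCV(model, grid, cv=5, scoring="roc_auc").fit(X, y)
    results[name] = (search.best_score_, search.best_params_)

# ...then select across techniques by comparing the tuned scores.
best = max(results, key=lambda k: results[k][0])
print(best, results[best])
```

An automated platform would generate and execute this loop itself rather than leaving it to the analyst.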
MLBase, a joint project of the UC Berkeley AMPLab and the Brown University Data Management Research Group, is an ambitious effort to develop a scalable machine learning platform on Apache Spark. Its ML Optimizer seeks to simplify machine learning for end users by automating the model selection task, so that the user need only specify a response variable and a set of predictors. The Optimizer is still in active development, with an alpha release expected in 2014.
What can you learn from these various attempts to implement automated predictive modeling? Commercial startups like KXEN and MarketSwitch achieved only marginal success, in part because they oversold the concept as a means to replace the analyst altogether. Most organizations understand that human judgment plays a key role in analytics, and they are not willing to entrust hard-money analytics entirely to a black box.
What will the next generation of automated modeling platforms look like? Seven features are critical:
- Automated model-dependent data transformations
- Optimization across and within techniques
- Intelligent heuristics to limit the scope of the search
- Iterative bootstrapping to expedite search
- Massively parallel design
- Platform agnostic design
- Custom algorithms
Some methods require specific data transformations: neural nets, for example, typically work best with standardized predictors, whereas Naïve Bayes and CHAID require categorical predictors. The analyst should not have to perform these operations manually; instead, the modeling algorithm should build the required transformations into the test plan script and run them automatically. This ensures that the maximum number of techniques can be evaluated for any given data set.
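As a sketch of how transformations can travel with the technique, the scikit-learn pipelines below bundle each model with the preprocessing it depends on. The pairings follow the examples in the text; CHAID has no scikit-learn implementation, so a Bernoulli Naïve Bayes on binned predictors stands in for the categorical-input case:

```python
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import KBinsDiscretizer, StandardScaler

# Each candidate bundles the transformations it depends on, so the test
# plan can apply them automatically instead of relying on the analyst.
candidates = {
    # Neural nets train poorly on unscaled inputs: standardize first.
    "neural_net": make_pipeline(StandardScaler(),
                                MLPClassifier(max_iter=500, random_state=0)),
    # This Naive Bayes variant expects discrete inputs: bin and encode first.
    "naive_bayes": make_pipeline(
        KBinsDiscretizer(n_bins=5, encode="onehot-dense", strategy="quantile"),
        BernoulliNB()),
}

X, y = make_classification(n_samples=300, random_state=0)
for name, pipeline in candidates.items():
    print(name, pipeline.fit(X, y).score(X, y))
```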
To find the best predictive model, you need to search across techniques and tune parameters within each technique. Potentially, this means a massive number of model train-and-test cycles; heuristics based on characteristics of the response measure and the predictors can limit the scope of techniques evaluated. (For example, a categorical response rules out a number of techniques, and a continuous response rules out a different set.) And instead of a brute-force search for the best technique and parameterization, a “bootstrapping” approach can use information from early iterations to specify subsequent tests.
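A hedged sketch of both ideas in Python: a response-type heuristic prunes the candidate list up front, and scikit-learn’s successive halving search (one concrete form of the bootstrapping idea) spends small training budgets on many configurations before reallocating resources to the survivors. The 20-level threshold and the tuning grid are assumptions for illustration:

```python
import pandas as pd
# Successive halving is experimental in scikit-learn and must be
# enabled explicitly before it can be imported.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              GradientBoostingRegressor)
from sklearn.model_selection import HalvingGridSearchCV

def shortlist(y: pd.Series):
    # Heuristic: the response measure prunes the search space up front.
    # A numeric response with many distinct levels is treated as continuous.
    if pd.api.types.is_numeric_dtype(y) and y.nunique() > 20:
        return GradientBoostingRegressor(random_state=0), "r2"
    return GradientBoostingClassifier(random_state=0), "roc_auc"

def tune(X, y):
    model, metric = shortlist(y)
    grid = {"max_depth": [2, 3, 5], "learning_rate": [0.03, 0.1, 0.3]}
    # Early rounds train every candidate on a small sample; only the
    # best third survives each round and earns a larger budget.
    search = HalvingGridSearchCV(model, grid, factor=3, scoring=metric, cv=5)
    return search.fit(X, y)

X, y = make_classification(n_samples=1000, random_state=0)
print(tune(X, pd.Series(y)).best_params_)
```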
Even with heuristics and bootstrapping, a comprehensive experimental design may require thousands of model train-and-test cycles; this is a natural application for massively parallel computing. Moreover, the highly variable workload inherent in the development phase of predictive analytics is a natural fit for cloud computing (a point that deserves yet another blog post of its own). The next generation of automated predictive modeling will be in the cloud from its inception.
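Because each train-and-test cycle is independent, the workload parallelizes trivially. A toy sketch with Python’s joblib shows the pattern on local cores; a cloud deployment would distribute the same loop across many machines rather than cores:

```python
from joblib import Parallel, delayed
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, random_state=0)
candidates = [LogisticRegression(max_iter=1000),
              DecisionTreeClassifier(random_state=0),
              RandomForestClassifier(random_state=0)]

def evaluate(model):
    # One independent train-and-test cycle; nothing is shared between
    # cycles, which is what makes the workload embarrassingly parallel.
    return type(model).__name__, cross_val_score(model, X, y, cv=5).mean()

# n_jobs=-1 fans the cycles across all local cores; the same pattern
# scales out to many machines under a cluster scheduler in the cloud.
scores = Parallel(n_jobs=-1)(delayed(evaluate)(m) for m in candidates)
print(sorted(scores, key=lambda s: -s[1]))
```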
Ideally, the model automation wrapper should be agnostic to specific implementations of machine learning techniques; the user should be able to optimize across software brands and versions. Realistically, commercial vendors such as SAS and IBM will never permit their software to run under an optimizer they do not own; as a practical matter, then, you should assume that the next-generation predictive modeling platform will work with open source machine learning libraries, such as those available for R or Python.
You cannot eliminate the need for human expertise from predictive modeling, but you can build tools that enable analysts to build better models.