- Overview
- Statistics and Machine Learning
- The Impact of Big Data
- Supervised and Unsupervised Learning
- Linear Models and Linear Regression
- Generalized Linear Models
- Generalized Additive Models
- Logistic Regression
- Enhanced Regression
- Survival Analysis
- Decision Tree Learning
- Bayesian Methods
- Neural Networks and Deep Learning
- Support Vector Machines
- Ensemble Learning
- Automated Learning
- Summary

## Supervised and Unsupervised Learning

In Chapter 7, we reviewed a number of analytic use cases, including text and document analytics, clustering, association, and anomaly detection. These use cases differ from the predictive modeling use case because there is no predefined response measure; the analyst seeks to identify patterns but does not seek to predict or explain a specific relationship. These use cases require unsupervised learning techniques.

Unsupervised learning refers to techniques that find patterns in unlabeled data, or data that lacks a defined response measure. Examples of unlabeled data include a bit-mapped photograph, a series of comments from social media, and a battery of psychographic data gathered from a number of subjects. In each case, it may be possible to classify the objects through an external process: For example, you can ask a panel of oncologists to review a set of breast images and classify them as possibly malignant (or not), but the classification is not a part of the raw source data. Unsupervised learning techniques help the analyst identify data-driven patterns that may warrant further investigation.

Supervised learning, on the other hand, includes techniques that require a defined response measure. Not surprisingly, analysts primarily use supervised learning techniques for predictive analytics. However, in the course of a predictive analytics project, analysts may use unsupervised learning techniques to understand the data and to expedite the model-building process. Unsupervised learning techniques frequently used within the predictive modeling process include anomaly detection, graph and network analysis, Bayesian networks, text mining, clustering, and dimension reduction.

### Anomaly Detection

An analyst working on a supermarket chain’s loyalty card spending data noticed an interesting pattern: Some customers appeared to spend exceptionally large amounts. These “supercustomers”—of whom there were no more than several dozen—accounted for a disproportionate percentage of total spending. The analyst was intrigued: Who were these supercustomers? Did it make sense to develop a special program to retain their business (in the same way that casinos target “whales”)?

On deeper investigation—a process that took considerable digging—the analyst discovered that these “supercustomers” were actually store cashiers who swiped their own loyalty cards for customers who did not have a card.

In Chapter 8, “Predictive Analytics Methodology,” we noted that analysts investigate and treat outliers as they develop the analysis data set. They do this for two reasons: First, because outliers can make it very difficult to fit a predictive model to the data at all; and second, because outliers may indicate a problem with the data, as the supermarket analyst learned.

As a rule, the analyst should remove outliers from the analysis data set only when they are artifacts of the data collection process (as is the case in the supermarket example). Investigating outliers can take a considerable amount of time; thus, the analyst needs formal methods to identify anomalies in the data as quickly as possible.

In many cases, simple univariate methods will suffice. For univariate anomaly detection, the analyst runs simple statistics on all numeric variables. The process flags records with values that fall below a defined minimum or above a defined maximum for each variable, and flags records whose values lie more than a defined number of standard deviations from the mean. For categorical variables, the analyst compares the variable values with a list of acceptable values, flagging records with values not included in the list. For example, in a data set that represents customers who reside in the United States, a “State” variable should include only 51 acceptable values (the 50 states plus the District of Columbia); records with any other value in this field require analyst review.
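The univariate checks described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production routine: the function names, the thresholds, the spending figures, and the shortened list of state codes are all hypothetical.

```python
from statistics import mean, stdev

def flag_numeric(values, lower, upper, max_sd):
    """Return the positions of values outside [lower, upper] or more than
    max_sd sample standard deviations from the mean."""
    mu = mean(values)
    sd = stdev(values)
    flagged = []
    for i, v in enumerate(values):
        out_of_range = v < lower or v > upper
        far_from_mean = sd > 0 and abs(v - mu) > max_sd * sd
        if out_of_range or far_from_mean:
            flagged.append(i)
    return flagged

def flag_categorical(values, allowed):
    """Return the positions of values that are not in the list of acceptable values."""
    return [i for i, v in enumerate(values) if v not in allowed]

# Hypothetical weekly spending per loyalty card (illustrative figures only).
weekly_spend = [52.0, 61.5, 48.0, 55.0, 4300.0, 58.5, -5.0, 50.0]

# Hypothetical "State" values; the allowed list is shortened for illustration.
states = ["NY", "CA", "TX", "ZZ", "DC", "ny"]
ALLOWED_STATES = {"NY", "CA", "TX", "DC"}

numeric_flags = flag_numeric(weekly_spend, lower=0.0, upper=1000.0, max_sd=2.0)
categorical_flags = flag_categorical(states, ALLOWED_STATES)
```

Note that a single extreme value inflates the standard deviation, so in small samples the range check often catches anomalies that the standard deviation rule misses. That is one reason to run both checks.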

Univariate methods for anomaly detection may miss some unusual patterns. To take a simple example, consider the case of a person who measures 74 inches tall and weighs 105 pounds. Neither the height nor the weight of this person is exceptional, but the combination of the two is highly unusual. Analysts use multivariate anomaly detection techniques to identify these unusual cases. Multiple techniques are available to the analyst, including clustering techniques (see later in this chapter), one-class support vector machines, and distance-based techniques (such as k-nearest neighbors). These techniques are useful when anomaly detection is the primary goal of the analysis (as is the case for security and fraud applications); however, they are rarely used in the predictive analytics process.
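To make the height and weight example concrete, the sketch below scores each record by the distance to its *k*-th nearest neighbor after standardizing both variables. It is one simple distance-based approach among many. The ten records, the choice of k = 2, and all names are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical (height in inches, weight in pounds) records.
# The last record is the 74-inch, 105-pound person from the text.
people = [
    (60, 105), (62, 120), (64, 130), (66, 145), (68, 150),
    (70, 170), (72, 180), (74, 195), (76, 210), (74, 105),
]

def standardize(column):
    """Convert raw values to z-scores so both variables share a scale."""
    mu, sd = mean(column), stdev(column)
    return [(v - mu) / sd for v in column]

z_height = standardize([h for h, _ in people])
z_weight = standardize([w for _, w in people])
points = list(zip(z_height, z_weight))

def knn_distance(i, k):
    """Euclidean distance from point i to its k-th nearest neighbor."""
    distances = sorted(
        sqrt((points[i][0] - q[0]) ** 2 + (points[i][1] - q[1]) ** 2)
        for j, q in enumerate(points)
        if j != i
    )
    return distances[k - 1]

K = 2  # assumed neighborhood size
scores = [knn_distance(i, K) for i in range(len(points))]

# The record with the largest score is the most isolated in the joint space.
most_unusual = max(range(len(points)), key=scores.__getitem__)
```

In this data the last record has unremarkable z-scores on each variable taken alone, yet it sits far from every other record once height and weight are considered together.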

### Graph and Network Analysis

In Chapter 7, we discussed the graph analysis use case, a form of discovery with proven value in social media analysis, fraud detection, criminology, and national security. Mathematical graphs do not play a direct role in predictive analytics but can play a supporting role in two ways.

First, graphs are very useful in exploratory analysis, where the analyst simply seeks to understand behavior. Bayesian belief networks, discussed next, are a special case of graph analysis, where the nodes of the graph represent variables. However, an analyst can gain valuable insights from other applications of graph analysis, such as social network analysis. In a social graph, the nodes represent persons, and edges represent relationships among persons; using a social graph, a criminologist discovered that most murders in Chicago took place within a very small social network.^{2} This insight can lead the analyst to examine the characteristics that distinguish the high-risk social network and to build a model that predicts homicide risk.

Graph analysis can also contribute features to a predictive model based on a broader set of data. For example, the social distance between a prospective customer and an existing customer—derived from a social graph—could be a strong feature in a model that predicts response to a marketing offer. As another example, the number of social links between an employee and other employees might be a valuable predictor in an employee retention model.
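As a rough sketch of how such features might be derived, the code below computes social distance with a breadth-first search over an undirected graph stored as an adjacency dictionary, and counts social links as a second feature. The graph, the names in it, and the function names are hypothetical.

```python
from collections import deque

def social_distance(graph, start, goal):
    """Number of edges on the shortest path from start to goal,
    or None if the two people are not connected."""
    if start == goal:
        return 0
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor == goal:
                return dist + 1
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, dist + 1))
    return None

def link_count(graph, person):
    """Number of direct social links for one person."""
    return len(graph.get(person, ()))

# Hypothetical social graph: each key maps a person to direct connections.
graph = {
    "ana": {"ben", "cy"},
    "ben": {"ana", "dee"},
    "cy": {"ana"},
    "dee": {"ben", "eli"},
    "eli": {"dee"},
    "fay": set(),
}

# Feature for a prospect ("eli") relative to an existing customer ("ana").
distance_feature = social_distance(graph, "eli", "ana")
isolated_feature = social_distance(graph, "fay", "ana")
ben_links = link_count(graph, "ben")
```

A modeler would still need to decide how to encode unconnected prospects (here `None`), for example with a separate indicator variable.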

### Bayesian Networks

Bayesian inference is a formal system of reasoning that reflects something you do in everyday life: use new information to update your beliefs about the probability of an event. For an example of this kind of reasoning, consider a sales associate at a car dealer who must decide how much time to spend with “walk-in” customers. The sales associate knows from experience that only a very small percentage of these customers will buy a car, but he also knows that if the customer currently owns the brand of car sold at the dealership, the odds of a purchase increase significantly. Using a form of Bayesian inference, the sales associate asks each “walk-in” customer what he or she currently drives and then uses this information to qualify the customer accordingly.
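The sales associate's reasoning can be written out with Bayes' rule. The probabilities below are invented for illustration only; the text does not supply figures.

```python
def posterior(prior, p_evidence_given_event, p_evidence_given_no_event):
    """Bayes' rule: probability of the event after observing the evidence."""
    numerator = prior * p_evidence_given_event
    evidence = numerator + (1.0 - prior) * p_evidence_given_no_event
    return numerator / evidence

# Hypothetical figures, not taken from any real dealership.
P_BUY = 0.05                      # share of walk-ins who buy
P_OWNS_BRAND_GIVEN_BUY = 0.60     # buyers who already own the brand
P_OWNS_BRAND_GIVEN_NO_BUY = 0.10  # non-buyers who already own the brand

p_buy_given_owns_brand = posterior(
    P_BUY, P_OWNS_BRAND_GIVEN_BUY, P_OWNS_BRAND_GIVEN_NO_BUY
)
# With these assumed inputs the probability of a purchase rises
# from 0.05 to about 0.24 once the associate learns the customer
# already owns the brand.
```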

Suppose that you have a great deal of data about an entity, and you want to understand what data is most useful for predicting a particular event. For example, you may be interested in modeling loan defaults in a mortgage portfolio and have copious data about the borrower, mortgaged property, and local economic conditions. Bayesian methods help you identify the information value of each data item so that you can focus attention on the most important predictors.

A Bayesian belief network represents a system of relationships among variables through a mathematical graph (described in the preceding section). A belief network represents variables as nodes in the graph and conditional dependencies as edges, as shown in Exhibit 9.2.

Exhibit 9.2 Bayesian Belief Network

Belief networks are highly interpretable; modeling and visualizing a belief network helps the analyst understand relationships among a large set of variables and form hypotheses about the best ways to model those relationships. The belief network models the system as a whole and does not categorize variables as “predictor” and “response” measures. Hence, it is a valuable tool to explore the data while working with a business stakeholder to define the predictive modeling problem. (We discuss Bayesian methods for predictive modeling later in this chapter.)

Most commercial and open source analytics platforms can construct Bayesian belief networks. Specialist software vendor Bayesia offers a special-purpose software package (BayesiaLab) that is especially well suited to visualization, and offers deeper functionality than is available in general-purpose analytic software.

### Text Mining

As we noted in Chapter 7, text and document analytics can be a distinct use case for analytics, where the goal of the analysis is simply to draw insight from the text itself. An example of this kind of “pure” text analysis is the popular “word cloud”—a diagram that visually represents the relative frequency of words in a text (such as a presidential speech).

The explosion of digital content available through electronic channels creates demand for document analytics, a specialized application of text analytics. Document analysis produces, for example, measures of similarity and dissimilarity, which organizations use to identify duplicate content, detect plagiarism, or filter unwanted content.

In predictive analytics, text mining plays a supplemental role: Analysts seek to enhance models by incorporating information derived from text into a predictive model that may capture other information about the subject. For example, a hospital seeking to predict readmission among discharged patients relied on a battery of quantitative measures such as diagnostic codes, days since first admission, and other characteristics of the treatment; it was able to improve the model by adding predictors derived from practitioners’ notes with text mining. Similarly, an insurance carrier was able to improve its ability to predict customer attrition by capturing data from call center notes.

The most common form of text mining depends on word counting, but the task is more complicated than simply counting the incidence of each unique word. The analyst must first clean and standardize the text by correcting spelling errors; removing common words such as *the, and, or,* and so forth; stemming, or reducing inflected and derived words to their root; and employing other methods that remove noise from the text.

Word counting begins when the text is clean. Two distinctly different methods are in common usage. The simplest method just counts the incidence of each unique word in each document; for example, in the hospital case, the word-counting algorithm counts the incidence of unique words in each patient’s record. The output of this process is a sparse matrix with one column for each distinct word, one row for each document, and values in the cells representing the word count. In its raw form, this matrix is usually far too large to use in a predictive model, so the analyst applies dimension-reduction techniques to reduce the word count matrix to a limited number of uncorrelated dimensions. (See the section on dimension reduction later in this chapter.)
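A minimal sketch of the first method follows. It lowercases the text, drops a short list of common words, and builds a document-by-word count matrix. Spelling correction and stemming are deliberately left out, and the notes, the stop-word list, and all names are hypothetical.

```python
import re
from collections import Counter

# Assumed, deliberately short stop-word list.
STOP_WORDS = {"the", "and", "or", "a", "of", "to", "was", "is", "with"}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]

# Hypothetical practitioners' notes, one entry per document.
notes = {
    "patient_1": "Patient reports chest pain and shortness of breath",
    "patient_2": "Chest pain resolved; patient discharged",
    "patient_3": "Follow-up scheduled; no pain reported",
}

counts = {doc: Counter(tokenize(text)) for doc, text in notes.items()}

# One column per distinct word, one row per document.
vocabulary = sorted(set().union(*counts.values()))
matrix = [[counts[doc][word] for word in vocabulary] for doc in notes]
```

Because stemming is omitted here, “reports” and “reported” land in separate columns, which illustrates why the cleaning steps matter. Even this toy matrix is mostly zeros.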

A second method counts associations rather than words. For example, the algorithm counts how often two words appear together within a sliding window of *n* words, within a sentence or within a paragraph. The output of this process is a “words by words” matrix, to which the analyst applies dimension-reduction techniques. This method can produce insights with relatively small quantities of text, but it requires a scoring process to assign feature values to each record in the raw data.
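The association-counting method can be sketched as follows, using a sliding window of *n* words over a tokenized text. The token list and the function name are hypothetical, and the window is applied to a single stream of tokens rather than to sentences or paragraphs.

```python
from collections import Counter

def cooccurrence_counts(tokens, window):
    """Count unordered pairs of distinct words that appear together
    within a sliding window of `window` consecutive tokens."""
    pairs = Counter()
    for i, left in enumerate(tokens):
        # The next (window - 1) tokens share a window with tokens[i].
        for right in tokens[i + 1 : i + window]:
            if left != right:
                pairs[tuple(sorted((left, right)))] += 1
    return pairs

# Hypothetical cleaned token stream.
tokens = ["chest", "pain", "patient", "chest", "pain", "resolved"]
pairs = cooccurrence_counts(tokens, window=2)
```

Each key of `pairs` identifies one cell of the words-by-words matrix; pairs that never co-occur are simply absent, which keeps the representation sparse.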

### Clustering

As we discussed in Chapter 7, segmentation is one of the most effective and widely used strategic tools available to businesses today. Strategic segmentation is a business practice that depends on an analytic use case (market segmentation or customer segmentation); the use case, in turn, depends on a set of unsupervised learning techniques called *clustering*.

Clustering techniques divide a set of cases into distinct groups that are homogeneous with respect to a set of variables we call the *active variables*. In customer segmentation, each case represents a customer; in market segmentation, each case represents a consumer who may be a current customer, a former customer, or a prospective customer. Of course, you can use clustering techniques in other domains aside from customer and market segmentation.

Although strategic segmentation is a distinct analytic use case, segmentation can also play a tactical role in predictive analytics. As a rule, analysts can improve the overall effectiveness of a predictive model by splitting the population into subgroups, or segments, and modeling separately for each segment. In some cases, the subgroups are logically apparent and easily identified without formal analysis. Suppose, for example, that a credit card issuer wants to build a model that will predict delinquency in the next 12 months. The model likely includes predictors based on the cardholder’s transacting and payment behavior over some finite period (such as the prior 12 months). Cardholders acquired less than 12 months ago will have incomplete data for these predictors; consequently, it may make sense to segment the cardholder base into two groups: those acquired at least 12 months ago and those acquired less than 12 months ago. The analyst then builds separate predictive models for each group of cardholders. (In actual practice, credit card issuers subdivide their portfolios into many such segments for risk modeling based on a range of characteristics, including cardholder tenure, type of card product, country of issue, and so forth.)

The practice described in the preceding paragraph is *a priori segmentation,* where the analyst knows the desired segmentation scheme in advance. When the analyst does not know the optimal segmentation scheme in advance, clustering techniques help the analyst segment the analysis data set into homogeneous groups. A bookstore, for example, might have data about customer spending across a wide range of categories. Running a cluster analysis reveals (hypothetically) five distinct groups of customers:

- High-spending customers who buy in many categories
- High-spending customers who buy fiction only
- Medium-spending customers who buy mostly children’s books
- Medium-spending customers who buy books on military history, sports, and auto repair
- Light-spending customers

This clustering has business value in its own right, but it also enables the analyst to build distinct predictive models for each segment.

You can use many techniques for clustering; the most widely used is k-means clustering, a technique that minimizes the variation from the cluster mean for all active variables. The standard k-means algorithm is iterative and relies on random seed values; the analyst must specify the value of *k,* or the number of clusters. There are many variations on k-means, including alternative computational methods, and a range of enhancements in software implementations; these include capabilities to visualize and interpret the clusters, and “wrappers” that help the analyst determine the optimal number of clusters.
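The sketch below shows the standard iterative algorithm in plain Python: pick *k* random seed points, assign each case to the nearest centroid, recompute the centroids, and repeat until the assignments stop changing. Because the result depends on the random seeds, the example runs several restarts and keeps the solution with the smallest within-cluster sum of squares. The spending data, the number of restarts, and all names are hypothetical.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda c: squared_distance(p, centroids[c]))
        for p in points
    ]

def update(points, labels, centroids):
    """Move each centroid to the mean of the points assigned to it."""
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
        else:
            new_centroids.append(old)  # leave an empty cluster's centroid in place
    return new_centroids

def k_means(points, k, seed, max_iter=100):
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random seed values
    labels = assign(points, centroids)
    for _ in range(max_iter):
        centroids = update(points, labels, centroids)
        new_labels = assign(points, centroids)
        if new_labels == labels:
            break
        labels = new_labels
    inertia = sum(squared_distance(p, centroids[l]) for p, l in zip(points, labels))
    return labels, centroids, inertia

# Hypothetical active variables: (fiction spend, children's book spend).
points = [(5, 40), (8, 35), (6, 45), (9, 38), (60, 4), (55, 8), (65, 5), (58, 9)]

# Ten restarts with different seeds; keep the tightest clustering.
labels, centroids, inertia = min(
    (k_means(points, k=2, seed=s) for s in range(10)), key=lambda result: result[2]
)
```

In practice the active variables usually need to be put on a common scale before clustering, since the distance calculation otherwise favors variables with large ranges.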

K-means clustering is available in most commercial data mining packages (together with other clustering methods). Open source options include the kmeans function in R (among many others) and scikit-learn in Python. To be useful as a segmentation tool, clustering must run on the entire population; hence, leading database platforms such as IBM PureData (Netezza) and Oracle have built-in capability for k-means, and leading in-database libraries support the capability as well. In Hadoop, open source implementations are included in Apache Mahout, Apache Spark, and independent platforms such as H2O.

### Dimension Reduction

Analysts tend to use the words *dimension, feature,* and *predictor variable* interchangeably. Although each term has a precise meaning in academic literature, in this section we treat them as synonymous and address the practical problems posed by data sets with a very large number of predictors.

An in-depth treatment of dimensionality and its impact on the techniques reviewed in this chapter is out of scope for this book. Suffice it to say that high dimensionality complicates predictive modeling in two ways: through added computational complexity and runtime, and through the potential to produce a biased or unstable model. In this context, there is no simple rule that defines “large.” On the one extreme, problems in image recognition or genetics may have millions of potential predictors, but with some methods, analysts encounter issues with as few as a thousand or several hundred predictors.

Analysts use two types of techniques to reduce the number of dimensions in a data set: feature extraction and feature selection. As the name suggests, feature extraction methods synthesize information from many raw variables into a limited number of dimensions, extracting signal from noise. Feature selection methods help the analyst choose from a number of predictors, selecting the best predictors for use in the finished model and ignoring the rest.

The most popular technique for feature extraction is principal component analysis, or PCA. First introduced in 1901, PCA is widely used in the social sciences and marketing research; for example, consumer psychologists use the method to draw insights from large batteries of attitudinal data captured in surveys. PCA uses linear algebra to extract uncorrelated dimensions from the raw data. Although the method is well established and relatively easy to implement, it captures only linear relationships among variables, and the extracted dimensions are fully independent (rather than merely uncorrelated) only when the data are jointly normally distributed, a condition that is often violated in commercial analytics. Variations on PCA include Kernel PCA and Multilinear PCA; there is also a wide range of other advanced methods for feature extraction. Most commercial analytics packages implement PCA; alternatives to PCA are available in open source software.
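To show the linear algebra involved, the sketch below extracts only the first principal component: it centers the data, computes the covariance matrix, and finds the dominant eigenvector by power iteration. Power iteration is used here because it needs no external library; packaged implementations typically rely on other decompositions. The data and all names are hypothetical.

```python
from math import sqrt

def covariance_matrix(rows):
    """Return the sample covariance matrix and the mean-centered rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]
    return cov, centered

def first_principal_component(cov, iterations=200):
    """Dominant eigenvector and eigenvalue of cov, by power iteration."""
    d = len(cov)
    v = [1.0 / sqrt(d)] * d  # arbitrary unit-length starting vector
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    return v, eigenvalue

# Hypothetical data: the second variable is roughly twice the first.
rows = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

cov, centered = covariance_matrix(rows)
component, eigenvalue = first_principal_component(cov)

# Share of total variance captured by the first component.
explained_share = eigenvalue / sum(cov[i][i] for i in range(len(cov)))

# Scores: each record projected onto the component (the extracted feature).
scores = [sum(c[j] * component[j] for j in range(len(component))) for c in centered]
```

Here two correlated raw variables collapse into one extracted dimension that retains nearly all of the variance. The same projection must be applied to new records before scoring.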

Many predictive modeling techniques have built-in feature selection capabilities: The technique automatically evaluates and selects from available predictors. These techniques include tree-based methods (such as CART or C5.0); boosted methods (such as AdaBoost); bootstrap aggregation, or bagging; regularized methods, such as LARS or LASSO; and stepwise methods. When the modeling technique has built-in feature selection, the analyst can omit the feature selection step from the modeling process; this is a key reason to use these methods.

When the analyst does not want to use a technique with built-in feature selection, several options are available. The analyst can run a forward stepwise procedure (see “Stepwise Regression” later in this chapter) with a low threshold for variable inclusion; this will produce a list of candidate predictors, which the analyst can fine-tune in a second step. Another popular method for feature selection is to run regularized random forests (RRF) analysis, which produces a set of nonredundant variables.

Previously in this chapter, we discussed the value of Bayesian belief networks for exploratory analysis. After building a belief network, the analyst can use it for feature selection. Recall that each node in a belief network represents a variable in the analytic data set. For any given target node (the response measure), the *Markov blanket* consists of the node’s parents, its children, and the other parents of its children; given the values of these nodes, the target node is independent of all other nodes in the network. The variables in the Markov blanket are therefore the natural candidate predictors for the response measure.
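Under the standard definition, the Markov blanket of a node comprises its parents, its children, and the other parents of its children. The sketch below reads the blanket directly from a list of directed edges. The small mortgage-default network and its variable names are hypothetical.

```python
def markov_blanket(edges, target):
    """Return the Markov blanket of `target` in a directed acyclic graph
    given as (parent, child) pairs."""
    parents = {p for p, c in edges if c == target}
    children = {c for p, c in edges if p == target}
    co_parents = {p for p, c in edges if c in children and p != target}
    return parents | children | co_parents

# Hypothetical belief network for mortgage default, as (parent, child) edges.
edges = [
    ("age", "income"),
    ("income", "default"),
    ("loan_to_value", "default"),
    ("default", "foreclosure"),
    ("local_unemployment", "foreclosure"),
    ("region", "local_unemployment"),
]

selected_features = markov_blanket(edges, "default")
```

In this illustration, `age` and `region` fall outside the blanket: once the variables inside it are known, they add no further information about the target.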

Whereas feature extraction is more elegant than feature selection and has a long history of academic use, feature selection is the more practical tool. On one hand, feature extraction techniques such as PCA add a step to the scoring process, which must convert raw data to the principal dimensions before computing a score. On the other hand, predictive models based on feature selection techniques work with data as it exists in production (assuming the analyst worked with data in its raw form).