Supervised and Unsupervised Learning

In Chapter 7, we reviewed a number of analytic use cases, including text and document analytics, clustering, association, and anomaly detection. These use cases differ from the predictive modeling use case because there is no predefined response measure; the analyst seeks to identify patterns but does not seek to predict or explain a specific relationship. These use cases require unsupervised learning techniques.

Unsupervised learning refers to techniques that find patterns in unlabeled data, or data that lacks a defined response measure. Examples of unlabeled data include a bit-mapped photograph, a series of comments from social media, and a battery of psychographic data gathered from a number of subjects. In each case, it may be possible to classify the objects through an external process: For example, you can ask a panel of oncologists to review a set of breast images and classify them as possibly malignant (or not), but the classification is not a part of the raw source data. Unsupervised learning techniques help the analyst identify data-driven patterns that may warrant further investigation.

Supervised learning, on the other hand, includes techniques that require a defined response measure. Not surprisingly, analysts primarily use supervised learning techniques for predictive analytics. However, in the course of a predictive analytics project, analysts may use unsupervised learning techniques to understand the data and to expedite the model building process. Unsupervised learning techniques frequently used within the predictive modeling process include anomaly detection, graph and network analysis, Bayesian Networks, text mining, clustering, and dimension reduction.

Anomaly Detection

An analyst working on a supermarket chain’s loyalty card spending data noticed an interesting pattern: Some customers appeared to spend exceptionally large amounts. These “supercustomers”—of whom there were no more than several dozen—accounted for a disproportionate percentage of total spending. The analyst was intrigued: Who were these supercustomers? Did it make sense to develop a special program to retain their business (in the same way that casinos target “whales”)?

On deeper investigation—a process that took considerable digging—the analyst discovered that these “supercustomers” were actually store cashiers who swiped their own loyalty cards for customers who did not have a card.

In Chapter 8, “Predictive Analytics Methodology,” we noted that analysts investigate and treat outliers as they develop the analysis data set. They do this for two reasons: First, because outliers can make it very difficult to fit a predictive model to the data at all; and second, because outliers may indicate a problem with the data, as the supermarket analyst learned.

As a rule, the analyst should remove outliers from the analysis data set only when they are artifacts of the data collection process (as is the case in the supermarket example). Investigating outliers can take a considerable amount of time; thus, the analyst needs formal methods to identify anomalies in the data as quickly as possible.

In many cases, simple univariate methods will suffice. For univariate anomaly detection, the analyst runs simple statistics on all numeric variables. The process flags records with values that exceed defined minima or maxima for each variable, and flags records whose values exceed a defined number of standard deviations from the mean. For categorical variables, the analyst compares the variable values with a list of acceptable values, flagging records with values not included in the list. For example, in a data set that represents customers who reside in the United States, a “State” variable should include only 51 acceptable values (the 50 states plus the District of Columbia); records with any other value in this field require analyst review.
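These univariate checks can be sketched in a few lines of Python. The record layout, field names, and thresholds below are illustrative assumptions, not part of the source example:

```python
from statistics import mean, stdev

def flag_univariate(records, field, lo=None, hi=None, max_sd=3.0):
    """Flag records whose numeric `field` falls outside defined bounds
    or lies more than `max_sd` standard deviations from the mean."""
    values = [r[field] for r in records]
    m, s = mean(values), stdev(values)
    flagged = []
    for r in records:
        v = r[field]
        if (lo is not None and v < lo) or (hi is not None and v > hi):
            flagged.append(r)
        elif s > 0 and abs(v - m) / s > max_sd:
            flagged.append(r)
    return flagged

def flag_categorical(records, field, accepted):
    """Flag records whose categorical value is not in the accepted list."""
    return [r for r in records if r[field] not in accepted]
```

Note that with very small samples a single extreme value inflates the standard deviation and can partially mask itself, which is why robust variants (based on medians and interquartile ranges) are sometimes preferred.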

Univariate methods for anomaly detection may miss some unusual patterns. To take a simple example, consider the case of a person who measures 74 inches tall and weighs 105 pounds. Neither the height nor the weight of this person is exceptional, but the combination of the two is highly unusual. Analysts use multivariate anomaly detection techniques to identify these cases. Multiple techniques are available to the analyst, including clustering techniques (see later in this chapter), one-class support vector machines, and distance-based techniques (such as k-nearest neighbors). These techniques are useful when anomaly detection is the primary goal of the analysis (as is the case for security and fraud applications); however, they are rarely used in the predictive analytics process.
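As a sketch of the distance-based approach, the toy detector below scores each case by its mean distance to its k nearest neighbors; unusually isolated combinations of values receive high scores. The height/weight data is illustrative; a production system would scale the variables and use an indexed neighbor search rather than this O(n²) loop:

```python
import math

def knn_anomaly_scores(points, k=3):
    """Score each point by its mean Euclidean distance to its k
    nearest neighbors; larger scores indicate more anomalous
    combinations of values."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(
            math.dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(sum(dists[:k]) / k)
    return scores

# (height in inches, weight in pounds); the last case is the
# 74-inch, 105-pound person from the text
people = [(68, 160), (70, 165), (72, 180), (69, 158), (71, 170), (74, 105)]
scores = knn_anomaly_scores(people)
```

Here neither coordinate of the last point is extreme on its own, but the point is far from all of its neighbors, so it receives the highest score.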

Graph and Network Analysis

In Chapter 7, we discussed the graph analysis use case, a form of discovery with proven value in social media analysis, fraud detection, criminology, and national security. Mathematical graphs do not play a direct role in predictive analytics but can play a supporting role in two ways.

First, graphs are very useful in exploratory analysis, where the analyst simply seeks to understand behavior. Bayesian belief networks, discussed next, are a special case of graph analysis, where the nodes of the graph represent variables. However, an analyst can gain valuable insights from other applications of graph analysis, such as social network analysis. In a social graph, the nodes represent persons, and edges represent relationships among persons; using a social graph, a criminologist discovered that most murders in Chicago took place within a very small social network.2 This insight can lead the analyst to examine the characteristics that distinguish the high-risk social network and to build a model that predicts homicide risk.

Graph analysis can also contribute features to a predictive model based on a broader set of data. For example, the social distance between a prospective customer and an existing customer—derived from a social graph—could be a strong feature in a model that predicts response to a marketing offer. As another example, the number of social links between an employee and other employees might be a valuable predictor in an employee retention model.

Bayesian Networks

Bayesian inference is a formal system of reasoning that reflects something you do in everyday life: use new information to update your beliefs about the probability of an event. For an example of this kind of reasoning, consider a sales associate at a car dealer who must decide how much time to spend with “walk-in” customers. The sales associate knows from experience that only a very small percentage of these customers will buy a car, but he also knows that if the customer currently owns the brand of car sold at the dealership, the odds of a purchase increase significantly. Using a form of Bayesian inference, the sales associate asks each “walk-in” customer what he or she currently drives and then uses this information to qualify the customer accordingly.
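The sales associate's reasoning is just Bayes' rule. With hypothetical numbers (none are from the source): suppose 5% of walk-ins buy, 60% of buyers already own the dealership's brand, and only 10% of non-buyers do. The posterior probability of a purchase, given that the customer owns the brand, rises from 5% to 24%:

```python
def posterior(prior, like_event, like_not_event):
    """Bayes' rule: update P(event) after observing evidence.
    posterior = P(evidence|event) * P(event) / P(evidence)."""
    evidence = like_event * prior + like_not_event * (1 - prior)
    return like_event * prior / evidence

# Hypothetical inputs: 5% of walk-ins buy; 60% of buyers already
# own the brand, versus 10% of non-buyers.
p_buy_given_owner = posterior(0.05, 0.60, 0.10)  # 0.03 / 0.125 = 0.24
```

The evidence does not make a sale likely, but it changes the odds enough to justify spending more time with that customer.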

Suppose that you have a great deal of data about an entity, and you want to understand what data is most useful for predicting a particular event. For example, you may be interested in modeling loan defaults in a mortgage portfolio and have copious data about the borrower, mortgaged property, and local economic conditions. Bayesian methods help you identify the information value of each data item so that you can focus attention on the most important predictors.

A Bayesian belief network represents a system of relationships among variables through a mathematical graph (described in the preceding section). A belief network represents variables as nodes in the graph and conditional dependencies as edges, as shown in Exhibit 9.2.

Exhibit 9.2 Bayesian Belief Network

Belief networks are highly interpretable; modeling and visualizing a belief network helps the analyst understand relationships among a large set of variables and form hypotheses about the best ways to model those relationships. The belief network models the system as a whole and does not categorize variables as “predictor” and “response” measures. Hence, it is a valuable tool to explore the data while working with a business stakeholder to define the predictive modeling problem. (We discuss Bayesian methods for predictive modeling later in this chapter.)

Most commercial and open source analytics platforms can construct Bayesian belief networks. Specialist software vendor Bayesia offers a special-purpose package (BayesiaLab) that is especially well suited to visualization and provides deeper functionality than general-purpose analytic software.

Text Mining

As we noted in Chapter 7, text and document analytics can be a distinct use case for analytics, where the goal of the analysis is simply to draw insight from the text itself. An example of this kind of “pure” text analysis is the popular “word cloud”—a diagram that visually represents the relative frequency of words in a text (such as a presidential speech).

The explosion of digital content available through electronic channels creates demand for document analytics, a specialized application of text analytics. Document analysis produces measures of similarity and dissimilarity, which organizations use, for example, to identify duplicate content, detect plagiarism, or filter unwanted content.

In predictive analytics, text mining plays a supplemental role: Analysts seek to enhance models by incorporating information derived from text into a predictive model that may capture other information about the subject. For example, a hospital seeking to predict readmission among discharged patients relied on a battery of quantitative measures such as diagnostic codes, days since first admission, and other characteristics of the treatment; it was able to improve the model by adding predictors derived from practitioners’ notes with text mining. Similarly, an insurance carrier was able to improve its ability to predict customer attrition by capturing data from call center notes.

The most common form of text mining depends on word counting, but the task is more complicated than simply counting the incidence of each unique word. The analyst must first clean and standardize the text by correcting spelling errors; removing common “stop words” such as the, and, and or; stemming (reducing inflected and derived words to their roots); and applying other methods that remove noise from the text.

Word counting begins when the text is clean. Two distinctly different methods are in common usage. The simplest method just counts the incidence of each unique word in each document; for example, in the hospital case, the word-counting algorithm counts the incidence of unique words in each patient’s record. The output of this process is a sparse matrix with one column for each distinct word, one row for each document, and values in the cells representing the word count. This matrix is impossibly large to use in a predictive model in its raw form, so the analyst applies dimension-reduction techniques to reduce the word count matrix to a limited number of uncorrelated dimensions. (See the section on dimension reduction later in this chapter.)
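A toy version of this word-counting step might look like the following; the stop-word list and documents are illustrative, and a real implementation (e.g., scikit-learn's CountVectorizer) would produce a sparse matrix rather than this dense one:

```python
from collections import Counter

STOP_WORDS = {"the", "and", "or", "a", "of"}

def doc_term_matrix(docs):
    """Build a toy document-term matrix: one row per document,
    one column per distinct non-stop word, word counts in the cells."""
    counts = [
        Counter(w for w in d.lower().split() if w not in STOP_WORDS)
        for d in docs
    ]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[w] for w in vocab] for c in counts]

notes = ["the patient reported pain", "pain and fever"]
vocab, matrix = doc_term_matrix(notes)
```

On a real corpus the vocabulary runs to tens of thousands of columns, which is why the dimension-reduction step described in the text is essential.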

A second method counts associations rather than words. For example, the algorithm counts how often two words appear together within a sliding window of n words, within a sentence or within a paragraph. The output of this process is a “words by words” matrix, to which the analyst applies dimension-reduction techniques. This method can produce insights with relatively small quantities of text, but it requires a scoring process to assign feature values to each record in the raw data.
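A minimal sketch of the sliding-window association count pairs each token with the tokens that follow it within the window; the window size and sample tokens are illustrative:

```python
from collections import Counter

def cooccurrence(tokens, window=3):
    """Count how often two words co-occur within a sliding window:
    each token is paired with the tokens that follow it within
    `window` positions. Returns a symmetric pair -> count mapping."""
    pairs = Counter()
    for i, a in enumerate(tokens):
        for b in tokens[i + 1:i + window]:
            pairs[tuple(sorted((a, b)))] += 1
    return pairs

tokens = "chest pain severe chest pain".split()
pairs = cooccurrence(tokens)
```

The resulting pair counts form the “words by words” matrix described above, to which dimension-reduction techniques are then applied.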


Clustering

As we discussed in Chapter 7, segmentation is one of the most effective and widely used strategic tools available to businesses today. Strategic segmentation is a business practice that depends on an analytic use case (market segmentation or customer segmentation); the use case, in turn, depends on a set of unsupervised learning techniques called clustering.

Clustering techniques divide a set of cases into distinct groups that are homogeneous with respect to a set of variables we call the active variables. In customer segmentation, each case represents a customer; in market segmentation, each case represents a consumer who may be a current customer, a former customer, or a prospective customer. Of course, you can use clustering techniques in other domains aside from customer and market segmentation.

Although strategic segmentation is a distinct analytic use case, segmentation can also play a tactical role in predictive analytics. As a rule, analysts can improve the overall effectiveness of a predictive model by splitting the population into subgroups, or segments, and modeling separately for each segment. In some cases, the subgroups are logically apparent and easily identified without formal analysis. Suppose, for example, that a credit card issuer wants to build a model that will predict delinquency in the next 12 months. The model likely includes predictors based on the cardholder’s transacting and payment behavior over some finite period (such as the prior 12 months). Cardholders acquired less than 12 months ago will have incomplete data for these predictors; consequently, it may make sense to segment the cardholder base into two groups: those acquired at least 12 months ago and those acquired less than 12 months ago. The analyst then builds separate predictive models for each group of cardholders. (In actual practice, credit card issuers subdivide their portfolios into many such segments for risk modeling based on a range of characteristics, including cardholder tenure, type of card product, country of issue, and so forth.)
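The a priori split described above amounts to a simple partition rule; the field name and cutoff below are illustrative:

```python
def split_by_tenure(accounts, cutoff_months=12):
    """A priori segmentation sketch: partition accounts by tenure so
    that a separate delinquency model can be fit to each segment."""
    mature = [a for a in accounts if a["tenure_months"] >= cutoff_months]
    recent = [a for a in accounts if a["tenure_months"] < cutoff_months]
    return mature, recent
```

Each returned segment then gets its own modeling data set and its own fitted model.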

The practice described in the preceding paragraph is a priori segmentation, where the analyst knows the desired segmentation scheme in advance. When the analyst does not know the optimal segmentation scheme in advance, clustering techniques help the analyst segment the analysis data set into homogeneous groups. A bookstore, for example, might have data about customer spending across a wide range of categories. Running a cluster analysis reveals (hypothetically) five distinct groups of customers:

  • High-spending customers who buy in many categories
  • High-spending customers who buy fiction only
  • Medium-spending customers who buy mostly children’s books
  • Medium-spending customers who buy books on military history, sports, and auto repair
  • Light-spending customers

This clustering has business value in its own right, but it also enables the analyst to build distinct predictive models for each segment.

You can use many techniques for clustering; the most widely used is k-means clustering, a technique that minimizes within-cluster variation (the sum of squared distances between each case and its cluster mean) across all active variables. The standard k-means algorithm is iterative and relies on random seed values; the analyst must specify the value of k, or the number of clusters. There are many variations on k-means, including alternative computational methods, and a range of enhancements in software implementations; these include capabilities to visualize and interpret the clusters, and “wrappers” that help the analyst determine the optimal number of clusters.

K-means clustering is available in most commercial data mining packages (together with other clustering methods). Open source options include the kmeans function in R’s stats package (among many other R implementations) and scikit-learn in Python. To be useful as a segmentation tool, clustering must run on the entire population; hence, leading database vendors such as IBM (PureData/Netezza) and Oracle offer built-in k-means capability, and leading in-database libraries support it as well. In Hadoop, open source implementations are available in Apache Mahout, Apache Spark, and independent platforms such as H2O.
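For illustration, here is a minimal pure-Python version of the standard iterative algorithm (random seeds, then alternating assign and update steps), with random restarts because k-means can stall in a local optimum. Real work would use one of the library implementations named above:

```python
import math
import random

def kmeans_once(points, k, rng, iters=20):
    """One k-means run: choose k random seed centroids, then alternate
    between assigning points to the nearest centroid and recomputing
    each centroid as the mean of its assigned points."""
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        centroids = [
            tuple(sum(xs) / len(cl) for xs in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
    return centroids, clusters

def kmeans(points, k, restarts=10, seed=0):
    """Keep the best of several random restarts, scored by
    within-cluster sum of squared distances."""
    rng = random.Random(seed)
    best = None
    for _ in range(restarts):
        centroids, clusters = kmeans_once(points, k, rng)
        sse = sum(
            math.dist(p, c) ** 2
            for c, cl in zip(centroids, clusters) for p in cl
        )
        if best is None or sse < best[0]:
            best = (sse, centroids, clusters)
    return best[1], best[2]
```

On two well-separated groups of points this recovers one centroid per group; choosing k itself is the job of the “wrapper” methods mentioned above.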

Dimension Reduction

Analysts tend to use the words dimension, feature, and predictor variable interchangeably. Although each term has a precise meaning in academic literature, in this section we treat them as synonymous and address the practical problems posed by data sets with a very large number of predictors.

An in-depth treatment of dimensionality and its impact on the techniques reviewed in this chapter is out of scope for this book. Suffice it to say that high dimensionality complicates predictive modeling in two ways: through added computational complexity and runtime, and through the potential to produce a biased or unstable model. In this context, there is no simple rule that defines “large.” At one extreme, problems in image recognition or genetics may have millions of potential predictors, but with some methods, analysts encounter issues with as few as several hundred or a thousand predictors.

Analysts use two types of techniques to reduce the number of dimensions in a data set: feature extraction and feature selection. As the name suggests, feature extraction methods synthesize information from many raw variables into a limited number of dimensions, extracting signal from noise. Feature selection methods help the analyst choose from a number of predictors, selecting the best predictors for use in the finished model and ignoring the rest.

The most popular technique for feature extraction is principal component analysis, or PCA. First introduced in 1901, PCA is widely used in the social sciences and marketing research; for example, consumer psychologists use the method to draw insights from large batteries of attitudinal data captured in surveys. PCA uses linear algebra to extract uncorrelated dimensions from the raw data. Although the method is well established and relatively easy to implement, it assumes the data are jointly normally distributed, a condition that is often violated in commercial analytics. Variations on PCA include Kernel PCA and Multilinear PCA; there is also a wide range of other advanced methods for feature extraction. Most commercial analytics packages implement PCA; alternatives to PCA are available in open source software.
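To make the idea concrete, the sketch below extracts the first principal component of two-dimensional data by power iteration on the covariance matrix; this is a teaching toy, and library implementations use eigen- or singular-value decomposition and handle any number of dimensions:

```python
import math

def first_component(data, iters=100):
    """Return the first principal component (a unit vector) of
    mean-centered 2-D data via power iteration on the 2x2
    covariance matrix."""
    n = len(data)
    mx = sum(x for x, _ in data) / n
    my = sum(y for _, y in data) / n
    centered = [(x - mx, y - my) for x, y in data]
    # sample covariance matrix entries
    cxx = sum(x * x for x, _ in centered) / (n - 1)
    cyy = sum(y * y for _, y in centered) / (n - 1)
    cxy = sum(x * y for x, y in centered) / (n - 1)
    v = (1.0, 0.0)
    for _ in range(iters):
        # multiply v by the covariance matrix, then renormalize
        w = (cxx * v[0] + cxy * v[1], cxy * v[0] + cyy * v[1])
        norm = math.hypot(*w)
        v = (w[0] / norm, w[1] / norm)
    return v
```

For points lying along the line y = x, the recovered component points along that diagonal, which is the single dimension that captures nearly all of the variation.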

Many predictive modeling techniques have built-in feature selection capabilities: The technique automatically evaluates and selects from available predictors. These techniques include tree-based methods (such as CART or C5.0); boosted methods (such as AdaBoost); bootstrap aggregation, or bagging; regularized methods, such as LARS or LASSO; and stepwise methods. When the modeling technique has built-in feature selection, the analyst can omit the feature selection step from the modeling process; this is a key reason to use these methods.

When the analyst does not want to use a technique with built-in feature selection, several options are available. The analyst can run a forward stepwise procedure (see “Stepwise Regression” later in this chapter) with a low threshold for variable inclusion; this will produce a list of candidate predictors, which the analyst can fine-tune in a second step. Another popular method for feature selection is to run regularized random forests (RRF) analysis, which produces a set of nonredundant variables.
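As a rough illustration of forward selection, the sketch below greedily adds the predictor most correlated with the current residual, a simplified stagewise stand-in for a full stepwise procedure; it assumes centered data and ignores significance testing:

```python
def forward_select(X, y, n_keep=2):
    """Greedy forward selection: repeatedly add the predictor (column
    of X) whose simple regression explains the most of the current
    residual, then subtract that fit from the residual."""
    p = len(X[0])
    resid = list(y)
    selected = []
    for _ in range(n_keep):
        best, best_score = None, 0.0
        for j in range(p):
            if j in selected:
                continue
            col = [row[j] for row in X]
            num = sum(c * r for c, r in zip(col, resid))
            den = sum(c * c for c in col)
            score = num * num / den if den else 0.0
            if score > best_score:
                best, best_score = j, score
        if best is None:
            break
        col = [row[best] for row in X]
        beta = sum(c * r for c, r in zip(col, resid)) / sum(c * c for c in col)
        resid = [r - beta * c for c, r in zip(col, resid)]
        selected.append(best)
    return selected
```

On data where the response is driven mostly by one column and weakly by another, the procedure picks the strong predictor first and the weak one second, ignoring an irrelevant column.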

Previously in this chapter, we discussed the value of Bayesian belief networks for exploratory analysis. After building a belief network, the analyst can use it for feature selection. Recall that each node in a belief network represents a variable in the analytic data set. For any given target node (the response measure), the Markov blanket consists of the node’s parents, its children, and its children’s other parents; conditioned on the variables in the Markov blanket, the target is independent of all other nodes in the network, so the analyst can restrict the candidate predictors to this set.

Although feature extraction is more elegant than feature selection and has a long history of academic use, feature selection is the more practical tool. Feature extraction techniques such as PCA add a step to the scoring process, which must convert raw data to the principal dimensions before computing a score. Predictive models based on feature selection techniques, by contrast, work with data as it exists in production (assuming the analyst worked with data in its raw form).
