Supervised Machine Learning

Supervised machine learning is a subfield of machine learning focused on building models from labeled data in order to make predictions for new, unseen examples. It is the most widely studied and applied branch of machine learning.

In supervised learning, we are given a dataset that consists of examples paired with their correct outputs—much like a student learning from a teacher who provides questions along with the correct answers. For example, we can use supervised learning to build a spam filter by training on a collection of emails labeled as spam or non-spam. This data-driven approach typically yields filters that are more effective and robust than manually crafted rule-based systems, which tend to become outdated as spamming strategies evolve.

Supervised machine learning is widely applied across many domains, such as:

  • Fraud detection systems protect financial institutions from malicious activity.

  • Medical diagnosis systems are able to detect a wide range of diseases, often with precision that rivals that of medical professionals.

  • Speech recognition systems transcribe speech to text in real time, often with accuracy that rivals human transcribers.

This chapter introduces the key principles and foundational theories of supervised machine learning, setting the stage for a deeper exploration of learning algorithms and their applications in subsequent chapters. The chapter is organized as follows:

  • Section 2.1 formally defines the supervised machine learning problem and distinguishes between its main types, such as regression and classification.

  • Section 2.2 explores the key components of a machine learning model, including parameters, hyperparameters, loss functions, optimization methods, and evaluation metrics.

  • Section 2.3 discusses the probabilistic framework underlying supervised learning, introducing the concepts of the data-generating process and Bayes error, which defines the theoretical limit of classification accuracy.

  • Section 2.4 formalizes the learning objective through the principle of empirical risk minimization, explaining how learning algorithms approximate the expected prediction error using finite training data.

  • Section 2.5 presents two fundamental approaches to parameter estimation in supervised learning: maximum likelihood estimation (MLE) and Bayesian inference.

  • Section 2.6 explores the bias–variance tradeoff, a key concept for understanding the relationship between model capacity and generalization performance, and discusses strategies for controlling model capacity, including regularization techniques.

  • Section 2.7 outlines the practical steps involved in constructing a supervised learning model, from data preparation to model selection and evaluation.

  • Section 2.8 discusses common challenges in supervised learning, such as data quality issues, imbalanced classes, and overfitting.

The chapter concludes with a summary of key concepts and a set of exercises designed to reinforce understanding.

2.1 Formal Definition

In supervised machine learning problems, we are given a dataset of n labeled samples (also called examples, instances, observations, or data points). Each sample in the dataset is a pair consisting of a vector x, which contains the features (or attributes) of that sample, and its corresponding label (or target) y.

For example, in a spam filter application, x consists of the email attributes, such as its subject, sender, recipients, and the words in its body, and y is a binary label that indicates whether the email is spam (y = 1) or ham (y = 0). In an image classification task, x is a vector that contains all the raw pixels of the image, and y is the class to which the image belongs (e.g., y = 0 represents a car, y = 1 represents a flower, etc.).
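To make the feature-vector idea concrete, here is a minimal sketch of turning raw emails into feature vectors x and binary labels y using a bag-of-words encoding. The tiny corpus, vocabulary, and labels are made up for illustration; a real spam filter would use far richer features.

```python
# Toy corpus: each email is paired with its label (y = 1 spam, y = 0 ham).
emails = [
    "win a free prize now",
    "meeting agenda for tomorrow",
    "free offer claim your prize",
    "lunch tomorrow at noon",
]
y = [1, 0, 1, 0]

# Build the vocabulary (the feature set) from the training corpus.
vocab = sorted({word for email in emails for word in email.split()})

def to_feature_vector(email):
    """Bag-of-words encoding: x[j] counts occurrences of the j-th vocabulary word."""
    words = email.split()
    return [words.count(w) for w in vocab]

# Each x is a d-dimensional vector, with d = len(vocab) features.
X = [to_feature_vector(e) for e in emails]
```

Each sample in the resulting dataset is a pair (x, y), exactly as in the formal definition above.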

If we denote by d the number of features in the dataset (d stands for dimensionality), then x is a d-dimensional vector:

x = (x1, x2, . . . , xd)ᵀ ∊ ℝ^d

Collectively, we denote the given dataset by D:

D = {(x1, y1), (x2, y2), . . . , (xn, yn)}

The feature vectors are often stored together in a matrix called a design matrix (or feature matrix) and denoted by X, an n × d matrix whose i-th row is the transposed feature vector xiᵀ:

X = [x1, x2, . . . , xn]ᵀ ∊ ℝ^(n×d)

The rows of X represent the samples and the columns represent the features, such that xij is the value of the j-th feature associated with the i-th sample. For example, the Iris dataset, which contains 150 samples with four features each, can be represented by a design matrix X ∊ ℝ^(150×4). The design matrix allows the learning algorithm to apply algebraic operations on the dataset as a single entity. For example, solving linear regression problems typically involves computing the inverse of XᵀX (see Section 4.4).
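The following sketch builds a design matrix with the same shape as the Iris dataset and computes the d × d matrix XᵀX mentioned above. Random numbers stand in for the real flower measurements.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 150, 4                 # Iris-like shape: 150 samples, 4 features

# Design matrix: rows are samples, columns are features,
# so X[i, j] is the j-th feature of the i-th sample.
X = rng.normal(size=(n, d))

# The d x d matrix X^T X that appears in linear regression (Section 4.4).
gram = X.T @ X
```

Operating on X as a single matrix like this is exactly why the design-matrix representation is convenient for learning algorithms.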

Figure 2.1: Illustration of supervised learning using a spam filter model. The model takes an input feature vector x, derived from incoming emails, and predicts a label y indicating whether an email is “Spam” or “Ham” (non-spam).

The goal in supervised learning is to find a function f : 𝒳 → 𝒴 that accurately maps feature vectors x in the input space 𝒳 to their corresponding labels y in the output space 𝒴 (see Figure 2.1). The exact form of the true mapping f is usually unknown to us, and must be inferred from a limited number of samples (xi, yi), which are often noisy and cover only a small part of the input space.

Since our learning is based on a finite and imperfect set of training samples, we can only obtain an approximation of the true mapping f(x). This approximation is known as a hypothesis, denoted by h(x). The model’s hypothesis is an element in some hypothesis space H, which contains the set of all functions that the learning algorithm can choose from as possible solutions. For example, in linear regression, H consists of all linear functions of x, while in decision trees, H includes all functions that can be represented as a sequence of binary decisions.
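As a small illustration of the hypothesis-space idea, the sketch below represents the linear hypothesis space in code: every choice of weights w and bias b picks out one hypothesis h(x) = w · x + b from H. The particular numbers are made up for illustration.

```python
def make_linear_hypothesis(w, b):
    """Return one element h of the linear hypothesis space H:
    h(x) = w . x + b for a fixed weight vector w and bias b."""
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b

# Two different hypotheses drawn from the same hypothesis space H.
h1 = make_linear_hypothesis([2.0, -1.0], 0.5)
h2 = make_linear_hypothesis([0.0, 3.0], -1.0)

x = [1.0, 2.0]
print(h1(x))  # 2*1 - 1*2 + 0.5 = 0.5
print(h2(x))  # 0*1 + 3*2 - 1 = 5.0
```

A learning algorithm searches this space for the hypothesis that best approximates the true mapping f on the training data.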

2.1.1 Regression versus Classification

We distinguish between two types of supervised learning problems:

  • In regression problems, the label y is a continuous value (y ∊ ℝ). For example, in a house price prediction task, y is a positive real value that represents the price of the house. Regression models are often referred to as regressors.

  • In classification problems, the label y is discrete, i.e., it can take one of k values (y ∊ {1, . . . , k}), representing the class to which the sample belongs. Classification models are often referred to as classifiers.

We further distinguish between two subtypes of classification problems:

  • In binary classification, there are only two classes (k = 2). For example, spam detection is a binary classification problem, in which the label can be either y = 1 (spam) or y = 0 (ham).

  • In multi-class classification, there are more than two classes (k > 2). For example, a handwritten digit recognition task is a multi-class problem with k = 10 classes (one for each possible digit 0–9).

Beyond the way labels are defined, regression and classification also differ in the structure of their outputs. Regression models predict continuous values by fitting a function that closely approximates the observed data points, whereas classification models learn decision boundaries that partition the input space into distinct regions, each corresponding to a different class (see Figure 2.2).
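The contrast in output structure can be sketched in a few lines: a regressor returns a continuous value, while a classifier maps the same kind of input to a discrete label via a decision boundary. The line y = 3x + 1 and the boundary at x = 2 are made-up examples.

```python
def regressor(x):
    """Continuous prediction: a fitted line, e.g. y-hat = 3x + 1."""
    return 3.0 * x + 1.0

def classifier(x):
    """Discrete prediction: a decision boundary at x = 2 partitions
    the input space into two class regions."""
    return 1 if x > 2.0 else 0

print(regressor(1.5))   # 5.5  (a real value)
print(classifier(1.5))  # 0    (a hard class label)
print(classifier(2.5))  # 1
```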

Figure 2.2: Regression fits a continuous function to the data points, while classification seeks to establish distinct boundaries that separate different classes.

2.1.2 Deterministic versus Probabilistic Classification

Classification methods are often divided into two main types:

  • Deterministic classifiers output a specific class label for a given input (a hard label), without providing a measure of uncertainty or confidence for their predictions. For example, a deterministic spam filter would classify a given email as either spam or non-spam, without indicating any uncertainty or probability.

  • Probabilistic classifiers provide probability estimates P(y = k|x) for each of the k possible classes the input can belong to (a soft label). These probabilities reflect the model’s confidence in its predictions. For example, a probabilistic spam filter might say that an email has an 80% chance of being spam and a 20% chance of being ham.

In general, probabilistic classifiers are harder to implement, but they provide valuable insights into the model’s uncertainty, making them especially useful in domains such as medical diagnosis and financial forecasting, where understanding the confidence of a prediction is crucial.
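The distinction can be sketched for the spam example: a probabilistic classifier outputs P(y = 1|x), and a deterministic one can be obtained from it by thresholding. The logistic form and the parameter values below are made-up illustrations, not a trained model.

```python
import math

def p_spam(x):
    """Probabilistic classifier: P(y = 1 | x) via a logistic function.
    The weight and bias are hypothetical 'learned' parameters."""
    w, b = 1.5, -2.0
    return 1.0 / (1.0 + math.exp(-(w * x + b)))

def hard_label(x):
    """Deterministic classifier: threshold the soft label at 0.5."""
    return 1 if p_spam(x) >= 0.5 else 0

x = 3.0                        # e.g. a count of suspicious words
print(round(p_spam(x), 2))     # soft label: the model's confidence that y = 1
print(hard_label(x))           # hard label: just "spam" (1) or "ham" (0)
```

Note that the hard label discards the confidence information, which is precisely what makes soft labels valuable in high-stakes domains.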
