- 2.1 Formal Definition
- 2.2 Machine Learning Models
- 2.3 The Data-Generating Process
- 2.4 Generalization Error and Empirical Risk Minimization
- 2.5 Parameter Estimation
- 2.6 The Bias-Variance Tradeoff
- 2.7 Building a Machine Learning Model
- 2.8 Challenges in Supervised Learning
- 2.9 Summary
- 2.10 Exercises
2.4 Generalization Error and Empirical Risk Minimization
Building on the probabilistic framework introduced in the previous section, we can now state the objective of supervised learning formally: find a model that minimizes the expected prediction error on data drawn from the underlying data-generating distribution. This quantity is known as the expected risk; it is also referred to as the true risk, generalization error, or out-of-sample error.
Formally, the expected risk of a hypothesis h, denoted by R(h), is defined as the expected value of the loss function with respect to the joint distribution p(x, y) of inputs and labels:

$$R(h) = \mathbb{E}_{(x, y) \sim p(x, y)}\big[L\big(y, h(x)\big)\big] = \int L\big(y, h(x)\big)\, p(x, y)\, dx\, dy$$
The ultimate goal in supervised learning is to find a hypothesis h* that minimizes the expected risk:

$$h^* = \arg\min_{h} R(h)$$
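When p(x, y) is known and discrete, the expected risk can be computed exactly by summing the loss weighted by the joint probabilities. The following sketch does this for a toy distribution and a 0-1 loss; the distribution and hypothesis are invented purely for illustration:

```python
import numpy as np

# Toy discrete joint distribution p(x, y) over x in {0, 1}, y in {0, 1}.
# Rows index x, columns index y; the entries sum to 1.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def zero_one_loss(y, y_pred):
    # L(y, h(x)) = 1 if the prediction is wrong, else 0.
    return float(y != y_pred)

def expected_risk(h):
    # R(h) = sum over (x, y) of L(y, h(x)) * p(x, y)
    return sum(zero_one_loss(y, h(x)) * p[x, y]
               for x in (0, 1) for y in (0, 1))

identity = lambda x: x        # predicts y = x
R = expected_risk(identity)   # only the off-diagonal mass is misclassified
```

Here the identity hypothesis incurs loss exactly on the off-diagonal probability mass, so R(h) = 0.1 + 0.1 = 0.2. In practice this direct computation is impossible because p(x, y) is unknown, which is precisely the motivation for the empirical risk below.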
However, since the data-generating distribution p(x, y) is typically unknown, the expected risk cannot be computed directly. Instead, we approximate it using the empirical risk (also referred to as the training error or cost function), which is calculated by averaging the loss function over the given training set:

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, h(x_i)\big)$$
where n is the number of samples in the training set.
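To make the empirical risk concrete, here is a minimal sketch in Python that averages a squared loss over a small training set; the data, hypothesis, and function names are all illustrative:

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # L(y, h(x)) = (y - h(x))^2
    return (y_true - y_pred) ** 2

def empirical_risk(h, X, y, loss):
    # Average loss of hypothesis h over the n training samples.
    predictions = np.array([h(x) for x in X])
    return np.mean(loss(y, predictions))

# Toy training set with true relationship y = 2x,
# evaluated with a slightly-off hypothesis h(x) = 1.9x.
X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X
h = lambda x: 1.9 * x

risk = empirical_risk(h, X, y, squared_loss)  # ~0.0467
```

The closer h tracks the true relationship on the sample, the smaller this average becomes; the hypothesis h(x) = 2x would give an empirical risk of exactly zero here.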
The empirical risk serves as a practical surrogate for the theoretical generalization error, providing an estimate of how well the hypothesis h performs on the observed data. When the model is parameterized by a vector θ, the training error is often denoted by J(θ).
In empirical risk minimization (ERM), the supervised learning algorithm aims to find a hypothesis ĥ that minimizes the empirical risk:

$$\hat{h} = \arg\min_{h} \hat{R}(h)$$
This approach minimizes the observed discrepancies between predicted and actual outcomes within the training dataset, using the empirical risk as a proxy for the true risk.
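The ERM principle can be sketched in its simplest parameterized form: a linear hypothesis h_θ(x) = θx under squared loss, where minimizing the training error J(θ) admits a closed-form solution. The data below is made up for illustration:

```python
import numpy as np

# Illustrative ERM: linear hypothesis h_theta(x) = theta * x, squared loss.
# Minimizing J(theta) = (1/n) * sum((y_i - theta * x_i)^2) over theta
# has the closed-form solution theta* = (x . y) / (x . x).
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])       # roughly y = 2x

theta_hat = X.dot(y) / X.dot(X)          # empirical risk minimizer
J = np.mean((y - theta_hat * X) ** 2)    # training error at the minimum
```

Any other value of θ yields a strictly larger J(θ) on this training set; whether θ̂ also achieves low *expected* risk is exactly the question of generalization addressed next.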
A learning algorithm is said to generalize well if the hypothesis it produces performs well not only on the training data but also on new, unseen data drawn from the same distribution. In other words, a model that generalizes well achieves a small generalization error, indicating strong predictive performance beyond the training set.
A related but stronger concept is known as consistency. A learning algorithm is said to be consistent if, as the number of training samples grows large, the empirical risk converges to the true risk:

$$\hat{R}(h_n) \xrightarrow{\; n \to \infty \;} R(h_n)$$
where h_n denotes the hypothesis produced by the algorithm when trained on a dataset of size n. Consistency ensures that, given enough data, minimizing the empirical risk will also minimize the generalization error.
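This convergence can be illustrated in the special case of a single *fixed* hypothesis, where it reduces to the law of large numbers (a weaker statement than consistency of a learning algorithm, since the hypothesis here does not depend on the data). The setup below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed hypothesis h(x) = 0 on data with y = x and x ~ N(0, 1).
# Under squared loss the true risk is R(h) = E[(y - 0)^2] = E[x^2] = 1.
def empirical_risk_of_zero(n):
    x = rng.standard_normal(n)
    y = x
    return np.mean((y - 0.0) ** 2)

# The empirical risk approaches the true risk R(h) = 1 as n grows.
risks = {n: empirical_risk_of_zero(n) for n in (10, 1_000, 100_000)}
```

With only 10 samples the empirical risk can deviate noticeably from 1, while at 100,000 samples it is typically within a few percent, mirroring the limit in the definition above.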