- 2.1 Formal Definition
- 2.2 Machine Learning Models
- 2.3 The Data-Generating Process
- 2.4 Generalization Error and Empirical Risk Minimization
- 2.5 Parameter Estimation
- 2.6 The Bias-Variance Tradeoff
- 2.7 Building a Machine Learning Model
- 2.8 Challenges in Supervised Learning
- 2.9 Summary
- 2.10 Exercises
2.4 Generalization Error and Empirical Risk Minimization
Building on the probabilistic framework introduced in the previous section, we can now state the objective of supervised learning formally: find a model that minimizes the expected prediction error on data drawn from the underlying data-generating distribution. This quantity is known as the expected risk; it is also referred to as the true risk, generalization error, or out-of-sample error.
Formally, the expected risk of a hypothesis h, denoted by R(h), is defined as the expected value of the loss function with respect to the joint distribution p(x, y) of inputs and labels:

$$R(h) = \mathbb{E}_{(x, y) \sim p(x, y)}\big[L\big(y, h(x)\big)\big] = \int L\big(y, h(x)\big)\, p(x, y)\, dx\, dy$$
The ultimate goal in supervised learning is to find a hypothesis h* that minimizes the expected risk:

$$h^* = \arg\min_{h} R(h)$$
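When p(x, y) is known and discrete, the expected risk can be computed exactly by summing the loss weighted by the joint probabilities. The following sketch does this for a toy distribution and a 0-1 loss; the distribution and hypothesis are invented purely for illustration:

```python
import numpy as np

# Toy discrete joint distribution p(x, y) over x in {0, 1}, y in {0, 1}.
# Rows index x, columns index y; the entries sum to 1.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])

def zero_one_loss(y, y_pred):
    # L(y, h(x)) = 1 if the prediction is wrong, else 0.
    return float(y != y_pred)

def expected_risk(h):
    # R(h) = sum over (x, y) of L(y, h(x)) * p(x, y)
    return sum(zero_one_loss(y, h(x)) * p[x, y]
               for x in (0, 1) for y in (0, 1))

identity = lambda x: x        # predicts y = x
R = expected_risk(identity)   # only the off-diagonal mass is misclassified
```

Here the identity hypothesis incurs loss exactly on the off-diagonal probability mass, so R(h) = 0.1 + 0.1 = 0.2. In practice this direct computation is impossible because p(x, y) is unknown, which is precisely the motivation for the empirical risk below.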
However, since the data-generating distribution p(x, y) is typically unknown, the expected risk cannot be computed directly. Instead, we approximate it using the empirical risk (also referred to as the training error or cost function), which is calculated by averaging the loss function over the given training set:

$$\hat{R}(h) = \frac{1}{n} \sum_{i=1}^{n} L\big(y_i, h(x_i)\big)$$
where n is the number of samples in the training set.
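To make the empirical risk concrete, here is a minimal sketch in Python that averages a squared loss over a small training set; the data, hypothesis, and function names are all illustrative:

```python
import numpy as np

def squared_loss(y_true, y_pred):
    # L(y, h(x)) = (y - h(x))^2
    return (y_true - y_pred) ** 2

def empirical_risk(h, X, y, loss):
    # Average loss of hypothesis h over the n training samples.
    predictions = np.array([h(x) for x in X])
    return np.mean(loss(y, predictions))

# Toy training set with true relationship y = 2x,
# evaluated with a slightly-off hypothesis h(x) = 1.9x.
X = np.array([1.0, 2.0, 3.0])
y = 2.0 * X
h = lambda x: 1.9 * x

risk = empirical_risk(h, X, y, squared_loss)  # ~0.0467
```

The closer h tracks the true relationship on the sample, the smaller this average becomes; the hypothesis h(x) = 2x would give an empirical risk of exactly zero here.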
The empirical risk serves as a practical surrogate for the theoretical generalization error, providing an estimate of how well the hypothesis h performs on the observed data. When the model is parameterized by a vector θ, the training error is often denoted by J(θ).
In empirical risk minimization (ERM), the supervised learning algorithm aims to find a hypothesis ĥ that minimizes the empirical risk:

$$\hat{h} = \arg\min_{h} \hat{R}(h)$$
This approach minimizes the observed discrepancies between predicted and actual outcomes within the training dataset, using the empirical risk as a proxy for the true risk.
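The ERM principle can be sketched in its simplest parameterized form: a linear hypothesis h_θ(x) = θx under squared loss, where minimizing the training error J(θ) admits a closed-form solution. The data below is made up for illustration:

```python
import numpy as np

# Illustrative ERM: linear hypothesis h_theta(x) = theta * x, squared loss.
# Minimizing J(theta) = (1/n) * sum((y_i - theta * x_i)^2) over theta
# has the closed-form solution theta* = (x . y) / (x . x).
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])       # roughly y = 2x

theta_hat = X.dot(y) / X.dot(X)          # empirical risk minimizer
J = np.mean((y - theta_hat * X) ** 2)    # training error at the minimum
```

Any other value of θ yields a strictly larger J(θ) on this training set; whether θ̂ also achieves low *expected* risk is exactly the question of generalization addressed next.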
A learning algorithm is said to generalize well if the hypothesis it produces performs well not only on the training data but also on new, unseen data drawn from the same distribution. In other words, a model that generalizes well achieves a small generalization error, indicating strong predictive performance beyond the training set.
A related but stronger concept is known as consistency. A learning algorithm is said to be consistent if, as the number of training samples grows large, the empirical risk converges to the true risk:

$$\hat{R}(h_n) \xrightarrow{\; n \to \infty \;} R(h_n)$$
where h_n denotes the hypothesis produced by the algorithm when trained on a dataset of size n. Consistency ensures that, given enough data, minimizing the empirical risk will also minimize the generalization error.
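This convergence can be illustrated in the special case of a single *fixed* hypothesis, where it reduces to the law of large numbers (a weaker statement than consistency of a learning algorithm, since the hypothesis here does not depend on the data). The setup below is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed hypothesis h(x) = 0 on data with y = x and x ~ N(0, 1).
# Under squared loss the true risk is R(h) = E[(y - 0)^2] = E[x^2] = 1.
def empirical_risk_of_zero(n):
    x = rng.standard_normal(n)
    y = x
    return np.mean((y - 0.0) ** 2)

# The empirical risk approaches the true risk R(h) = 1 as n grows.
risks = {n: empirical_risk_of_zero(n) for n in (10, 1_000, 100_000)}
```

With only 10 samples the empirical risk can deviate noticeably from 1, while at 100,000 samples it is typically within a few percent, mirroring the limit in the definition above.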