2.6 The Bias–Variance Tradeoff
The bias–variance tradeoff is a fundamental concept in machine learning, representing the balance between a model’s ability to fit the training data accurately (low bias) and its ability to produce stable predictions that generalize well to unseen data (low variance) [220].
In general, the generalization error of a model can be decomposed into three components:
Bias: The systematic error caused by incorrect or overly simplistic model assumptions, which leads to consistent deviations between the model’s predictions and the true values. A model with high bias is unable to capture the complexities and underlying patterns in the data, resulting in poor performance on both the training set and unseen data. Such a model is said to be underfitting the training data.
Variance: The error due to the model’s sensitivity to fluctuations in the training data, reflecting how much the model’s predictions change when trained on different datasets. A model with high variance adapts too closely to the specific details of the training data (including the noise), which often results in poor generalization to new data. Such a model is said to be overfitting the training data.
Irreducible error: The portion of error caused by the inherent noise in the data, arising from factors such as measurement inaccuracies, data entry mistakes, or other unpredictable influences. This error cannot be eliminated by any model, regardless of its complexity.
Theorem 2.1 establishes the relationship between these three components.
Theorem 2.1. Generalization Error = Bias² + Variance + Noise
First, we introduce a lemma necessary for the proof of Theorem 2.1.
Lemma 1. If A and B are two independent random variables and 𝔼[B] = 0, then 𝔼[(A + B)²] = 𝔼[A²] + 𝔼[B²].
Proof. Since A and B are independent, we have 𝔼[AB] = 𝔼[A] 𝔼[B] = 0 (see Section C.9.5). The proof now follows from expanding the square and using the linearity of expectation:
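Written out, the expansion is:

```latex
\begin{aligned}
\mathbb{E}\big[(A + B)^2\big]
  &= \mathbb{E}\big[A^2 + 2AB + B^2\big] \\
  &= \mathbb{E}[A^2] + 2\,\mathbb{E}[AB] + \mathbb{E}[B^2] \\
  &= \mathbb{E}[A^2] + \mathbb{E}[B^2]. \qquad \blacksquare
\end{aligned}
```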
With the lemma established, we now prove Theorem 2.1, focusing on regression problems:
Proof. We consider the following setup:
The targets are generated by an unknown function f(x) combined with some random noise ε, where ε has a mean of 0 and a standard deviation of σ. This noise represents the irreducible error, capturing the randomness inherent in the data:

y = f(x) + ε,  where 𝔼[ε] = 0 and Var(ε) = σ².
We train our model on a randomly drawn training set D. The generalization error is measured as the expected squared error of the model’s prediction on a test sample (x, y), averaged over the random draw of the training set D:

𝔼D[(hD(x) − y)²]
Here, hD is the hypothesis of the model that was trained on D, y is the true label, and 𝔼D represents the expectation over all possible training sets. The expected MSE (Mean Squared Error) on a given sample x quantifies the model’s average performance considering variability in the training data selection.
We now decompose the MSE into bias and variance terms:
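Writing h̄(x) = 𝔼D[hD(x)] for the prediction averaged over training sets, the decomposition follows by applying Lemma 1 twice, first to the noise term and then to the deviation from the average hypothesis (a standard reconstruction of this step):

```latex
\begin{aligned}
\mathbb{E}_{D,\epsilon}\big[(h_D(x) - y)^2\big]
  &= \mathbb{E}_{D,\epsilon}\big[\big((h_D(x) - f(x)) - \epsilon\big)^2\big] \\
  &= \mathbb{E}_{D}\big[(h_D(x) - f(x))^2\big] + \sigma^2
     \qquad \text{(Lemma 1, with } B = -\epsilon\text{)} \\
  &= \mathbb{E}_{D}\big[\big((\bar{h}(x) - f(x)) + (h_D(x) - \bar{h}(x))\big)^2\big] + \sigma^2 \\
  &= \underbrace{\big(\bar{h}(x) - f(x)\big)^2}_{\text{Bias}^2}
   + \underbrace{\mathbb{E}_{D}\big[(h_D(x) - \bar{h}(x))^2\big]}_{\text{Variance}}
   + \underbrace{\sigma^2}_{\text{Noise}}
     \qquad \text{(Lemma 1, with } B = h_D(x) - \bar{h}(x)\text{)}
\end{aligned}
```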
The bias term represents the expected deviation between the true underlying function and the model’s hypothesis, averaged over all the possible training datasets. Essentially, the bias quantifies the portion of generalization error caused by the model’s limited ability to capture the true data-generating process. Conversely, the variance term measures the model’s sensitivity to fluctuations in the training data, indicating how much its predictions change with different training sets. Lastly, the irreducible error represents the noise intrinsically present in the problem itself, which no model can eliminate.
Adjusting the model to decrease the bias, e.g., by adding more parameters or employing a more complex architecture, tends to increase its variance. Similarly, reducing the variance, often by simplifying the model, tends to increase the bias. This reciprocal relationship between bias and variance is referred to as the bias–variance tradeoff.
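This tradeoff can be checked numerically. The sketch below uses a toy setup that is not from the text (a sine-wave target, Gaussian noise, and polynomial fits of increasing degree) and estimates bias² and variance by retraining on many freshly drawn training sets, mirroring the expectation over D in the proof:

```python
import numpy as np

rng = np.random.default_rng(0)

def f_true(x):
    """True data-generating function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.05, 0.95, 50)     # interior test grid
n_trials, n_train, sigma = 200, 30, 0.3  # resamples, training size, noise level

results = {}
for degree in (1, 3, 9):
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = f_true(x) + rng.normal(0, sigma, n_train)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    mean_pred = preds.mean(axis=0)  # average hypothesis over training sets
    bias2 = np.mean((mean_pred - f_true(x_test)) ** 2)
    variance = np.mean(preds.var(axis=0))
    results[degree] = (bias2, variance)
    print(f"degree={degree}: bias^2={bias2:.4f}, variance={variance:.4f}")
```

With this setup, the linear model shows large bias² and small variance, the degree-9 model the reverse, while the noise floor σ² = 0.09 is identical for all three.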
2.6.1 Model Capacity
A primary way to control the bias–variance tradeoff is by adjusting the model’s capacity, also known as its complexity, expressive power, or flexibility.
A model that is too simple to capture the underlying patterns in the data will have high bias and low variance (Figure 2.8, left). Such models tend to perform poorly on both the training and test sets. This situation is analogous to a novice zoologist who identifies animals based on only one or two traits—such as assuming that all animals with feathers are birds—leading to errors like misclassifying bats or other non-avian species.
Conversely, a complex model with excessively high capacity may overfit the training data, leading to low bias but high variance (Figure 2.8, right). These models tend to memorize the specific details of the training data rather than learning the general patterns, resulting in excellent performance on the training set but poor generalization to new data. This is similar to a zoologist with a photographic memory who, when encountering a new bird species, concludes it is not a bird simply because its tail feathers have a unique pattern not previously observed.
Figure 2.8: Illustration of the bias–variance tradeoff across different model complexities in supervised learning. Underfitting is depicted on the left with a simple linear curve, indicating high bias and low variance, failing to capture the complexity of the data. In the middle, an appropriately fitting model is shown using a quadratic curve, balancing bias and variance to effectively model the underlying data trend. On the right, overfitting is represented by a highly complex curve, which closely follows all data points, including the noisy ones, resulting in low bias but high variance.
The ideal model strikes the right balance between bias and variance (Figure 2.8, middle). Such a model typically has a training error that is slightly lower than its test error, indicating a good fit to the data without overfitting.
Models with insufficient capacity struggle with complex tasks, whereas high-capacity models can solve them but risk overfitting when their capacity exceeds what the task requires. The optimal model is the one whose capacity matches the true complexity of the task and the amount of available training data (see Figure 2.9).
Figure 2.9: The relationship between model capacity and error. Initially, when we increase capacity, both the training and generalization errors decrease. When the capacity becomes too large, the generalization error starts increasing while the training error keeps decreasing. This is where we move from the underfitting zone to the overfitting zone.
Adjusting the capacity of a model can be done in several ways. One way is to modify the model’s hypothesis space. For example, in a linear regression model, the hypothesis space consists of linear functions of the input. Expanding this space to include polynomial functions increases the model’s capacity. Another way to control the model capacity is by applying regularization techniques, which penalize complex models and help prevent overfitting (see Section 2.6.2).
Statistical learning theory, developed by Vapnik and Chervonenkis in the late 1960s [596], provides a theoretical foundation for understanding the tradeoff between model complexity and generalization in machine learning. It introduces concepts such as the Vapnik-Chervonenkis (VC) dimension, which measures a model’s capacity to fit a wide range of functions. This theory justifies minimizing the empirical risk (training error) rather than the expected risk (generalization error) when the hypothesis class H is sufficiently restricted. These advanced topics, essential for a deeper understanding of machine learning, will be further explored in Volume III of this book series.
2.6.2 Regularization
Regularization is a widely used technique in machine learning to control the model capacity and prevent overfitting. A common form of regularization involves adding a penalty term to the cost function that increases with the model’s complexity:

J(h) = Remp(h) + λ C(h)
Here, Remp(h) is the empirical risk (the model’s error on the training set), C(h) is a penalty term, and λ is a regularization coefficient that controls the bias–variance tradeoff. Increasing λ imposes a greater penalty on model complexity, leading to simpler models with higher bias and lower variance. The optimal value of λ is typically determined empirically via cross-validation.
The complexity of the model can be measured in various ways. For example, in parametric models, it is common to use the norm of the parameter vector ∥θ∥ as a measure of complexity, where different norms lead to different types of penalties (see Section 4.11).
Minimizing the regularized function J(h) is known as structural risk minimization, as opposed to empirical risk minimization, which focuses only on minimizing the training error. Structural risk minimization aims to achieve a balance between fitting the training data well (minimizing empirical risk) and keeping the model simple enough to ensure good generalization to unseen data (through regularization).
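As an illustration of structural risk minimization, the sketch below uses a hypothetical setup (a noisy quadratic target, a deliberately over-parameterized degree-8 polynomial basis, and the squared-norm penalty C(h) = ∥θ∥²) and solves the resulting ridge-regression objective in closed form, showing how increasing λ shrinks the parameter norm:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy dataset: noisy quadratic target, over-parameterized with a
# degree-8 polynomial basis (both are assumptions for this demo).
x = rng.uniform(-1, 1, 40)
y = 1.0 - 2.0 * x**2 + rng.normal(0, 0.2, 40)
X = np.vander(x, 9, increasing=True)  # columns 1, x, ..., x^8

def ridge(X, y, lam):
    """Minimize ||X @ theta - y||^2 + lam * ||theta||^2 in closed form."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

norms = {lam: np.linalg.norm(ridge(X, y, lam)) for lam in (0.0, 0.1, 10.0)}
for lam, n in norms.items():
    print(f"lambda={lam:5.1f}  ||theta|| = {n:.3f}")
```

In practice the intercept term is usually left unpenalized and λ is chosen by cross-validation, as described above; this sketch penalizes all coefficients for brevity.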
To summarize, strategies to reduce underfitting (high bias) include:
Increasing the model complexity: Introduce more parameters or use more complex algorithms to capture subtle patterns in the data.
Adding more features: Incorporate additional predictors to provide the model with more information and improve its predictive power.
Reducing regularization: Decrease the strength of regularization to allow the model to fit the training data more closely.
Strategies to reduce overfitting (high variance) include:
Simplifying the model: Use fewer parameters or simpler algorithms to prevent the model from capturing noise as signal.
Using more training data: More samples provide more information, which helps the model to generalize better rather than memorizing the training data.
Applying regularization: Introduce regularization or increase its strength to penalize overly complex models and promote simpler solutions.
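The effect of the second strategy, using more training data, can be checked with the same kind of simulation (a sine-wave target and a fixed degree-9 polynomial fit are assumptions of this sketch): the variance term falls as the training set grows while the model stays the same.

```python
import numpy as np

rng = np.random.default_rng(2)

def f_true(x):
    """True data-generating function (an assumption for this demo)."""
    return np.sin(2 * np.pi * x)

x_test = np.linspace(0.1, 0.9, 25)

def prediction_variance(n_train, degree=9, n_trials=200, sigma=0.3):
    """Estimate the variance term for a fixed-capacity model at a given training-set size."""
    preds = np.empty((n_trials, x_test.size))
    for t in range(n_trials):
        x = rng.uniform(0, 1, n_train)
        y = f_true(x) + rng.normal(0, sigma, n_train)
        preds[t] = np.polyval(np.polyfit(x, y, degree), x_test)
    return float(np.mean(preds.var(axis=0)))

for n in (20, 80, 320):
    print(f"n_train={n:4d}  variance={prediction_variance(n):.4f}")
```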
Overall, striking the right balance between bias and variance is a central challenge in model development and a key focus area in machine learning research.