2.5 Parameter Estimation
In parameterized models, our goal is to estimate the parameters that best explain the relationship between inputs and outputs while minimizing the generalization error. Two fundamental approaches to parameter estimation are maximum likelihood estimation (MLE) and Bayesian inference. While these methods are discussed in detail in the Statistics appendix (see Sections D.5 and D.8), we focus here on their application to supervised learning.
2.5.1 Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation seeks to determine the model parameters that maximize the probability of the observed data under the assumed probabilistic model.
The training dataset is given by D = {(x1, y1), . . . , (xn, yn)}, where each pair is assumed to be drawn i.i.d. from the data-generating distribution p(x, y). We further assume that the labels y are generated by a conditional distribution p(y|x; θ), where θ represents the model parameters. Under this assumption, the joint distribution factorizes as p(x, y; θ) = p(y|x; θ)p(x).
The objective of maximum likelihood estimation is to find the parameter values that maximize the likelihood of observing the given labels in the training data:

θ̂ = arg max_θ ∏_{i=1}^n p(yi | xi; θ).
In practice, we often maximize the log-likelihood instead of the likelihood itself: the logarithm is monotonically increasing, so the maximizer is unchanged, while the product over examples becomes a numerically more convenient sum:

θ̂ = arg max_θ ∑_{i=1}^n log p(yi | xi; θ).
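As a minimal sketch of this maximization, consider estimating the parameter of a Bernoulli distribution from binary labels, ignoring inputs x for simplicity. The data and the grid of candidate values below are illustrative; the grid search just makes the arg max explicit, since in this case the MLE is known in closed form to be the sample mean.

```python
import numpy as np

# Hypothetical binary labels, assumed i.i.d. Bernoulli(theta).
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(theta, y):
    # Sum of per-example log Bernoulli probabilities.
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Maximize over a grid of candidate parameter values.
grid = np.linspace(0.01, 0.99, 99)
theta_mle = grid[np.argmax([log_likelihood(t, y) for t in grid])]

print(theta_mle)  # coincides with the sample mean, 0.7
```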
A common practice in supervised learning is to define the loss function as the negative log-likelihood (NLL) of the observed labels under the model. Given an assumed conditional probability distribution p(y|x; θ), the NLL loss for a single training example (x, y) is defined as:

ℓ(θ; x, y) = −log p(y | x; θ).
The empirical risk is then computed as the average of the NLL over the training set:

R̂(θ) = (1/n) ∑_{i=1}^n −log p(yi | xi; θ).
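This average can be sketched numerically. The toy example below assumes a one-dimensional logistic model p(y = 1 | x; θ) = sigmoid(θx); the data and the parameter values being compared are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-d logistic model: p(y = 1 | x; theta) = sigmoid(theta * x).
def nll(theta, x, y):
    p = sigmoid(theta * x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def empirical_risk(theta, X, Y):
    # Average negative log-likelihood over the training set.
    return float(np.mean([nll(theta, x, y) for x, y in zip(X, Y)]))

X = np.array([-2.0, -1.0, 0.5, 1.5, 2.0])
Y = np.array([0, 0, 1, 1, 1])

print(empirical_risk(0.0, X, Y))  # log 2: the model is indifferent
print(empirical_risk(2.0, X, Y))  # lower risk for a better-fitting theta
```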
This formulation directly connects empirical risk minimization to maximum likelihood estimation, as minimizing the average NLL is equivalent to maximizing the log-likelihood of the training data.
Many common loss functions in supervised learning can be interpreted as instances of this negative log-likelihood framework. For example, the squared loss in regression corresponds to assuming that the labels are generated by a Gaussian distribution (see Section 4.2.1), while the cross-entropy loss in classification arises from modeling the labels with a Bernoulli or categorical distribution (see Section 5.3.1).
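The regression case can be checked numerically. The sketch below assumes a Gaussian noise model with fixed variance σ² = 1; the targets and predictions are made up for illustration.

```python
import numpy as np

# Hypothetical regression targets and model predictions.
y_true = np.array([1.2, -0.4, 0.7])
y_pred = np.array([1.0, 0.0, 0.5])
sigma = 1.0  # assumed fixed noise standard deviation

# Per-example Gaussian NLL: -log N(y_true | y_pred, sigma^2).
gauss_nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y_true - y_pred)**2 / (2 * sigma**2)

# Scaled squared loss plus a constant that does not depend on the prediction.
squared = 0.5 * (y_true - y_pred)**2
const = 0.5 * np.log(2 * np.pi)

same = np.allclose(gauss_nll, squared + const)
print(same)  # True: minimizing the Gaussian NLL minimizes the squared loss
```

Since the additive constant does not depend on θ, minimizing the Gaussian NLL and minimizing the squared loss select the same parameters.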
2.5.2 Bayesian Inference
The Bayesian approach offers a different perspective: instead of seeking a single best estimate for the parameters, it treats the parameters θ as random variables and computes a posterior distribution over them using Bayes’ theorem:

p(θ | D) = p(D | θ) p(θ) / p(D),
where D is the training set, p(θ) is the prior distribution expressing beliefs about the parameters before seeing any data, p(D|θ) is the likelihood of the data given those parameters, and p(D) is the evidence (or marginal likelihood), a normalizing constant obtained by integrating p(D|θ)p(θ) over θ.
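For a one-dimensional parameter, the posterior can be computed by brute force on a grid. The following is a minimal sketch assuming an i.i.d. Bernoulli likelihood and a uniform prior; the observations are hypothetical.

```python
import numpy as np

# Hypothetical binary observations.
D = np.array([1, 0, 1, 1, 0])

# Discretize theta and place a uniform prior over the grid.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

# Likelihood p(D | theta) under an i.i.d. Bernoulli model.
k, n = D.sum(), len(D)
likelihood = theta**k * (1 - theta)**(n - k)

# Bayes' theorem: posterior is likelihood times prior, divided
# by the evidence (here approximated by the sum over the grid).
unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()

print(theta[np.argmax(posterior)])  # flat prior: the mode matches the MLE k/n = 0.6
```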
The Bayesian framework allows for integrating prior knowledge into the learning process and provides a full distribution over the parameters rather than just point estimates. This can be particularly useful when data is limited or noisy, or when quantifying uncertainty in predictions is important.
On the other hand, Bayesian inference also presents several challenges: computing the posterior distribution can be computationally intensive, especially for complex models; the choice of appropriate priors can be subjective; and the probabilistic nature of Bayesian methods may feel less intuitive to those used to point estimates in frequentist approaches.
2.5.3 Example: Medical Diagnosis
To illustrate these two approaches, consider the task of predicting whether a patient has a particular disease based on diagnostic test results and patient characteristics. Under the MLE framework, we might assume that the labels (disease present or not) are drawn from a Bernoulli distribution, where the parameter represents the probability of the disease being present. The model parameters are then estimated by maximizing the likelihood of the observed labels in the training data, without incorporating any prior assumptions.
In contrast, the Bayesian approach allows us to introduce prior knowledge about the disease prevalence in the population, informed by epidemiological studies or expert knowledge. The likelihood is again determined by how well the model explains the observed data, such as the results of diagnostic tests or patient risk factors. Combining this likelihood with the prior via Bayes’ theorem yields an updated posterior distribution over the model parameters. As more patient data becomes available, the influence of the initial prior diminishes, and the posterior distribution becomes increasingly shaped by the observed data.
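This workflow can be sketched with a conjugate Beta-Bernoulli model, in which the posterior update has a closed form. The prior parameters and the patient outcomes below are purely illustrative assumptions, not real clinical data.

```python
# Minimal Beta-Bernoulli sketch of the Bayesian update described above.
# Assumed prior: prevalence around 5%; Beta(2, 38) has mean 2 / (2 + 38) = 0.05.
a, b = 2.0, 38.0

# Hypothetical patient outcomes: 1 = disease present, 0 = absent.
observations = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Conjugate update: each observation shifts the Beta parameters.
for y in observations:
    a += y        # positive cases increment alpha
    b += 1 - y    # negative cases increment beta

posterior_mean = a / (a + b)
print(posterior_mean)  # 0.08: between the prior mean 0.05 and the data rate 0.2
```

Note how the posterior mean lies between the prior belief and the observed frequency; with more observations it would move further toward the data.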
In summary, MLE and Bayesian inference provide two complementary approaches to parameter estimation in supervised learning. MLE seeks the parameter values that maximize the likelihood of the observed data, while Bayesian inference combines prior beliefs with observed data to produce a posterior distribution over the parameters. Both approaches recur throughout this book, as they form the foundation of many machine learning methods.





