2.5 Parameter Estimation
In parameterized models, our goal is to estimate the parameters that best explain the relationship between inputs and outputs while minimizing the generalization error. Two fundamental approaches to parameter estimation are maximum likelihood estimation (MLE) and Bayesian inference. While these methods are discussed in detail in the Statistics appendix (see Sections D.5 and D.8), we focus here on their application to supervised learning.
2.5.1 Maximum Likelihood Estimation (MLE)
Maximum likelihood estimation seeks to determine the model parameters that maximize the probability of the observed data under the assumed probabilistic model.
The training dataset is given by D = {(x1, y1), . . . , (xn, yn)}, where each pair is assumed to be drawn i.i.d. from the data-generating distribution p(x, y). We further assume that the labels y are generated by a conditional distribution p(y|x; θ), where θ represents the model parameters. Under this assumption, the joint distribution factorizes as p(x, y; θ) = p(y|x; θ)p(x).
The objective of maximum likelihood estimation is to find the parameter values that maximize the likelihood of observing the given labels in the training data:

θ̂ = arg max_θ ∏_{i=1}^n p(yi | xi; θ).
In practice, we often maximize the log-likelihood instead of the likelihood itself: the logarithm is monotonically increasing, so the maximizer is unchanged, while the product over examples becomes a numerically more convenient sum:

θ̂ = arg max_θ ∑_{i=1}^n log p(yi | xi; θ).
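As a minimal sketch of this maximization, consider estimating the parameter of a Bernoulli distribution from binary labels, ignoring inputs x for simplicity. The data and the grid of candidate values below are illustrative; the grid search just makes the arg max explicit, since in this case the MLE is known in closed form to be the sample mean.

```python
import numpy as np

# Hypothetical binary labels, assumed i.i.d. Bernoulli(theta).
y = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])

def log_likelihood(theta, y):
    # Sum of per-example log Bernoulli probabilities.
    return np.sum(y * np.log(theta) + (1 - y) * np.log(1 - theta))

# Maximize over a grid of candidate parameter values.
grid = np.linspace(0.01, 0.99, 99)
theta_mle = grid[np.argmax([log_likelihood(t, y) for t in grid])]

print(theta_mle)  # coincides with the sample mean, 0.7
```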
A common practice in supervised learning is to define the loss function as the negative log-likelihood (NLL) of the observed labels under the model. Given an assumed conditional probability distribution p(y|x; θ), the NLL loss for a single training example (x, y) is defined as:

ℓ(θ; x, y) = −log p(y | x; θ).
The empirical risk is then computed as the average of the NLL over the training set:

R̂(θ) = (1/n) ∑_{i=1}^n −log p(yi | xi; θ).
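This average can be sketched numerically. The toy example below assumes a one-dimensional logistic model p(y = 1 | x; θ) = sigmoid(θx); the data and the parameter values being compared are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical 1-d logistic model: p(y = 1 | x; theta) = sigmoid(theta * x).
def nll(theta, x, y):
    p = sigmoid(theta * x)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def empirical_risk(theta, X, Y):
    # Average negative log-likelihood over the training set.
    return float(np.mean([nll(theta, x, y) for x, y in zip(X, Y)]))

X = np.array([-2.0, -1.0, 0.5, 1.5, 2.0])
Y = np.array([0, 0, 1, 1, 1])

print(empirical_risk(0.0, X, Y))  # log 2: the model is indifferent
print(empirical_risk(2.0, X, Y))  # lower risk for a better-fitting theta
```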
This formulation directly connects empirical risk minimization to maximum likelihood estimation, as minimizing the average NLL is equivalent to maximizing the log-likelihood of the training data.
Many common loss functions in supervised learning can be interpreted as instances of this negative log-likelihood framework. For example, the squared loss in regression corresponds to assuming that the labels are generated by a Gaussian distribution (see Section 4.2.1), while the cross-entropy loss in classification arises from modeling the labels with a Bernoulli or categorical distribution (see Section 5.3.1).
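The regression case can be checked numerically. The sketch below assumes a Gaussian noise model with fixed variance σ² = 1; the targets and predictions are made up for illustration.

```python
import numpy as np

# Hypothetical regression targets and model predictions.
y_true = np.array([1.2, -0.4, 0.7])
y_pred = np.array([1.0, 0.0, 0.5])
sigma = 1.0  # assumed fixed noise standard deviation

# Per-example Gaussian NLL: -log N(y_true | y_pred, sigma^2).
gauss_nll = 0.5 * np.log(2 * np.pi * sigma**2) + (y_true - y_pred)**2 / (2 * sigma**2)

# Scaled squared loss plus a constant that does not depend on the prediction.
squared = 0.5 * (y_true - y_pred)**2
const = 0.5 * np.log(2 * np.pi)

same = np.allclose(gauss_nll, squared + const)
print(same)  # True: minimizing the Gaussian NLL minimizes the squared loss
```

Since the additive constant does not depend on θ, minimizing the Gaussian NLL and minimizing the squared loss select the same parameters.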
2.5.2 Bayesian Inference
The Bayesian approach offers a different perspective: instead of seeking a single best estimate for the parameters, it treats the parameters θ as random variables and computes a posterior distribution over them using Bayes’ theorem:

p(θ | D) = p(D | θ) p(θ) / p(D),
where D is the training set, p(θ) is the prior distribution expressing beliefs about the parameters before seeing any data, p(D|θ) is the likelihood of the data given those parameters, and p(D) is the evidence (or marginal likelihood), a normalizing constant obtained by integrating p(D|θ)p(θ) over θ.
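For a one-dimensional parameter, the posterior can be computed by brute force on a grid. The following is a minimal sketch assuming an i.i.d. Bernoulli likelihood and a uniform prior; the observations are hypothetical.

```python
import numpy as np

# Hypothetical binary observations.
D = np.array([1, 0, 1, 1, 0])

# Discretize theta and place a uniform prior over the grid.
theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

# Likelihood p(D | theta) under an i.i.d. Bernoulli model.
k, n = D.sum(), len(D)
likelihood = theta**k * (1 - theta)**(n - k)

# Bayes' theorem: posterior is likelihood times prior, divided
# by the evidence (here approximated by the sum over the grid).
unnorm = likelihood * prior
posterior = unnorm / unnorm.sum()

print(theta[np.argmax(posterior)])  # flat prior: the mode matches the MLE k/n = 0.6
```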
The Bayesian framework allows for integrating prior knowledge into the learning process and provides a full distribution over the parameters rather than just point estimates. This can be particularly useful when data is limited or noisy, or when quantifying uncertainty in predictions is important.
On the other hand, Bayesian inference also presents several challenges: computing the posterior distribution can be computationally intensive, especially for complex models; the choice of appropriate priors can be subjective; and the probabilistic nature of Bayesian methods may feel less intuitive to those used to point estimates in frequentist approaches.
2.5.3 Example: Medical Diagnosis
To illustrate these two approaches, consider the task of predicting whether a patient has a particular disease based on diagnostic test results and patient characteristics. Under the MLE framework, we might assume that the labels (disease present or not) are drawn from a Bernoulli distribution, where the parameter represents the probability of the disease being present. The model parameters are then estimated by maximizing the likelihood of the observed labels in the training data, without incorporating any prior assumptions.
In contrast, the Bayesian approach allows us to introduce prior knowledge about the disease prevalence in the population, informed by epidemiological studies or expert knowledge. The likelihood is again determined by how well the model explains the observed data, such as the results of diagnostic tests or patient risk factors. Combining this likelihood with the prior via Bayes’ theorem yields an updated posterior distribution over the model parameters. As more patient data becomes available, the influence of the initial prior diminishes, and the posterior distribution becomes increasingly shaped by the observed data.
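This workflow can be sketched with a conjugate Beta-Bernoulli model, in which the posterior update has a closed form. The prior parameters and the patient outcomes below are purely illustrative assumptions, not real clinical data.

```python
# Minimal Beta-Bernoulli sketch of the Bayesian update described above.
# Assumed prior: prevalence around 5%; Beta(2, 38) has mean 2 / (2 + 38) = 0.05.
a, b = 2.0, 38.0

# Hypothetical patient outcomes: 1 = disease present, 0 = absent.
observations = [0, 0, 1, 0, 0, 0, 1, 0, 0, 0]

# Conjugate update: each observation shifts the Beta parameters.
for y in observations:
    a += y        # positive cases increment alpha
    b += 1 - y    # negative cases increment beta

posterior_mean = a / (a + b)
print(posterior_mean)  # 0.08: between the prior mean 0.05 and the data rate 0.2
```

Note how the posterior mean lies between the prior belief and the observed frequency; with more observations it would move further toward the data.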
In summary, MLE and Bayesian inference provide two complementary approaches to parameter estimation in supervised learning. MLE seeks the parameter values that maximize the likelihood of the observed data, while Bayesian inference combines prior beliefs with observed data to produce a posterior distribution over the parameters. Both approaches recur throughout this book, as they form the foundation of many machine learning methods.





