2.10 Exercises

2.10.1 Multiple-Choice Questions

Circle all the correct choices. There may be more than one correct choice, but there is always at least one correct choice.

2.1. What does the data-generating distribution represent?

  1. The real-world process by which the data is generated

  2. The probability distribution from which the training data is drawn

  3. The distribution of features in the training set

  4. The distribution used by generative models to create new data samples

2.2. What does the generalization error of a model represent?

  1. The error rate of the model on the test data

  2. The variance of the model’s predictions on the test data

  3. The expected value of the error on new input

  4. The difference in error between training and testing data

2.3. Maximum Likelihood Estimation (MLE) in supervised learning is used to:

  1. Maximize the probability of observing the given data under the model parameters

  2. Minimize the error between the predicted and actual values

  3. Estimate the probability of the model being correct given the data

  4. Update the model parameters based on prior beliefs and the likelihood of the observed data

2.4. Generative models:

  1. model the boundary between classes.

  2. model the input distribution in individual classes.

  3. can be used for classification tasks.

  4. can only be used with continuous data, not categorical data.

2.5. Bayesian statistics in supervised learning is used for what purpose?

  1. To reduce the amount of data needed for model training

  2. To model uncertainty in predictions by producing probability distributions over outcomes

  3. To estimate the parameters of a model probabilistically

  4. To incorporate prior knowledge into the model

2.6. Which of the following statements are true about gradient descent?

  1. Gradient descent is used to minimize the cost function in machine learning models.

  2. When the function is convex, gradient descent is guaranteed to find the global minimum.

  3. Choosing a high learning rate causes gradient descent to converge more rapidly.

  4. Gradient descent requires the function to be twice differentiable.

2.7. The bias–variance tradeoff in machine learning suggests that:

  1. As bias decreases, variance tends to increase, and vice versa.

  2. Low bias is usually preferable to low variance.

  3. Reducing variance leads to better model performance on unseen data.

  4. High bias can lead to underfitting, while high variance can lead to overfitting.

  5. A perfect model has neither bias nor variance.

2.8. Reducing the regularization strength λ will lead to:

  1. lower bias

  2. lower variance

  3. higher bias

  4. higher variance

2.9. Which of the following strategies help to reduce overfitting?

  1. Increase the size of the training set.

  2. Use a more complex model.

  3. Perform feature selection to reduce the number of input variables.

  4. Use cross-validation to ensure the model performs well on unseen data.

2.10. Which of the plots in Figure 2.11 shows a hypothesis that underfits the training data?

Figure 2.11: Four models with different levels of model complexity. Which one underfits the data?

2.10.2 Theoretical Exercises

2.11. Identify which of the following problems can be framed as a supervised machine learning problem. For each such problem, clearly state the inputs and the outputs (labels), and whether it is a classification or a regression problem.

  1. Build a medical diagnosis application that recommends treatments to patients based on their symptoms and their characteristics such as gender, age, blood pressure, and outcomes of various tests.

  2. Build a chatbot application that provides customer support to the user.

  3. Flag offensive content in a social media application.

  4. Learn to play chess.

  5. Detect the locations of ships in satellite images.

  6. Build a movie recommendation system that suggests new movies to users based on their favorite movies from the past.

2.12. Consider the problem of classifying tumors as either benign or malignant based on their measured size (in millimeters). The sizes of tumors from the two classes are modeled as follows:

  • Malignant tumors: size x ∼ 𝒩 (40, 8²)

  • Benign tumors: size x ∼ 𝒩 (20, 6²)

Assume that malignant and benign tumors are equally likely in the population (i.e., the prior probabilities for both classes are 0.5).

  1. Write the expression for the posterior probability P (Malignant|x) using Bayes’ theorem.

  2. Determine the decision rule of the Bayes optimal classifier. Specifically, find the value x* where the two class-conditional densities are equal.

  3. Using the value of x* found in part (b), write the formula for the Bayes error E*.

  4. Explain why the Bayes error represents the lowest possible classification error achievable, even by an ideal classifier.
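One way to check your analytic answer to part (b) is to locate the crossing point numerically. The sketch below (an illustration, not part of the exercise) uses `scipy.stats.norm` for the two class-conditional densities and `scipy.optimize.brentq` to find where they are equal; with equal priors, that crossing is the Bayes decision boundary.

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Class-conditional densities from Exercise 2.12.
benign = norm(loc=20, scale=6)     # benign tumors: N(20, 6^2)
malignant = norm(loc=40, scale=8)  # malignant tumors: N(40, 8^2)

# With equal priors, the Bayes boundary is where the two PDFs are equal.
# The relevant crossing lies between the two class means, so we bracket
# the root search on [20, 40].
x_star = brentq(lambda x: malignant.pdf(x) - benign.pdf(x), 20, 40)
print(round(x_star, 2))
```

Comparing this numeric `x_star` against the value you derive by equating the two Gaussian densities is a useful sanity check before computing the Bayes error.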

2.13. In empirical risk minimization (ERM), the learning problem is formulated as the following optimization problem:

h* = arg min_{h ∊ H} Remp(h)

where Remp(h) is the empirical risk, defined as the average loss over the training set.

  1. Explain why this problem is considered an optimization problem. Specify what constitutes the objective function, the variables, and the feasible set.

  2. What properties of the loss function and the hypothesis space H would ensure that the empirical risk minimization problem has a unique global minimum?

  3. Why are iterative optimization methods (such as gradient descent) commonly used in practice to solve ERM problems, even when the optimization objective is well-defined?

2.14. In binary classification, each label y ∊ {0, 1} is often assumed to be generated from a Bernoulli distribution with success probability p, where p represents the probability of the label being 1.

  1. Write the likelihood function for a single labeled data point (x, y), assuming the model predicts the probability of class 1 as p.

  2. Suppose we have n independent data points (x1, y1), . . . , (xn, yn), where the model assigns probability pi to class 1 for input xi. Write the likelihood function for the dataset.

  3. Derive the log-likelihood function for the dataset.

  4. Write down the negative log-likelihood (NLL) for the dataset.
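After deriving the NLL on paper, you can verify it numerically. The sketch below (an assumption about one reasonable check, not a required part of the exercise) evaluates the dataset NLL directly from the Bernoulli likelihood; the particular values of `p` and `y` are arbitrary examples.

```python
import numpy as np

def bernoulli_nll(p, y):
    """Negative log-likelihood of labels y under predicted probabilities p.

    Each term contributes -[y*log(p) + (1-y)*log(1-p)], summed over the
    dataset -- which is exactly the binary cross-entropy.
    """
    p = np.asarray(p, dtype=float)
    y = np.asarray(y, dtype=float)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# Arbitrary example: three points with predicted class-1 probabilities p_i.
y = [1, 0, 1]
p = [0.9, 0.2, 0.8]
print(bernoulli_nll(p, y))
```

Plugging the same numbers into your hand-derived formula from part (d) should give the same value.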

2.15. Consider a binary classification problem where the goal is to detect fraudulent transactions. Assume that:

  • Only 2% of the transactions in the population are actually fraudulent.

  • The classifier has 90% sensitivity (i.e., the probability of correctly identifying a fraudulent transaction).

  • The classifier has 85% specificity (i.e., the probability of correctly identifying a legitimate transaction).

  1. Calculate the probability that a transaction is actually fraudulent given that the classifier predicts it as fraudulent.

  2. How does this probability change if the prevalence of fraud in the population increases from 2% to 10%? Explain intuitively why the prevalence affects the result.

  3. Calculate the false positive rate of the classifier (i.e., the probability that the classifier predicts fraud when the transaction is actually legitimate).
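The posterior in part (a) follows directly from Bayes' theorem, and a few lines of code make the prevalence effect in part (b) concrete. This is a sanity-check sketch using only the numbers given in the exercise:

```python
def posterior_fraud(prevalence, sensitivity, specificity):
    """P(fraud | predicted fraud) via Bayes' theorem.

    Numerator: true positives. Denominator: all positive predictions,
    i.e. true positives plus false positives from the legitimate class.
    """
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

print(round(posterior_fraud(0.02, 0.90, 0.85), 4))  # prevalence 2%
print(round(posterior_fraud(0.10, 0.90, 0.85), 4))  # prevalence 10%
```

Note how strongly the answer depends on the prevalence: when fraud is rare, even a sensitive classifier produces mostly false alarms among its positive predictions.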

2.16. In supervised learning, model parameters θ are often estimated by either maximum likelihood estimation (MLE) or Bayesian inference.

  1. Write down the expression for the posterior distribution p(θ|D) of the model parameters given the training data D = {(x1, y1), . . . , (xn, yn)}, according to Bayes’ theorem.

  2. Show that if the prior distribution p(θ) is uniform, maximizing the posterior probability is equivalent to maximizing the likelihood function p(D|θ), and hence equivalent to maximum likelihood estimation.

  3. Briefly explain what this result means in the context of supervised learning.

2.17. Which of the following is likely to decrease the bias of a model and increase its variance? Justify your answer in each case.

  1. Add more training data.

  2. Add more features.

  3. Use a more complex model architecture.

  4. Train the model for a longer time.

  5. Remove outliers from the dataset.

  6. Decrease regularization strength.

2.18. (*) Read the foundational paper on the No Free Lunch (NFL) theorem by Wolpert and Macready (1997) [630] and answer the following questions based on your understanding:

  1. What is the No Free Lunch theorem, and how does it challenge the notion of a universally best model in machine learning?

  2. Provide a real-world example that illustrates the impact of the No Free Lunch theorem on model selection.

  3. How does the NFL theorem influence the approach to algorithm design and evaluation in practice?

2.10.3 Programming Exercises

2.19. (Maximum Likelihood Estimation) Implement maximum likelihood estimation to estimate the parameters of a Gaussian distribution (mean and standard deviation) given a dataset. Use the following steps:

  1. Generate a synthetic dataset of n = 100 data points by sampling from a Gaussian distribution with a true mean of μ = 50 and a true standard deviation of σ = 5. Hint: Use the function np.random.normal to sample from a normal distribution.

  2. Calculate the MLE estimates for the mean and standard deviation of your dataset. Hint: The MLE for the mean is the sample mean, and the MLE for the standard deviation is the sample standard deviation.

  3. Compare the estimated parameters with the true parameters used to generate the data.

  4. Plot a histogram of your sampled dataset. Overlay the probability density function of the Gaussian distribution with the estimated parameters on the same plot. Hint: Use the scipy.stats.norm object and its pdf method.

  5. Explore the effect of sample size on the accuracy of the MLE estimates. Repeat the process with different sample sizes (e.g., n = 10, 50, 100, 500, 1000) and plot how the estimates of the mean and standard deviation vary with sample size.
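A possible starting point for steps 1-3 is sketched below. It uses the np.random.normal function from the hint (with a fixed seed, an added assumption, so runs are reproducible) and the closed-form Gaussian MLEs: the sample mean and the biased (1/n) sample standard deviation.

```python
import numpy as np

np.random.seed(0)  # fixed seed for reproducibility (not required by the exercise)
mu_true, sigma_true, n = 50.0, 5.0, 100

# Step 1: synthetic dataset sampled from N(mu_true, sigma_true^2).
data = np.random.normal(loc=mu_true, scale=sigma_true, size=n)

# Step 2: MLE estimates. Note ddof=0 gives the 1/n (MLE) standard
# deviation rather than the unbiased 1/(n-1) estimator.
mu_mle = data.mean()
sigma_mle = data.std(ddof=0)

# Step 3: compare with the true parameters.
print(f"mu:    true={mu_true}, mle={mu_mle:.2f}")
print(f"sigma: true={sigma_true}, mle={sigma_mle:.2f}")
```

For steps 4 and 5, the same `data` array can be fed to a histogram plot with the fitted `scipy.stats.norm(mu_mle, sigma_mle).pdf` overlaid, and the whole procedure repeated for each sample size in a loop.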

2.20. (Bayes Error) Write a program to estimate the Bayes error rate for a classification task involving two classes with overlapping distributions. Follow these steps:

  1. Define two normal distributions: 𝒩1(0, 0.5) and 𝒩2(1, 1), and plot their probability density functions (PDFs) on the same graph.

  2. Find the intersection point of the two PDFs, corresponding to the decision boundary of the Bayes optimal classifier. Hint: Use scipy.optimize.fsolve to solve for the points where the two PDFs are equal.

  3. Estimate the Bayes error rate by calculating the area under the curve of each distribution up to the intersection point. Hint: Use scipy.integrate.quad to compute these integrals.
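The steps above can be sketched as follows. Two assumptions are made here: the second argument of 𝒩(·, ·) is taken as the standard deviation, and the two classes have equal priors; the sketch also uses only the single crossing between the two means, as the exercise suggests.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import fsolve
from scipy.stats import norm

# Step 1: the two class distributions (second argument read as the std).
p1 = norm(loc=0.0, scale=0.5)
p2 = norm(loc=1.0, scale=1.0)

# Step 2: decision boundary where the PDFs are equal, using fsolve with a
# starting guess between the two means.
x_star = fsolve(lambda x: p1.pdf(x) - p2.pdf(x), x0=0.5)[0]

# Step 3: Bayes error = probability mass each class leaks across the
# boundary, averaged with the (equal) class priors.
err_1 = quad(p1.pdf, x_star, np.inf)[0]   # class 1 mass classified as 2
err_2 = quad(p2.pdf, -np.inf, x_star)[0]  # class 2 mass classified as 1
bayes_error = 0.5 * (err_1 + err_2)
print(round(bayes_error, 4))
```

Plotting `p1.pdf` and `p2.pdf` over a grid (step 1 of the exercise) makes it easy to see visually where `x_star` falls and which tails contribute to the error.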

2.21. (Gradient Descent) In this exercise you will implement the gradient descent algorithm (Algorithm E.1), starting from a simple univariate function and then extending it to more complex multivariate functions.

  1. Implement the gradient descent algorithm for univariate functions. Hint: The gradient at point x can be approximated using the finite difference f′(x) ≈ (f(x + h) − f(x − h)) / (2h), where h is a small number.

  2. Use your implementation to find the minimum of f(x) = x² + 10 sin x starting from a random point.

  3. Plot the function f(x) and overlay on it the path gradient descent took towards the minimum.

  4. Explore the effect of various learning rates (e.g., 1, 0.1, 0.01, 0.001) on the convergence of the algorithm. Discuss your observations.

  5. Extend your implementation to multivariate functions. For this, modify the gradient computation and update steps to handle vectors instead of scalar values. Hint: You can use the function scipy.optimize.approx_fprime() for finite difference approximation.

  6. Use the extended implementation to find the minimum of the function f(x, y) = x² + y² + 10 sin(xy), starting from a random point in two-dimensional space.

  7. Plot the function f(x, y) in 3D space and overlay on it the path gradient descent took towards the minimum.

  8. Test your algorithm with different learning rates, number of iterations, and starting points.
