2.2 Machine Learning Models
Machine learning models are programs or systems that learn from data in order to make predictions. They are built using machine learning algorithms that fit the model parameters to the observed data. The main components of a machine learning model include:
Data: The dataset used to train and evaluate the model. It is typically divided into three subsets:
– The training set is used to train the model.
– The validation set is used to tune the model and assess its performance during development.
– The test set provides an unbiased evaluation of the final model.
Learning algorithm: The method used to train the model. The choice of algorithm depends on the type of task (e.g., classification or regression) and the nature of the data. Learning algorithms are discussed in Section 2.2.1.
Model parameters: Parameters are the components of the model that are learned from the training data, e.g., the coefficients in a linear regression model (see Section 2.2.2).
Loss function: The loss function measures how well the model’s predictions match the actual data. The goal of training is to minimize this function. Loss functions are discussed in Section 2.2.3.
Learning process: This process involves using a learning algorithm to optimize the model parameters in order to best fit the training data, according to the chosen loss function.
Optimization techniques: Methods that are used to find the optimal parameters of the model by minimizing the loss function, such as gradient descent (see Section 2.2.4).
Hyperparameters: Unlike parameters, hyperparameters are not learned from the data. They are set prior to the learning process and control the model’s structure or learning behavior (see Section 2.2.5).
Evaluation metrics: Metrics that are used to assess the model’s performance after training (see Section 2.2.6).
Figure 2.3: Commonly used supervised machine learning algorithms. Starting from top-left and going clockwise: Logistic regression, k-nearest neighbors, neural networks, and decision trees.
2.2.1 Learning Algorithms
A wide range of supervised learning algorithms are available, each with its own strengths and weaknesses (see Figure 2.3). The most commonly used supervised learning algorithms are:
Linear regression (Chapter 4)
Logistic regression (Chapter 5)
K-nearest neighbors (Chapter 6)
Naive Bayes (Chapter 7)
Decision trees (Chapter 8)
Support vector machines (Chapter 11)
Neural networks (covered in Volume II)
Some of these algorithms can handle both regression and classification tasks, while others can handle only one type of problem (e.g., Naive Bayes can only handle classification problems). Ensemble methods (see Chapter 9), such as random forests and gradient boosting, combine multiple algorithms to create more powerful models.
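The core idea behind ensembles can be illustrated with a minimal majority-vote sketch in Python. The three toy "classifiers" below are hypothetical stand-ins for trained models; a real ensemble would combine actual fitted models:

```python
# Minimal sketch of a majority-vote ensemble over toy classifiers.
def majority_vote(classifiers, x):
    votes = [clf(x) for clf in classifiers]
    return max(set(votes), key=votes.count)   # most frequent predicted class

# Three hypothetical "models" that each threshold a different feature.
clf_a = lambda x: int(x[0] > 0.5)
clf_b = lambda x: int(x[1] > 0.5)
clf_c = lambda x: int(x[0] + x[1] > 1.0)

print(majority_vote([clf_a, clf_b, clf_c], (0.9, 0.2)))   # votes are 1, 0, 1 -> 1
```

Even when each individual model errs on some inputs, the combined vote can be more accurate than any single member, which is the intuition behind random forests.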
The no free lunch theorem [630] states that no single learning algorithm is the best for all problems. More specifically, any two learning algorithms perform equally well when their performance is averaged over all possible data distributions. This theorem underscores the importance of choosing algorithms that are well-suited to the specific characteristics of the problem at hand.
Another general principle, known as Occam’s razor [57], suggests that if two models explain the known observations equally well, the simpler model should be preferred (e.g., the model with fewer parameters or fewer assumptions about the domain). This principle, which is widely adopted across scientific disciplines, emphasizes simplicity and parsimony in explanatory models.
2.2.2 Model Parameters
Many machine learning models are parametric models, i.e., they have a set of learnable parameters that are optimized to best fit the training data. This set of parameters is typically denoted by θ or w, and the model’s hypothesis, which depends on these parameters, is often written as h(x; θ) or hθ(x).
For example, in linear regression, the model’s hypothesis is a function of the form h(x) = w0 + w1x1 + ... + wdxd, and the model parameters are the coefficients (or weights) of the features: w = (w0, ..., wd)T. In neural networks, the parameters are the weights of the connections between neurons in the network and the neuron biases.
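The linear hypothesis above can be sketched in a few lines of Python; the weight values here are hypothetical placeholders, not learned parameters:

```python
# Sketch: evaluating the linear hypothesis h(x) = w0 + w1*x1 + ... + wd*xd.
def h(x, w):
    """w = (w0, w1, ..., wd) includes the intercept; x = (x1, ..., xd)."""
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

w = (1.0, 2.0, -0.5)   # hypothetical parameter vector (intercept and 2 weights)
x = (3.0, 4.0)         # one input with d = 2 features
print(h(x, w))         # 1.0 + 2.0*3.0 - 0.5*4.0 = 5.0
```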
The number of parameters is strongly correlated with the model’s complexity. If a model has too few parameters, it may fail to capture the complexity of the data, resulting in poor performance. On the other hand, a model with too many parameters, or whose parameters are too finely tuned to the training data, may perform very well on the training data but poorly on new, unseen data (a phenomenon known as overfitting).
Nonparametric models, such as k-nearest neighbors and decision trees, do not have a fixed number of parameters. This flexibility allows them to adapt more freely to the intricacies of the data and capture complex patterns that parametric models might miss. On the other hand, these models usually require more data to make accurate predictions and tend to be more prone to overfitting.
2.2.3 Loss Functions
An important part of building a supervised learning model is selecting an appropriate loss function. A loss function, denoted by L(y, ŷ), measures the error between the model’s output for a given input ŷ = h(x) and the true label y = f(x). For parametric models with a parameter vector θ, it is common to denote the loss function as L(y, h(x); θ), or L(θ) for short, to underscore the function’s dependence on the model parameters.
The loss function guides the optimization process of the model and determines how it adjusts its parameters during training to minimize the prediction error. Consequently, different loss functions can lead to different behaviors of the model. For example, some loss functions prioritize overall accuracy, while others focus on robustness to outliers or improved performance on under-represented classes.
Desirable properties of a loss function include:
Task-specific relevance: The loss function should align with the specific objective of the task the model is trying to solve. For example, in regression tasks, a common loss function is the squared loss (y–h(x))2, which penalizes the squared difference between the predicted value and the actual value (see Section 4.2).
Symmetry: The loss should be the same for an error above or below the target value.
Differentiability: Most optimization algorithms used in machine learning require the loss function to be continuous and differentiable, with some requiring it to be twice differentiable.
Convexity: Convex functions (see Section B.11.2) have the desirable property that any local minimum is also a global minimum, which makes them easier to optimize.
Computational efficiency: The loss function should be fast to compute, especially when dealing with large datasets.
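The trade-offs above can be seen by contrasting two common losses. The following sketch shows how a single large error (an outlier) dominates the squared loss far more than the absolute loss:

```python
# Sketch of two common loss functions from the discussion above.
def squared_loss(y, y_hat):
    return (y - y_hat) ** 2   # convex, differentiable, symmetric

def absolute_loss(y, y_hat):
    return abs(y - y_hat)     # more robust to outliers; not differentiable at 0

# An outlier with error 10 contributes 100 to the squared loss
# but only 10 to the absolute loss.
print(squared_loss(10, 0), absolute_loss(10, 0))   # 100 10
```

Both losses are symmetric: an error of +2 and an error of −2 incur the same penalty.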
2.2.4 Optimization
In supervised learning, the learning process typically involves finding the set of model parameters θ that minimizes the error on the training set. This turns the learning task into an optimization problem whose objective is to minimize a cost function—typically defined as the average loss over all training samples.
Most machine learning models do not admit closed-form solutions for this optimization problem, especially when the loss function is non-convex or the model is complex. In such cases, optimization is usually performed using iterative algorithms that update the parameters incrementally to reduce the error. The choice of optimization method plays a crucial role in the efficiency and success of the learning process.
Gradient-based methods, such as gradient descent and its variants, are among the most widely used optimization techniques in machine learning. These methods iteratively adjust the parameters in the direction of the negative gradient of the loss function. In some cases, second-order methods like Newton’s method, which leverage curvature information via second derivatives, can achieve faster convergence. Constrained optimization techniques are also used when the learning task involves additional constraints, as in the case of support vector machines.
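A minimal sketch of gradient descent on a one-parameter linear model with the average squared loss as the cost function (the toy dataset below is made up for illustration):

```python
# Sketch: gradient descent minimizing the average squared loss of a
# one-parameter model h(x) = theta * x on a toy dataset.
data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]   # (x, y) pairs, roughly y = 2x

def gradient(theta):
    # d/dtheta of mean (theta*x - y)^2  =  mean 2*x*(theta*x - y)
    return sum(2 * x * (theta * x - y) for x, y in data) / len(data)

theta, lr = 0.0, 0.05   # initial guess and learning rate
for _ in range(200):
    theta -= lr * gradient(theta)   # step in the negative gradient direction

print(round(theta, 2))   # converges close to 2.0
```

Because the cost is convex in theta here, the iterates converge to the unique global minimizer; for non-convex losses, gradient descent may only reach a local minimum.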
These and other optimization methods are discussed in detail in Appendix E. Readers unfamiliar with optimization are strongly encouraged to review this appendix to gain a deeper understanding of the mathematical tools that underpin many of the algorithms described in this book.
Optimization by itself is a challenging problem with no single method that works best in all situations. In machine learning, this challenge is further compounded by the fact that the true objective is not merely to minimize the training error, but to reduce the generalization error, that is, the model’s performance on unseen data. Unlike the training error, which can be directly measured, generalization error is inherently unobservable during training, adding an additional layer of complexity to the learning process.
2.2.5 Hyperparameters
Hyperparameters are configurable settings of the learning algorithm that define and control its behavior. They are set prior to the training process, in contrast to model parameters, which are learned from the data during training. Common examples of hyperparameters include the learning rate in gradient-based algorithms, and the number of hidden layers and neurons in neural networks.
Hyperparameter tuning is the search for the combination of hyperparameter values that yields the best model performance. This process is a critical step in the machine learning pipeline, as improperly tuned hyperparameters can lead to models that either perform poorly or overfit the training data. For example, setting a learning rate that is too high in gradient descent can cause the model to oscillate around the minimum point without converging.
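The effect of the learning rate can be demonstrated on the simple function f(θ) = θ², whose gradient is 2θ and whose minimum is at θ = 0. This sketch is purely illustrative:

```python
# Sketch: the learning rate's effect on gradient descent for f(theta) = theta**2.
def run(lr, steps=10, theta=1.0):
    history = []
    for _ in range(steps):
        theta -= lr * 2 * theta   # gradient of theta**2 is 2*theta
        history.append(round(theta, 3))
    return history

print(run(0.1))   # each step shrinks theta by a factor of 0.8: converges to 0
print(run(1.1))   # each step multiplies theta by -1.2: oscillates and diverges
```

With lr = 0.1 the iterates shrink steadily toward the minimum; with lr = 1.1 each step overshoots, flipping sign and growing in magnitude, exactly the oscillation without convergence described above.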
Hyperparameter tuning techniques range from manual trial-and-error experimentation to more principled methods such as grid search or random search, which systematically explore the hyperparameter space in order to find the most effective configuration settings (see Section 3.8).
When tuning hyperparameters, it is essential to evaluate the model’s performance on an independent dataset, separate from the training and test sets. Using the test set for repeated evaluations during model development or hyperparameter tuning can contaminate the test data, leading the model to become overfitted to it. As a result, the model may end up working well on the test set, but not on truly unseen data.
Therefore, it is a common practice to divide the available training data into two disjoint subsets: a training set and a validation set. The model is trained on the training set using the chosen hyperparameters, and then its performance is evaluated on the validation set. The model’s performance on the validation set provides an estimate of how the model will perform on unseen data, guiding the selection of the best hyperparameters.
A common split for dividing the dataset is a 70-20-10 split, where 70% of the data is used for training the model, 20% is used for validation, and 10% is reserved for testing the model after training and tuning (see Figure 2.4). However, these percentages may vary based on the size and characteristics of the dataset.
Figure 2.4: A typical 70-20-10 partition of the dataset into training, validation, and test sets. The training set is used to fit the model, the validation set helps tune hyperparameters and select the best model, and the test set provides an unbiased evaluation of the final model’s performance.
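A 70-20-10 split like the one in Figure 2.4 can be sketched as follows (the shuffling seed is arbitrary; shuffling before splitting avoids bias from any ordering in the data):

```python
import random

# Sketch of a 70-20-10 split into training, validation, and test sets.
def split_dataset(data, seed=0):
    data = data[:]                    # copy so the caller's list is untouched
    random.Random(seed).shuffle(data) # shuffle to remove ordering effects
    n = len(data)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return data[:n_train], data[n_train:n_train + n_val], data[n_train + n_val:]

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))   # 70 20 10
```

The three subsets are disjoint by construction, which is essential: any overlap between them would leak information and bias the evaluation.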
Using a single validation set can sometimes result in a biased evaluation, especially if this set is small or not representative of the overall data. Cross-validation is a technique that addresses this issue by dividing the training data into multiple subsets (or folds) [79]. The model is then trained and validated multiple times, each time using a different subset as the validation set and the rest as the training set. This process yields multiple performance metrics (such as accuracy or error rate), which are averaged to get a more reliable estimate of the model’s performance. Cross-validation is discussed in more detail in Section 3.7.
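The k-fold procedure just described can be sketched in a few lines; train_and_score below is a hypothetical user-supplied function that trains a model on one split and returns its validation metric:

```python
# Sketch of k-fold cross-validation: each fold serves once as the
# validation set, and the k metrics are averaged.
def cross_validate(data, k, train_and_score):
    fold_size = len(data) // k
    scores = []
    for i in range(k):
        val = data[i * fold_size:(i + 1) * fold_size]
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]
        scores.append(train_and_score(train, val))
    return sum(scores) / k   # average metric over the k folds

# Hypothetical scorer that just reports the validation-set size, for illustration.
avg = cross_validate(list(range(10)), 5, lambda train, val: len(val))
print(avg)   # 2.0
```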
2.2.6 Evaluation Metrics
Evaluation metrics are used to assess the performance of the model after training. In contrast to cost functions, which define the model’s training objective and guide its optimization process, evaluation metrics are applied post-training to measure the model’s performance on the validation or test sets. Evaluation metrics allow us to compare different models or assess the same model under varying hyperparameters. Different machine learning tasks call for different evaluation metrics:
For regression tasks, common metrics include root mean squared error (RMSE), mean absolute error (MAE), and the R2 score (see Section 4.5). These metrics evaluate the difference between the predicted values and the actual target values.
For classification tasks, metrics like accuracy, precision, recall, F1 score, and the area under the ROC curve (AUC) are widely used (see Section 5.5). These metrics provide different insights into the classifier’s performance, such as its overall correctness or its ability to minimize false positives or false negatives.
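Several of the classification metrics above follow directly from the counts of true/false positives and negatives. A minimal sketch for the binary case (the labels below are made up for illustration; the sketch assumes at least one positive prediction and one positive label, so the denominators are nonzero):

```python
# Sketch: precision, recall, and F1 score for binary classification.
def precision_recall_f1(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp)               # fraction of positive predictions that are correct
    recall = tp / (tp + fn)                  # fraction of actual positives that are found
    f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
    return precision, recall, f1

y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0]
print(precision_recall_f1(y_true, y_pred))
```

Precision penalizes false positives while recall penalizes false negatives, which is why the two are typically reported together (or combined into the F1 score).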
In addition to the standard metrics, it is often necessary to develop custom metrics tailored to the specific requirements or business objectives of the problem domain. For example, when developing a recommender system for an e-commerce website, we may be interested in evaluating not only the accuracy of the system’s recommendations, but also how much these recommendations lead to increased sales. In this case, we may develop a custom metric such as sales conversion rate of recommended products, which measures the percentage of recommended products that resulted in a purchase.
