How Models Are Trained
In the examples we have explored so far, models include parameters, such as a, b, or a1, a2, etc., which are symbolic placeholders for real numeric values. The primary goal of machine learning is to learn these values during a process called training. Training teaches a model to recognize patterns, make predictions, or generate new content based on a dataset. It works by iteratively adjusting the model’s parameters to minimize the difference between the model’s predictions and the actual outcomes, which optimizes its performance on a specific task.
Training begins by feeding a large set of data (called the training set) into the model. In the case of modeling how long it takes a car to stop, the training set would be a large collection of observed value pairs {speed, time to stop}, and training would involve asking the computer to find the parameters that best represent the relationship between the speeds and the times to stop.
This learning phase is somewhat like a brute-force process, where the program tries a large set of numbers until it finds the “least bad” values for the parameters to model the data. We use the expression least bad because in most cases, the training phase does not find perfect numbers: It simply finds the best numbers that can be inferred from the training data. In the car example, different cars with different brakes, tires, or shock absorbers may take different times to come to a complete stop, even when starting from the same speed, which means the final parameters will have a margin of error. They will represent the behavior of most of the cars. The parameters may not be perfect for each and every car, but they are the best ones to model the training set.
In practical terms, the real world is “noisy”: Unknown or unmeasured variables influence the outcomes of experiments. These hidden variables are not included in the dataset because they are unknown or nearly impossible to measure. As a result, the experimenter only records observed variables—variables that are known and accounted for, such as speed. Because of this limitation, a model typically cannot learn its parameters perfectly. Instead, it estimates parameters (such as coefficients a and b) that work best on average, minimizing the gap between the predicted outcomes and the actual measurements. The model then outputs a likely, calculated value of y that is close to, but rarely exactly the same as, the real-world value of y observed during the experiment.
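To make this concrete, here is a minimal sketch of the car example using a simple linear model, time = a × speed + b. The "true" coefficients, the noise level, and the data are all invented for illustration; the point is that a least-squares fit recovers parameters that are close to, but rarely exactly, the true values hidden behind the noise:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" relationship: time = a * speed + b.
true_a, true_b = 0.12, 0.5

speeds = rng.uniform(20, 120, size=200)       # observed variable
noise = rng.normal(0, 0.3, size=200)          # stands in for hidden variables
times = true_a * speeds + true_b + noise      # observed outcomes

# Least-squares fit: the "least bad" a and b for this training set.
a, b = np.polyfit(speeds, times, deg=1)
print(a, b)  # close to 0.12 and 0.5, but not exact
```

The learned a and b are estimates that work best on average over the training set, which is exactly the behavior described above.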
The goal of the model is not to match every individual (and potentially noisy) data point but rather to find the general relationship between inputs and outputs. For example, even if the stopping time of a car varies slightly due to unmeasured conditions, the model still allows you to predict the approximate time the car will take to come to a full stop. Naturally, the more observed variables your model includes, the more accurate its predictions are likely to be.
A measurement of the difference between the calculated (predicted) y and the real value is called the loss. The loss can be computed in a few different ways, but it is generally represented by a loss function. The goal of the training phase is to continually refine the parameters in a way that minimizes the loss, allowing the model to accurately reflect the relationship between inputs and outputs.
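One common loss function is the mean squared error: the average of the squared gaps between the predictions and the observed values. In this sketch, the training pairs and the two candidate parameter settings are made up for the car example; it simply shows that better parameters yield a smaller loss:

```python
import numpy as np

# Hypothetical training pairs: {speed, time to stop}.
speeds = np.array([30.0, 50.0, 70.0, 90.0, 110.0])
times = np.array([4.1, 6.4, 8.9, 11.2, 13.6])

def mse_loss(a, b):
    # Mean squared error: average squared gap between prediction and reality.
    predicted = a * speeds + b
    return np.mean((predicted - times) ** 2)

print(mse_loss(0.12, 0.5))   # close fit, small loss
print(mse_loss(0.20, 0.0))   # poor fit, much larger loss
```

Training amounts to searching for the parameter values that drive this number as low as possible.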
The details of how the loss function is minimized depend on the type of machine learning technique that is in play. However, all training phases have some common traits. In general, the larger the training set, the better the model. This is because the presence of many noisy data points tends to average out over time, allowing the relevant patterns to emerge more clearly. It is a principle we encounter in everyday life. For example, if you show someone just five pictures of cats, they might conclude that the defining traits of a cat are simply fur and pointy ears. But those features also describe many dogs. With more examples, the distinctive features of cats become easier to identify.
Similarly, if someone tries to cook a dessert, such as crème brûlée, after tasting it only twice, their results are unlikely to be great. But after many tastings, they start to pick up on the dessert’s defining characteristics and notice that it is not just sweet and creamy but also has a caramelized crust, a hint of vanilla, a custard texture, and so on. Machine learning follows the same principle: More data allows the model to distinguish meaningful traits more accurately.
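This "more data averages out the noise" principle can be checked numerically. In the sketch below (the true coefficients, noise level, and sample sizes are all invented), the same linear fit is repeated many times on small and on large training sets, and the average error of the learned slope shrinks as the training set grows:

```python
import numpy as np

rng = np.random.default_rng(2)
true_a, true_b = 0.12, 0.5

def fit_error(n_samples, trials=200):
    # Average error of the learned slope across many repeated experiments.
    errs = []
    for _ in range(trials):
        x = rng.uniform(20, 120, size=n_samples)
        y = true_a * x + true_b + rng.normal(0, 0.3, size=n_samples)
        a, _ = np.polyfit(x, y, deg=1)
        errs.append(abs(a - true_a))
    return np.mean(errs)

print(fit_error(5), fit_error(500))  # larger training set -> smaller error
```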
Larger and more varied datasets are generally more desirable, but a larger training set also comes with a downside: It takes longer to train, which can mean increased cost (in terms of GPU time, power, the cost of waiting to see whether the training worked, and so on). The training phase is often a balancing act between the desired accuracy of the model (“as good as possible” is usually too vague) and the acceptable training time (“as fast as possible” is also too vague). There are many tools dedicated to estimating the training time and measuring the accuracy of a trained model.
