Training, Validation, and Test Datasets

Training a machine learning model is a process of exposing it to examples so it can learn how inputs relate to desired outputs. This learning phase typically involves feeding the model a large dataset, called the training set, and letting the model adjust internal parameters to minimize error.

Rather than training on all the data at once, most models learn in batches, which are small groups of data points that are processed together. This incremental approach limits the amount of memory needed for training, and it also lets the model update its parameters and refine its performance in steps. A full pass through the training set is called an epoch, and training often involves many epochs.
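The relationship between batches and epochs can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names (iterate_batches, train_linear), the toy data, the straight-line model, and the settings such as batch size and learning rate are all hypothetical choices made for the example, not something prescribed by this chapter.

```python
import random


def iterate_batches(examples, batch_size, rng):
    """Yield shuffled batches that together cover every example once."""
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


def train_linear(examples, epochs=200, batch_size=4, learning_rate=0.3, seed=0):
    """Fit y = w * x + b by gradient descent on the mean squared error."""
    rng = random.Random(seed)
    w, b = 0.0, 0.0
    batches_seen = 0
    for _ in range(epochs):  # one epoch = one full pass over the training set
        for batch in iterate_batches(examples, batch_size, rng):
            # Average gradient of the squared error over this batch only.
            grad_w = sum(2 * (w * x + b - y) * x for x, y in batch) / len(batch)
            grad_b = sum(2 * (w * x + b - y) for x, y in batch) / len(batch)
            # Incremental update: the parameters change after every batch.
            w -= learning_rate * grad_w
            b -= learning_rate * grad_b
            batches_seen += 1
    return w, b, batches_seen


# Hypothetical toy data generated from y = 2x + 1, with no noise.
toy_examples = [(i / 10, 2 * (i / 10) + 1) for i in range(10)]
w, b, batches_seen = train_linear(toy_examples)
print(f"w={w:.3f}, b={b:.3f}, updates={batches_seen}")
```

With 10 examples and a batch size of 4, each epoch consists of three batches (4, 4, and 2 examples), so the parameters are updated three times per pass. The learned values should move toward the slope and intercept used to generate the data.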

But models are not judged solely on how well they perform during training. A good model must also generalize, which means it should work well on data it has not seen before. To this end, a dataset is typically split into three parts:

  • Training set: The training set, which is typically 60% to 80% of the dataset, is the largest portion. It is used to train the model and adjust its parameters.

  • Validation set: The validation set, which is typically 10% to 20% of the dataset, is used during development to tune the model and monitor its performance. This set helps to detect overfitting (which is discussed later in this chapter).

  • Test set: The test set, which is typically 10% to 20% of the dataset, is used only at the final stage to evaluate the model’s performance on unseen data before deployment.
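As a rough sketch of the three-way split described above, the following Python function shuffles a dataset and cuts it into training, validation, and test portions. The function name split_dataset, the 70/15/15 default proportions, and the fixed seed are assumptions chosen for illustration; any proportions within the ranges listed above would work the same way.

```python
import random


def split_dataset(examples, train_fraction=0.7, validation_fraction=0.15, seed=42):
    """Shuffle the examples and split them into three disjoint parts."""
    if train_fraction <= 0 or validation_fraction < 0:
        raise ValueError("fractions must be positive")
    if train_fraction + validation_fraction >= 1:
        raise ValueError("fractions must leave room for a test set")

    shuffled = list(examples)
    # Shuffle first so each part is a random sample of the whole dataset.
    random.Random(seed).shuffle(shuffled)

    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)

    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test


# Hypothetical dataset of 100 items, represented here by the integers 0 to 99.
train_set, validation_set, test_set = split_dataset(range(100))
print(len(train_set), len(validation_set), len(test_set))
```

Because the three slices come from one shuffled list, no example can appear in more than one set, which is what keeps the validation and test data unseen during training.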

After the training phase, the next step is to assess how well the model performs by testing it against the validation set. This step measures the model’s accuracy during development and lets the model designer adjust design settings (often called hyperparameters) to improve performance, if needed. Because the model was not trained on the validation data, the validation phase simulates how the model might perform on new, unseen inputs.

In most cases, the model will perform slightly worse on the validation set than on the training set. If the difference in performance is small (and you define what qualifies as “small enough”), then the model can be considered acceptable for use. However, if the performance gap is large, the model may not generalize well and likely needs improvement. At that point, you discard the underperforming model, adjust aspects of its design, and train a new version using the same training set. You then evaluate the updated model against the validation set. You repeat this process until you are satisfied with the results or until you decide that the model cannot be improved further.
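The train, validate, and adjust loop described above can be sketched as follows. Everything specific here is an assumption made for illustration: the synthetic noisy data, the use of a simple k-nearest-neighbors predictor (where the number of neighbors k stands in for "aspects of its design"), the candidate values of k, and the max_gap threshold that plays the role of "small enough."

```python
import random


def make_noisy_data(n, seed):
    """Hypothetical data: y = 2x + 1 plus Gaussian noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(0, 10)
        data.append((x, 2 * x + 1 + rng.gauss(0, 2.0)))
    return data


def knn_predict(training_set, x, k):
    """Average the outputs of the k training points closest to x."""
    nearest = sorted(training_set, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / len(nearest)


def mean_squared_error(training_set, data, k):
    """Error of a k-neighbor model built on training_set, measured on data."""
    return sum((knn_predict(training_set, x, k) - y) ** 2 for x, y in data) / len(data)


def train_and_validate(training_set, validation_set, candidate_ks, max_gap):
    """Try each design in turn; stop at the first with an acceptable gap."""
    history = []
    for k in candidate_ks:
        # Every candidate is built from the same training set.
        train_error = mean_squared_error(training_set, training_set, k)
        validation_error = mean_squared_error(training_set, validation_set, k)
        history.append((k, train_error, validation_error))
        if validation_error - train_error <= max_gap:
            return k, history  # the gap is small enough; accept this design
    return None, history  # no candidate met the threshold


training_set = make_noisy_data(200, seed=1)
validation_set = make_noisy_data(50, seed=2)
chosen_k, history = train_and_validate(
    training_set, validation_set, candidate_ks=[1, 5, 15, 40], max_gap=1.0
)
for k, train_error, validation_error in history:
    print(f"k={k}: train={train_error:.2f}, validation={validation_error:.2f}")
print("chosen k:", chosen_k)
```

With k=1 the model reproduces every training point exactly, so its training error is zero while its validation error is not, which is the large gap the text warns about. Note that this sketch checks only the gap, as the text does; in practice you would likely also require the validation error itself to be acceptably low.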

Figure 2-3 shows this iterative cycle of training and validating.

Figure 2-3 The training cycle

Once the performance of the model on the validation set is deemed acceptable, the model is typically evaluated one final time before production, using the test set, which is a separate portion of the data. Like the validation set, the test set has not been used during training, which means that from the model’s perspective, it is entirely new data. If the model has been trained properly, its performance on the test set should be comparable to its performance on the validation set.

Dividing a dataset into training, validation, and test sets allows you to measure how well the model is learning and whether it is likely to perform reliably in the real world. If a model performs well on training data but poorly on validation or test data, it may not be learning meaningful patterns but only memorizing specific examples. We will revisit this challenge, known as overfitting, later in the chapter.
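To make the difference between memorizing and learning concrete, here is a deliberately poor model that only stores its training examples in a lookup table. The class MemorizingModel, the even/odd labeling task, and the split sizes are hypothetical and exist only for this sketch.

```python
import random
from collections import Counter


class MemorizingModel:
    """Deliberately poor model: a lookup table of the training examples."""

    def fit(self, examples):
        self.table = dict(examples)
        # Fallback for unseen inputs: the most common training label.
        self.default = Counter(label for _, label in examples).most_common(1)[0][0]
        return self

    def predict(self, x):
        return self.table.get(x, self.default)


def accuracy(model, examples):
    """Fraction of examples whose predicted label matches the true label."""
    return sum(model.predict(x) == label for x, label in examples) / len(examples)


# Hypothetical task: label each integer as "even" or "odd".
examples = [(n, "even" if n % 2 == 0 else "odd") for n in range(300)]
random.Random(7).shuffle(examples)
train_set, validation_set, test_set = examples[:210], examples[210:255], examples[255:]

model = MemorizingModel().fit(train_set)
scores = {
    "train": accuracy(model, train_set),
    "validation": accuracy(model, validation_set),
    "test": accuracy(model, test_set),
}
print(scores)
```

The model scores perfectly on the training set because it has seen every one of those inputs, but on the validation and test sets it can only guess a single fallback label. A large drop of this kind between training accuracy and validation or test accuracy is the signal that the model has memorized examples rather than learned the underlying pattern.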

Once a model is finalized, it can be deployed and used to make predictions on entirely new real-world data. This stage, where the trained model receives inputs and produces outputs, is called the inference phase.
