How to Measure Model Accuracy
Accuracy depends greatly on the type of task. In the example of predicting the braking time of a car, you do not expect the model to be perfect (because the world is noisy). However, you want the model-predicted outputs to be as close as possible to the actual observed values. Consider a variation in which the model predicts how long a car needs to stop before reaching a stop sign. The difference between the model's predicted stopping times and the real-world stopping times can be measured using the training data. This difference gives you an estimate of the model's accuracy or, more precisely, its error. For example, you might find that the model predicts stopping time with an average error of ±5% compared to the actual observed times.

In other cases, the model may be used for a classification task, such as predicting whether a car should issue a driver alert based on current speed and driving conditions. For example, imagine a system that monitors the car speed, distance to the vehicle ahead, lighting conditions, rate of deceleration, weather conditions, and so on. Its goal is to classify the situation as either “safe” or “issue alert.” The model might predict: “At this speed and following distance, is there a high likelihood of needing to warn the driver?”
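As a rough sketch of the first case, the average percentage error of a stopping-time model can be computed by comparing predictions with observed values. The numbers below are invented for illustration:

```python
# Hypothetical observed vs. model-predicted stopping times (in seconds).
actual = [3.0, 4.5, 2.8, 5.2, 3.6]
predicted = [3.1, 4.3, 2.9, 5.5, 3.5]

# Mean absolute percentage error: average of |predicted - actual| / actual.
mape = sum(abs(p - a) / a for p, a in zip(predicted, actual)) / len(actual)
print(f"Average error: {mape:.1%}")
```

With these made-up values the model is off by about 4% on average, which is the kind of "±5% error" figure described above.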
The model's accuracy refers to how often it makes this decision correctly. If it predicts that an alert should be issued, did the situation actually warrant one, or was it a false positive? If the model has only 50% accuracy, it is no better than flipping a coin: It fails to provide any meaningful guidance. In contrast, an 80% accuracy means the model is right four times out of five but still makes errors in 20% of cases. Depending on the consequences, this may or may not be acceptable. For a life-critical system like collision avoidance, that error rate might be too high. But for noncritical systems, like adjusting cruise control behavior or suggesting breaks during long drives, 80% might be considered acceptable, especially compared to random guessing.
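Classification accuracy is simply the fraction of decisions the model gets right. A minimal sketch, using invented labels for the "safe" (0) versus "issue alert" (1) classifier described above:

```python
# Hypothetical ground-truth labels and model predictions.
# 1 = "issue alert", 0 = "safe".
actual    = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Accuracy: the fraction of predictions that match the ground truth.
correct = sum(a == p for a, p in zip(actual, predicted))
accuracy = correct / len(actual)
print(f"Accuracy: {accuracy:.0%}")  # 8 of 10 correct -> 80%
```

An accuracy of 0.5 on balanced data like this would be indistinguishable from coin flipping.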
Ultimately, the acceptable level of accuracy depends on the context and the cost of being wrong. In high-risk scenarios, even a seemingly good model might not be reliable enough without further fine-tuning or safeguards.
When training an AI model, you may achieve a higher accuracy rate, but the importance of that accuracy depends on the consequences of being wrong. If the model occasionally issues an alert when none is needed (a false positive), the result may be mild annoyance for the driver. Perhaps they slow down unnecessarily or dismiss the warning. But if the model fails to issue an alert when a real hazard is present (a false negative), the consequences could be severe and result in a collision with another vehicle or an obstacle. This issue highlights a core principle in AI: Acceptable accuracy depends on the cost of mistakes.
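The asymmetry between the two error types can be made concrete by counting them separately and weighting them by cost. This is a sketch with invented labels and an arbitrary illustrative cost ratio, not a calibrated safety model:

```python
# 1 = "issue alert" (hazard present), 0 = "safe". Hypothetical data.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 1, 0, 1, 0, 0, 1, 0]

# False positive: alert issued when none was needed (driver annoyance).
false_positives = sum(p == 1 and a == 0 for a, p in zip(actual, predicted))
# False negative: no alert despite a real hazard (potential collision).
false_negatives = sum(p == 0 and a == 1 for a, p in zip(actual, predicted))

# A missed hazard is far costlier than a needless alert; the 20:1
# weighting here is purely illustrative.
cost = 1 * false_positives + 20 * false_negatives
print(false_positives, false_negatives, cost)
```

Two models with identical accuracy can have very different weighted costs, which is why accuracy alone is not enough to judge a safety-relevant system.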
The accuracy of a model must be measured holistically, not just on the training, validation, or test sets. It must also be measured on new data during the inference phase once the model is deployed in the wild. One common issue in AI projects is building a model that is too simple and fails to capture important relationships in the data (for example, trying to fit a straight line to data that actually follows a more complex curve). Intuitively, this is like trying to explain a complex abstract subject in the language of a 5-year-old. You might get the essence of the idea across, but you will miss a tremendous amount of detail. This situation is known as underfitting. An underfit model typically shows poor accuracy across the board—on training, validation, test, and real-world data—making it a weak predictor.
At the other end of the spectrum, you might overcompensate by tweaking your model with so much detail that the model tries to match the training data too closely, capturing noise and irregularities that are not part of the underlying pattern. This is called overfitting. While the model may show high accuracy on the training set, and sometimes even on the validation and test sets, it often performs poorly when faced with new data it has not seen before during inference.
An ideal model strikes a balance. It generalizes well, producing similar accuracy across validation, test, and real-world data. In other words, a good model does not just fit the training data; it learns the underlying patterns well enough to generalize reliable predictions. Figure 2-4 illustrates overfitting, underfitting, and fitting that is just right.
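Both failure modes can be shown in a toy, standard-library-only sketch (all data invented): a straight line underfits quadratic data and has high error even on its own training set, while a model that simply memorizes the training points achieves zero training error but cannot answer for any new input at all, the extreme form of failing to generalize:

```python
import random

random.seed(0)

# Hypothetical data: the true relationship is quadratic plus noise.
train_x = [i / 10 for i in range(-10, 11)]
train_y = [x * x + random.gauss(0, 0.05) for x in train_x]

# Underfit model: a straight line y = m*x + b, fit by least squares.
n = len(train_x)
mean_x = sum(train_x) / n
mean_y = sum(train_y) / n
m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(train_x, train_y))
     / sum((x - mean_x) ** 2 for x in train_x))
b = mean_y - m * mean_x

def underfit(x):
    return m * x + b

# Overfit model: memorize the training points exactly (a lookup table).
table = dict(zip(train_x, train_y))

def mse(model, xs, ys):
    """Mean squared error of a model on a dataset."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# The line misses the curve, so its training error stays high.
print("line, train MSE: ", mse(underfit, train_x, train_y))
# The lookup table reproduces the training data perfectly (MSE = 0),
# yet it has no answer for any x outside the training set.
print("table, train MSE:", mse(table.__getitem__, train_x, train_y))
```

A well-fit model (here, a quadratic) would sit between the two: low training error and, unlike the lookup table, sensible predictions for unseen inputs.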
How accurate a model needs to be often touches fields beyond strict machine learning. For example, imagine a company that installs an app on employees’ phones to detect inappropriate content and automatically file a public police report when it finds such content. In this case, even 95% accuracy could be problematic because it means that for every 1,000 reported violations, 50 would be completely innocent. Is such a number acceptable? The answer depends on corporate policy and culture, which are elements that are beyond the field of AI and machine learning.

