2.3 The Data-Generating Process

In machine learning, we commonly assume that the observed data represents a random sample drawn from some underlying population. Formally, a random sample is a collection of random variables X1, X2, . . . , Xn that are independent and identically distributed (i.i.d.), each having the same distribution as a population variable X (see Section D.2).

In supervised machine learning, each observation consists of both an input (feature vector) and an output (label). Thus, we model each observation as a pair (x, y) drawn from a joint distribution p(x, y), often referred to as the data-generating distribution. The dataset 𝒟 = {(x1, y1), . . . , (xn, yn)} is assumed to consist of i.i.d. samples from this distribution, an assumption known as the i.i.d. assumption. This means that:

  • Each pair (xi, yi) is independent of the others, and

  • All pairs are drawn from the same joint distribution p(x, y).

For example, in spam detection, each observation consists of an email represented by features x (such as word frequencies or sender information) and a label y indicating whether the email is spam (y = 1) or not (y = 0). The data-generating process in this context refers to the probabilistic mechanism that produces such pairs (x, y) according to a joint distribution p(x, y), which models both the characteristics of emails and the likelihood of being spam. Under the i.i.d. assumption, the dataset 𝒟 consists of independent samples from this distribution, providing the training data from which the model learns to classify new emails.

This probabilistic framing of the learning problem is fundamental in machine learning. It allows us to apply probabilistic tools and techniques to define learning objectives, analyze algorithm performance, and study the relationship between training error and generalization error, under the assumption that both the training and test data are sampled from the same distribution.

Some machine learning algorithms make explicit assumptions about the form of the data-generating distribution. For example, logistic regression assumes that the target label is drawn from a Bernoulli distribution, and Gaussian naive Bayes assumes that each feature is independently drawn from a normal distribution. Other algorithms avoid specifying a particular distribution and instead attempt to learn its underlying structure directly from the data.
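As a concrete sketch of the first case, the following Python fragment shows the distributional assumption behind logistic regression: given an input x, the label y is drawn from a Bernoulli distribution whose parameter is the sigmoid of a linear score. The weights here are hypothetical, chosen only for illustration, not learned from any data.

```python
import math
import random

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical parameters for a 2-feature model (illustration only).
w = [1.5, -2.0]
b = 0.25

def p_spam(x):
    """The Bernoulli parameter p(y = 1 | x) assumed by logistic regression."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(score)

def sample_label(x):
    """Draw y ~ Bernoulli(p(y = 1 | x)): the assumed generating step for y."""
    return 1 if random.random() < p_spam(x) else 0

print(round(p_spam([2.0, 0.5]), 4))  # sigmoid(2.25) ≈ 0.9047
```

The model does not describe how x itself arises; only the conditional distribution of y given x is specified, which is what makes logistic regression discriminative despite its distributional assumption on the label.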

In practice, however, the data-generating process may be complex, involving hidden variables or latent factors that are not directly observed. Additionally, the training dataset may not perfectly reflect the true data-generating distribution due to factors such as selection bias, measurement error, or missing data. When such mismatches occur, the model may fail to generalize well to new, unseen data.

2.3.1 Discriminative versus Generative Models

Depending on how they model the data-generating process, classification models are commonly grouped into two broad categories: discriminative models, which directly model the boundaries between classes, and generative models, which model the joint distribution of inputs and labels to capture how the data is generated (see Figure 2.5).


Figure 2.5: Discriminative models learn the decision boundaries between the classes, while generative models learn the input distributions in each class.

Discriminative models aim to find a function that discriminates between the different classes by either directly learning the relationship between the input features and the labels y = f(x), or by estimating the conditional probability distribution p(y|x), which provides the likelihood for each class given the input. These models focus on identifying the decision boundary that separates different classes in the input space, rather than modeling the distribution of the input data itself. Examples include logistic regression, k-nearest neighbors, and support vector machines (SVMs).

In contrast, generative models aim to model the data-generating process itself. They estimate the class-conditional distributions p(x|y) and the class prior p(y), capturing how the data is generated in each class. When used for classification, Bayes’ theorem (see Section C.8.5) can then be applied to compute the class posterior probabilities:

p(y | x) = p(x | y) p(y) / p(x)

Alternatively, generative models can estimate the joint probability distribution p(x, y) directly, and then normalize it (by dividing it by p(x)) to obtain the posterior probabilities p(y|x).
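As a toy illustration of this normalization step, consider a hypothetical joint distribution over a single binary feature and a binary label; the probabilities below are made up purely for illustration.

```python
# Hypothetical joint distribution p(x, y) over a binary feature x
# (e.g., "contains the word 'free'") and a binary label y (spam or not).
joint = {
    (0, 0): 0.55,  # p(x = 0, y = 0)
    (0, 1): 0.05,  # p(x = 0, y = 1)
    (1, 0): 0.15,  # p(x = 1, y = 0)
    (1, 1): 0.25,  # p(x = 1, y = 1)
}

def posterior(x):
    """p(y | x), obtained by dividing the joint by the marginal p(x)."""
    p_x = joint[(x, 0)] + joint[(x, 1)]  # marginal p(x)
    return {y: joint[(x, y)] / p_x for y in (0, 1)}

print(posterior(1))  # {0: 0.375, 1: 0.625}
```

Dividing by p(x) is exactly the normalization described above: it rescales the two joint probabilities for a given x so that they sum to one.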

For example, consider an image classification task, where we need to distinguish between images of dogs (y = 1) and images of cats (y = 0). A generative model would first build a model of what dogs look like p(x|y = 1) and what cats look like p(x|y = 0). Then, to classify a new image, it would compare the new image against both these models to determine if it more closely resembles the dog or the cat images seen during training.

In general, generative models provide more information than discriminative models, as they learn not only the class probabilities but also the input distribution. This allows them to be used both for classification tasks and for generating new data samples that resemble the training data. For example, in the previous scenario, we could generate new images of dogs by sampling from the learned distribution p(x|y = 1).

On the downside, generative models are more complex to build and train due to the difficulty of learning the densities p(x|y). For example, if the input vector x consists of d binary features, learning p(x|y) requires estimating on the order of 2^d conditional probabilities for each class. To mitigate this problem, some assumptions are usually made about the input distribution. For example, naive Bayes models (Chapter 7) assume that the features are conditionally independent given the class.
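A quick back-of-the-envelope comparison of the two parameter counts makes the gap concrete; the choice of d = 20 features is arbitrary.

```python
d = 20  # number of binary features (arbitrary choice for illustration)

# Full joint over d binary features: one probability per configuration of x,
# minus one because the probabilities must sum to 1 -- and this is per class.
full_params_per_class = 2 ** d - 1

# Naive Bayes: one Bernoulli parameter p(x_j = 1 | y) per feature, per class.
naive_params_per_class = d

print(full_params_per_class)   # 1048575
print(naive_params_per_class)  # 20
```

The conditional-independence assumption reduces the per-class parameter count from exponential to linear in d, which is what makes naive Bayes tractable even for high-dimensional inputs.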

More advanced machine learning models, such as neural networks, can function as both discriminative and generative models. Generative AI models, such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs), utilize deep neural network architectures for generating new content such as images, video, and audio.

2.3.2 Bayes Error

The ideal machine learning model would have complete knowledge of the true probability distribution generating the data. However, even such a model may still suffer from some level of error due to inherent noise in that distribution. This is because the mapping from x to y may be inherently stochastic, or y might depend on other variables that are not included in x. The error incurred by a model making predictions using the true distribution p(x, y) is referred to as the Bayes error or irreducible error. It represents the optimal error rate that could be achieved by any model, even with an unlimited number of data samples.


Figure 2.6: Coyote (left) and grey wolf (right). (Images: Denis Pepin/Shutterstock, left; Agnieszka Bacal/Shutterstock, right)

For example, imagine that we need to distinguish between wolves and coyotes based only on their height (see Figure 2.6). The shoulder height of adult grey wolves can be modeled by a normal distribution with a mean of 80 cm and a standard deviation of 8 cm, denoted by 𝒩 (80, 8). On the other hand, the shoulder height of coyotes can be modeled by a normal distribution with a mean of 60 cm and a standard deviation of 10 cm, denoted by 𝒩 (60, 10). Figure 2.7 shows the two distributions.


Figure 2.7: Probability density functions for the heights of coyotes and wolves. The overlapping region represents the Bayes error—the irreducible error that arises from the inherent ambiguity in distinguishing between the two species based on height alone.
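The point where the two density functions 𝒩(80, 8) and 𝒩(60, 10) cross can be found numerically. The sketch below uses only the Python standard library, applying bisection to the difference of the two densities between the two means, where exactly one crossing occurs.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of the normal distribution N(mu, sigma) at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def diff(x):
    # Difference between the wolf and coyote class-conditional densities.
    return normal_pdf(x, 80, 8) - normal_pdf(x, 60, 10)

# Bisection between the two means; diff changes sign exactly once there.
lo, hi = 60.0, 80.0
for _ in range(100):
    mid = (lo + hi) / 2
    if diff(lo) * diff(mid) <= 0:
        hi = mid
    else:
        lo = mid

x_star = (lo + hi) / 2
print(round(x_star, 3))  # 70.227
```

The second crossing of the two densities lies far out in the right tail (above 160 cm), where both densities are negligible, so restricting the search to the interval between the means is safe.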

Even a model with perfect knowledge of both distributions will still make errors due to the inherent overlap between them. Let x* be the intersection point of the two distributions (in this example x* = 70.227). The optimal classification model would classify any animal with a height lower than x* as a coyote and any animal with a height larger than x* as a wolf. Therefore, the Bayes error in this case is represented by the area where the two distributions overlap, which can be computed by integrating the conditional probability densities on either side of the intersection point:

Bayes error = ∫_{−∞}^{x*} p(x | wolf) p(wolf) dx + ∫_{x*}^{∞} p(x | coyote) p(coyote) dx

The first integral quantifies the probability of misclassifying a wolf as a coyote for all heights below x*, while the second integral quantifies the probability of misclassifying a coyote as a wolf for heights above x*.
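Carrying the computation through numerically gives a concrete value for this error. The sketch below uses only the Python standard library (the normal CDF via math.erf); equal class priors of 0.5 are assumed here, since the text does not specify the priors.

```python
import math

def normal_cdf(x, mu, sigma):
    """CDF of the normal distribution N(mu, sigma), via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2))))

x_star = 70.227  # intersection point of the two densities

# P(height < x* | wolf): the probability of misclassifying a wolf as a coyote.
p_wolf_below = normal_cdf(x_star, 80, 8)

# P(height > x* | coyote): the probability of misclassifying a coyote as a wolf.
p_coyote_above = 1.0 - normal_cdf(x_star, 60, 10)

# Assumed equal priors of 0.5 for each species (not specified in the text).
bayes_error = 0.5 * p_wolf_below + 0.5 * p_coyote_above
print(round(bayes_error, 3))  # ≈ 0.132
```

So even the optimal classifier misidentifies roughly 13% of animals under these assumptions; no model using height alone can do better.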

More generally, we can write the Bayes error as follows:

Bayes error = E_x[1 − max_y p(y | x)]

which represents the expected probability of misclassification under the optimal (Bayes) classifier. For each input x, the classifier selects the most probable class, and the error is the total probability of all other classes at that point.
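As a toy discrete illustration of this expectation, consider a hypothetical input taking three values with the made-up posteriors below; the Bayes error is the probability mass of the non-selected classes, averaged over x.

```python
# Hypothetical marginal p(x) over three discrete inputs.
p_x = {'a': 0.5, 'b': 0.3, 'c': 0.2}

# Hypothetical posteriors p(y | x) for two classes at each input.
p_y_given_x = {'a': [0.9, 0.1], 'b': [0.6, 0.4], 'c': [0.5, 0.5]}

# For each x, the Bayes classifier picks the most probable class,
# so its error at x is 1 - max_y p(y | x); average over p(x).
bayes_error = sum(p_x[x] * (1 - max(p_y_given_x[x])) for x in p_x)
print(bayes_error)  # 0.5*0.1 + 0.3*0.4 + 0.2*0.5 = 0.27
```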
