Expectation-Maximization Theory
3.1 Introduction
Learning networks are commonly categorized as supervised and unsupervised networks. In unsupervised learning, the training set consists of input training patterns only. In contrast, in supervised learning networks, the training data consist of many pairs of input/output patterns. Therefore, the learning process can benefit greatly from the teacher's assistance. In fact, the amount of adjustment of the updating coefficients often depends on the difference between the desired teacher value and the actual response. As demonstrated in Chapter 5, many supervised learning models have been found to be promising for biometric authentication; their implementation often hinges on an effective data-clustering scheme, which is perhaps the most critical component in unsupervised learning methods. This chapter addresses a data-clustering algorithm, called the expectation-maximization (EM) algorithm, for cases in which complete or only partial information about the observed data is available.
3.1.1 K-Means and VQ Algorithms
An effective data-clustering algorithm is known as K-means [85], which is very similar to another clustering scheme known as the vector quantization (VQ) algorithm [118]. Both methods classify data patterns based on the nearest-neighbor criterion.
Stated verbally, the problem is to cluster a given data set $X = \{x_t;\ t = 1, \ldots, T\}$ into $K$ groups, each represented by its centroid $\mu^{(k)}$, $k = 1, \ldots, K$. The task is (1) to determine the $K$ centroids $\{\mu^{(1)}, \mu^{(2)}, \ldots, \mu^{(K)}\}$ and (2) to assign each pattern $x_t$ to one of the centroids. The nearest-neighbor rule assigns a pattern $x$ to the class associated with its nearest centroid, say $\mu^{(k)}$.
Mathematically speaking, one denotes the centroid associated with $x_t$ as $\mu_t$, where $\mu_t \in \{\mu^{(1)}, \mu^{(2)}, \ldots, \mu^{(K)}\}$. The objective of the K-means algorithm is then to minimize the following sum of squared errors:
$$E = \sum_{t=1}^{T} \left\| x_t - \mu_t \right\|^2,$$
where $\|\cdot\|$ is the Euclidean norm.
Let $X_k$ denote the set of data patterns associated with the $k$th cluster, with centroid $\mu^{(k)}$, and let $N_k$ denote the number of patterns in the cluster $X_k$, where $k = 1, \ldots, K$. The learning rule of the K-means algorithm consists of the following two basic steps.

1. Determine the membership of a data pattern: a pattern $x_t$ is assigned to the cluster with the nearest centroid, that is,
$$x_t \in X_k \quad \text{if} \quad \left\| x_t - \mu^{(k)} \right\| \le \left\| x_t - \mu^{(j)} \right\| \quad \text{for all } j \ne k.$$

2. Update the representation of the cluster: In a clustering process, the inclusion (or removal) of a new pattern in a cluster (or from a cluster) affects the representation (e.g., the centroid or variance) of the cluster. Therefore, the centroid should be updated based on the new membership:
$$\mu^{(k)} = \frac{1}{N_k} \sum_{x_t \in X_k} x_t.$$

Sometimes, the variance of the data cluster is also of great interest (e.g., in Gaussian mixture models). In this case, the variance can be computed as
$$\Sigma^{(k)} = \frac{1}{N_k} \sum_{x_t \in X_k} \left( x_t - \mu^{(k)} \right) \left( x_t - \mu^{(k)} \right)^{\mathsf{T}}.$$
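To make these two steps concrete, here is a minimal K-means sketch in Python/NumPy. It is an illustrative implementation only; the initialization, the convergence test, and the variable names are choices made for this sketch rather than details given in the text.

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: X is a (T, d) array of patterns, K the number of clusters."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=K, replace=False)]  # initialize with K random patterns
    for _ in range(n_iter):
        # Step 1: membership by the nearest-neighbor rule (hard assignment).
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Step 2: update each centroid as the mean of its cluster X_k.
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    # Per-cluster covariance, useful when the clusters later seed a GMM.
    covs = [np.cov(X[labels == k].T, bias=True) for k in range(K)]
    return centroids, labels, covs
```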
3.1.2 Gaussian Mixture Model
The EM scheme can be seen as a generalized version of K-means clustering. The main difference hinges on the notion of a hard-versus-soft membership. A hard membership is adopted in the K-means algorithm (i.e., a data pattern is assigned to one cluster only). This is not the case with the EM algorithm, where a soft membership is adopted (i.e., the membership of each data pattern can be distributed over multiple clusters).
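As a small numerical illustration of soft membership (the numbers below are made up for this example and are not from the text), consider a one-dimensional pattern lying between two Gaussian clusters. Its soft membership is the posterior probability of each cluster, whereas K-means would commit to a single hard label:

```python
import numpy as np
from scipy.stats import norm

x = 0.5                                  # a pattern lying between the two cluster means
priors = np.array([0.5, 0.5])            # prior probability of each cluster
means = np.array([0.0, 2.0])
stds = np.array([1.0, 1.0])

joint = priors * norm.pdf(x, means, stds)  # P(cluster) * p(x | cluster)
soft = joint / joint.sum()                 # soft membership, sums to 1 (here roughly [0.73, 0.27])
hard = np.eye(2)[joint.argmax()]           # K-means-style hard membership: [1, 0]
```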
The necessity of using a distributed (i.e., soft) membership is most conspicuous for a Gaussian mixture model (GMM). Given a set of $N$ independent and identically distributed patterns $X^{(i)} = \{x_t;\ t = 1, 2, \ldots, N\}$ associated with class $\omega_i$, the likelihood function $p(x_t|\omega_i)$ for class $\omega_i$ is a mixture of Gaussian distributions; that is,
$$p(x_t|\omega_i) = \sum_{r=1}^{R} P(\Theta_{r|i}|\omega_i)\, p(x_t|\omega_i, \Theta_{r|i}),$$
where $\Theta_{r|i}$ represents the parameters of the $r$th mixture component; $R$ is the total number of mixture components; $p(x_t|\omega_i, \Theta_{r|i}) \equiv \mathcal{N}(x; \mu_{r|i}, \Sigma_{r|i})$ is the probability density function of the $r$th component; and $P(\Theta_{r|i}|\omega_i)$ is the prior probability of the $r$th component. Typically, $\mathcal{N}(x; \mu_{r|i}, \Sigma_{r|i})$ is a Gaussian distribution with mean $\mu_{r|i}$ and covariance $\Sigma_{r|i}$.
In short, the output of a GMM is the weighted sum of $R$ component densities. The training of GMMs can be formulated as a maximum-likelihood problem, where the mean vectors $\{\mu_{r|i}\}$, covariance matrices $\{\Sigma_{r|i}\}$, and mixture coefficients $\{P(\Theta_{r|i}|\omega_i)\}$ are often estimated by the iterative EM algorithm, the main topic of the current chapter.
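As a rough sketch of how these quantities are re-estimated in practice, the loop below alternates between computing soft memberships (the E-step) and updating the mixture coefficients, means, and covariances (the M-step) for a single class. The initialization, the fixed iteration count, and the small regularization term are simplifications chosen for this sketch, not prescriptions from the text; Sections 3.2 and 3.3 derive the updates properly.

```python
import numpy as np
from scipy.stats import multivariate_normal

def gmm_em(X, R, n_iter=50, seed=0):
    """EM for an R-component GMM; X is an (N, d) array of patterns of one class."""
    N, d = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, size=R, replace=False)]            # component means
    cov = np.array([np.cov(X.T) + 1e-6 * np.eye(d)] * R)    # component covariances
    w = np.full(R, 1.0 / R)                                  # mixture coefficients
    for _ in range(n_iter):
        # E-step: responsibility (soft membership) of each component for each pattern.
        dens = np.column_stack(
            [w[r] * multivariate_normal.pdf(X, mu[r], cov[r]) for r in range(R)]
        )
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate mixture coefficients, means, and covariances.
        Nr = resp.sum(axis=0)
        w = Nr / N
        mu = (resp.T @ X) / Nr[:, None]
        for r in range(R):
            diff = X - mu[r]
            cov[r] = (resp[:, r, None] * diff).T @ diff / Nr[r] + 1e-6 * np.eye(d)
    return w, mu, cov
```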
3.1.3 Expectation-Maximization Algorithm
The expectation-maximization (EM) algorithm is an ideal candidate for solving parameter estimation problems for the GMM or other neural networks. In particular, EM is applicable to problems where the observable data provide only partial information or where some data are "missing"; see Figure 3.1(a). Another important class of parameter estimation problems that can be addressed by EM involves a mixture-of-experts; see Figure 3.1(b). In this class of problems, there are two categories of unknown parameters: one pertaining to the membership function of an expert (or cluster) and the other consisting of the unknown parameters defining individual experts. Take the Gaussian mixture model shown in Figure 3.1(b) as an example, where $\pi^{(j)}$ denotes the prior probability of expert $j$ and $\phi^{(j)} = \{\mu^{(j)}, \Sigma^{(j)}\}$ denotes the parameters (mean and variance) of that expert. This chapter explains why the EM method can serve as a powerful tool for estimating these parameters. It also demonstrates how the EM algorithm can be applied to data clustering.
Figure 3.1 Parameter estimation by EM. (a) EM for general missing-data problems, where $\theta$ denotes the nonstructural parameters to be estimated and $Z$ is the set of missing data. (b) EM for hidden-state problems in which the parameter $\theta$ can be divided into two groups, $\{\pi^{(j)}\}$ and $\{\phi^{(j)}\}$, where $\pi^{(j)}$ represents the prior probability of the $j$th expert and $\phi^{(j)}$ defines the density function associated with the $j$th expert.
The EM algorithm is a very general parameter estimation method in that it is applicable to many statistical models, for example, mixture-of-experts (MOE), Gaussian mixture models (GMMs), and vector quantization (VQ). Figure 3.2 depicts the relationship among EM, MOE, GMM, and VQ. In particular, the figure highlights the fact that VQ is a special case of GMM, which in turn is a special case of the more general mixture-of-experts. More important, EM is applicable to all of these models.
Figure 3.2 Diagram depicting the relationship among EM, MOE, GMM, VQ, and the class of problems known as missing- and partial-data problems.
The classic EM algorithm dates back to Dempster, Laird, and Rubin's paper in 1977 [74]. It is a special kind of quasi-Newton algorithm whose search direction has a positive projection on the gradient of the log-likelihood. Each EM iteration consists of two steps: Expectation (E) and Maximization (M). The M-step maximizes a likelihood function that is refined in each iteration by the E-step. Interested readers can refer to the references [74, 168, 297, 350].
One important feature of the EM algorithm is that it can be applied to problems in which the observed data provide only "partial" information, or in which artificially introducing some information (referred to hereafter as "hidden-state" information) can greatly simplify the parameter estimation process. Figure 3.3 illustrates the concept of hidden and partial data. In Figure 3.3(a), all data ($x_1$ to $x_7$) are known.
Figure 3.3 One-dimensional example illustrating the concept of (a) hidden-state, (b) partial-data, and (c) combined partial-data and hidden-state problems. In (a) the information regarding the cluster membership of $x_t$ is hidden; in (b) $y$ is partial in that its exact value is unknown; and in (c) the data $x_t$ provide partial information only because none of their exact values are known. The cluster membership information is also hidden.
Let's assume that there are two clusters in the observed data. Although all data constituting the two clusters are observable, one does not know exactly to which cluster each datum belongs. Lacking this hidden membership information results in a complicated parameter estimation procedure. The estimation procedure, however, can be greatly simplified if this membership information is assumed to be known. For example, if the cluster identities of $x_1$ to $x_7$ in Figure 3.3(a) were known, finding the cluster means reduces to computing the mean of each cluster separately. Figure 3.3(b) illustrates the idea of partial data. Unlike Figure 3.3(a), the partial-data problem in Figure 3.3(b) contains an uncertain datum $y$, which can be equal to either 5.0 or 6.0. As a result, the true value of $y$ is unobservable, whereas those of $x_1$ to $x_4$ are observable. The EM algorithm can solve this partial-data problem effectively by computing the expected value of $y$. Figure 3.3(c) illustrates the case in which cluster membership information is hidden and only partial information is available. This problem can be viewed as a generalization of the problems in Figures 3.3(a) and 3.3(b). A new type of EM, called doubly-stochastic EM, is derived in Section 3.4 to address this kind of general problem. Numerical solutions for the problems in Figure 3.3 are provided in later sections.
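For the partial-data case in Figure 3.3(b), the E-step of EM replaces the uncertain value $y$ by its expected value under the current parameter estimates. A minimal numerical sketch follows; the Gaussian parameters used as the "current" estimates are assumed purely for illustration and are not taken from the text.

```python
import numpy as np
from scipy.stats import norm

candidates = np.array([5.0, 6.0])   # y is known only to be one of these two values
mu_hat, sigma_hat = 4.0, 1.5        # assumed current estimates of the cluster parameters

# Posterior probability of each candidate, then the expected value of y used in the E-step.
post = norm.pdf(candidates, mu_hat, sigma_hat)
post /= post.sum()
y_expected = float(post @ candidates)   # closer to 5.0, the more probable candidate
```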
The concepts of hidden and partial data have been applied to many scientific and engineering applications. For instance, in digital communication, the receiver receives a sequence consisting of +1's and −1's without knowing which bit in the sequence is a +1 and which is a −1. In such cases, the state of each bit constitutes the missing information. In biometric applications, an MOE is typically applied to model the features of an individual, with each expert designed to model some of the user-specific features. In such cases, the contribution of the individual experts constitutes the hidden information.
EM has been shown to offer favorable properties, including automatic satisfaction of constraints and fast, reliable convergence. The next section explains the traditional approach to deriving the EM algorithm and proving its convergence property. Section 3.3 covers the interpretation of the EM algorithm as the maximization of two quantities: the entropy and the expectation of the complete-data likelihood. The K-means algorithm and the EM algorithm are then compared, and the conditions under which the EM algorithm reduces to K-means are explained. The discussion in Section 3.4 generalizes the EM algorithm described in Sections 3.2 and 3.3 to problems with partial data and hidden states; we refer to this new type of EM as the doubly-stochastic EM. Finally, the chapter is concluded in Section 3.5.