 Introduction
 Traditional Derivation of EM
 An Entropy Interpretation
 DoublyStochastic EM
 Concluding Remarks
3.3 An Entropy Interpretation
The previous section has shown that the EM algorithm is a powerful tool in estimating the parameters of finitemixture models. This is achieved by iteratively maximizing the expectation of the model's completeddata likelihood function. The model's parameters, however, can also be obtained by maximizing an incompletedata likelihood function, leading to an entropy interpretation of the EM algorithm.
3.3.1 IncompleteData Likelihood
The optimal estimates are obtained by maximizing
Define
such that . ^{ [3] } Eq. 3.3.1 becomes
where the first two terms correspond to the Qterm in Eq. 3.2.4 and the second terms corresponds to the Rterm in Eq. 3.2.5. This means the maximization of L can be accomplished by maximizing the completeddata likelihood Q, as well as maximizing an entropy term R.
Now, define so that the likelihood L(Xθ) can be expressed as:
In Eq. 3.3.2, the following three terms have different interpretations:

The first term can be interpreted as the entropy term, which helps induce the membership's fuzziness.

The second term represents the prior information. For each sample x_{t}, this term grasps the influence (prior probability) of its neighboring clusters; the larger the prior probability, the larger the influence.

The third term is the observabledata term, where s(x_{t}, φ^{(}^{j}^{)}) represents the influence of the observable data x_{t} on the total likelihood L.
3.3.2 Simulated Annealing
To control the inference of the entropy terms and the prior information on the total likelihood, one can introduce a temperature parameter σ_{t} similar to simulated annealing; that is,
Maximization of
can be reformulated as the maximization of L_{t} under the constraint that
This is achieved by introducing a Lagrange multiplier λ such that
is to be maximized. To solve this constrained optimization problem, one needs to apply two different kinds of derivatives, as shown here:

that is,

Plugging Eq. 3.3.6 into Eq. 3.3.7 results in
Hence, the optimal membership (Eq. 3.3.6) for each data is
It is interesting to note that both Eqs. 3.3.9 and 3.2.14 have the same "marginalized" form. They can be connected by observing that in the case of mixtureofexperts. As an additional bonus, such a connection leads to a claim that the expectation of hiddenstates (Eq. 3.2.14) provides an optimal membership estimation.
The role of σ_{T} can be illustrated by Figure 3.6. For simplicity, only two clusters are considered and both π^{(1)} and π^{(2)} are initialized to 0.5 before the EM iterations begin. Refer to Figure 3.6(a), where the temperature σ_{T} is extremely high, there exists a major ambiguity between clusters 1 and 2 (i.e., they have almost equivalent probability). This is because according to Eq. 3.3.9, h^{(}^{j}^{)}(x_{t}) ≃ 0.5 at the first few EM iterations when σ_{T} → ∞. When σ_{T} decreases during the course of EM iterations, such ambiguity becomes more resolved—cf. Figure 3.6(b). Finally, when σ_{T} approaches zero, a total "certainty" is reached: the probability that either cluster 1 or 2 will approach 100%—cf. Figure 3.6(c). This can be explained by rewriting Eq. 3.3.9 in the following form (for the case J = 2 and j = 2):
Figure 3.6 This figure demonstrates how the temperature σT can be applied to control the convergence of the EM algorithm.
In Eq. 3.3.10, when σ_{T}→ 0 and s(x_{t}, φ^{(2)}) > s(x_{t}, φ^{(1)}), h^{(}^{2}^{)}(x_{t}) ≃ 1.0, and h^{(}^{1}^{)} (x_{t}) ≃ 0.0. This means that x_{t} is closer to cluster 2 than to cluster 1. Similarly, h^{(2)}(x_{t}) ≃ 0.0 and h^{(1)}(x_{t}) ≃ 1.0 when s(x_{t},φ^{(2)}) < s(x_{t},φ^{(1)}). Therefore, Eq. 3.3.10 suggests that when σ_{T} → 0, there is a harddecision clustering (i.e., with cluster probabilities equal to either 1 or 0). This demonstrates that σ_{T} plays the same role as the temperature parameter in the simulated annealing method. It is a common practice to use annealing temperature schedules to force a more certain classification (i.e., starting with a higher σ_{T} and then gradually decreasing σ_{T} to a lower value as iterations progress).
3.3.3 EM Iterations
Next, the optimization formulation described in Section 3.2 is slightly modified (but causes no net effect). The EM problem can be expressed as one that maximizes L with respect to both (1) the model parameters θ = {θ^{(j)}, ∀j}and (2) the membership function {h^{(}^{j}^{)}(xt),∀t and j}. The interplay of these two sets of variables can hopefully induce a bootstrapping effect facilitating the convergence process. The list that follows further elaborates on this.

In the Estep, while fixing the model parameter {θ = {θ^{(j)}, ∀j}, one can find the best cluster probability h^{(}^{j}^{)}(x_{t}) to optimize L with the constraint , which gave Eq. 3.3.9.

In the Mstep, one searches for the best model parameter θ = {θ^{(j)}, ∀j} that optimizes L, while fixing the cluster probability h^{(}^{j}^{)}(x_{t}), ∀t and j.
3.3.4 Special Case: GMM
When θ defines a GMM, s(x_{t}, φ^{(}^{j}^{)}) becomes
Ignoring terms independent of h^{(j)}(x_{t}), μ^{(j)} Σ^{(j)}_{,} and π^{(}^{j}^{)}, the likelihood function in Eq. 3.3.2 can be rewritten as:
Note that the maximization of Eq. 3.3.12 with respect to θ leads to the same maximum likelihood estimtates as shown in Section 3.2.4.
For RBF and EBFtype likelihood functions, the parameters that maximize s(x_{t}, φ^{(}^{j}^{)}) can be obtained analytically (see Section 3.2.4), which simplifies the optimization process. On the other hand, if a linear model (e.g. LBF) is chosen to parameterize the likelihood, an iterative method is needed to achieve the optimal solutions in the Mstep. In other words, the EM algorithm becomes a doubleloop optimization known as the generalized EM. For example, Jordan and Jacobs [168] applied a Fisher scoring method called iteratively reweighted least squares (IRLS) to train the LBF mixtureofexperts network.
3.3.5 KMeans versus EM
Kmeans [85] and VQ [118] are often used interchangeably: They classify input patterns based on the nearestneighbor rule. The task is to cluster a given data set X = {x_{t}; t = 1,..., T} into K groups, each represented by its centroid denoted by μ^{(j)}_{,} j = 1, . . . , K. The nearestneighbor rule assigns a pattern x to the class associated with its nearest centroid, say μ^{(}^{i}^{)}). Kmeans and VQ have simple learning rules and the classification scheme is straightforward. In Eq. 3.3.12, when h^{(j)}(x_{t}) implements a harddecision scheme—that is, h^{(j)} (x_{t}) = 1 for the members only, otherwise h^{(j)}(x_{t}) = 0—and ∑^{(}^{j}^{)} = c^{2} I b∀j, where c is a constant and I is an identity matrix, the maximization of Eq. 3.3.12 reduces to the minimization of
Therefore, the Kmeans algorithm aims to minimize the sum of squared error with K clusters.
The EM scheme can be seen as a generalized version of Kmeans clustering. In other words, Kmeans clustering is a special case of the EM scheme (cf. Figure 3.2). Table 3.3 summarizes the kinds of learning algorithms that the EM formulation Eq. 3.3.12 can produce.
Table 3.3. Learning algorithms as a result of optimizing Eq. 3.3.12 using different kernel types and decision types. RBF and EBF stand for radial basis functions and elliptical basis functions, respectively. Note that EM types of learning occur whenever the decisions in h^{(j)}(x_{t}) are soft.
Kernel Type 
∑^{(j)} 
h^{(j)}(x_{t}) 
Learning Algorithm 

RBF 
Diagonal 
Hard 
Kmeans with Euclidean distance 
Soft 
EM with Euclidean distance 

EBF 
Nondiagonal, symmetric 
Hard 
Kmeans with Mahalanobis distance 
Soft 
EM with Mahalanobis distance 