 Introduction
 Traditional Derivation of EM
 An Entropy Interpretation
 DoublyStochastic EM
 Concluding Remarks
3.5 Concluding Remarks
This chapter has detailed the algorithmic and convergence property of the EM algorithm. The standard EM has also been extended to a more general form called doublystochastic EM. A number of numerical examples were given to explain the algorithm's operation. The following summarizes the EM algorithm:

EM offers an option of "soft" classification.

EM offers a "soft pruning" mechanism. It is important because features with low probability should not be allowed to unduly influence the training of class parameters.

EM naturally accommodates modelbased clustering formulation.

EM allows incorporation of prior information.

EM training algorithm yields probabilistic parameters that are instrumental for media fusion. For linearmedia fusion, EM plays a role in training the weights on the fusion layer. This will be elaborated on in subsequent chapters.
Problems

Assume that you are given a set of onedimensional data X = {0, 1, 2, 3, 4, 3, 4, 5}. Find two cluster centers using

the Kmeans algorithm

the EM algorithm
You may assume that the initial cluster centers are 0 and 5.


Compute the solutions of the singlecluster partialdata problem in Example 2 of Figure 3.3 with the following initial conditions:

In each iteration of the EM algorithm, the maximumlikelihood estimates of an Mcenter GMM are given by
where
X is the set of observed samples. are the maximum likelihood estimates of the last EM iteration, and
State the condition in which the EM algorithm reduces to the Kmeans algorithm.

You are given a set X = {x_{1},..., x_{T}} of T unlabeled samples drawn independently from a population whose density function is approximated by a GMM of the form
where are the means and covariance matrices of the component densities . Assume that π_{j} are known, θ_{j} and θ_{i}(i ≠ j) are functionally independent, and x_{t} are statistically independent.

Show that the maximumlikelihood estimate that maximizes L satisfies the conditions
where is the posterior probability that the ith cluster generates x_{t}.

Hence, show that if are known, the maximumlikelihood estimate , i = 1;...;M are given by

Hence, state the conditions where the equation in Problem 4c reduces to the Kmeans algorithm. State also the condition where the Kmeans algorithm and the equation in Problem 4c give similar solutions.

Based on the normal distribution
in Ddimensions, show that the mean vector and covariance matrix that maximize the loglikelihood function
are, respectively, given by
where X is a set of samples drawn from the population with distribution , N is the number of samples in X, and T denotes matrix transpose.
Hint: Use the derivatives
where A is a symmetric matrix.

Let p(x∑) ≡ N(μ, ∑) be a Ddimensional Gaussian density function with mean vector μ, and covariance matrix ∑. Show that if μ, is known and ∑ is unknown, the maximumlikelihood estimate for ∑ is given by

LBFType EM Methods: The fitness criterion for an RBFtype EM is determined by the closeness of a cluster member to a designated centroid of the cluster. As to its LBFtype counterpart, the fitness criterion hinges on the closeness of a subset of data to a linear plane (more exactly, hyperplane). In an exact fit, an ideal hyperplane is prescribed by a system of linear equations: Ax_{t} + b = 0, where x_{t} is a data point on the plane. When the data are approximated by the hyperplane, then the following fitness function
should approach zero. Sometimes, the data distribution can be better represented by more than one hyperplane. In a multiplane model, the data can be effectively represented by say N hyperplanes, which may be derived by minimizing the following LBF fitness function:
where h^{(}^{i}^{)}(x_{t}) is the membership probability satisfying ∑_{j} h^{(}^{j}^{)}(x_{t}) = 1. If h^{(}^{j}^{)}(x_{t}) implements a harddecision scheme, then h^{(j)}(x_{t}) = 1 for the cluster members only, otherwise h^{(}^{j}^{)}(x_{t}) = 0.

Compare the LBFbased formulation in Eq. 3.5.1 with the RBFbased formulation in Eq. 3.3.13.

Modify an RBFbased EM Matlab code so that it may be applicable to either RBFbased or LBFbased representation.


The following is a useful optimization formulation for the derivation of many EM algorithms. Given N known positive values u_{n}, where n = 1, . . . , N. The problem is to determine the unknown parameters w_{n} to maximize the criterion function
under the constraints w_{n} > 0 and .

Provide a mathematical proof that the optimal parameters have a closedform solution:
Hints: refer to Eqs. 3.2.23 through 3.2.26.

As an application example of the EM formulation in Eq. 3.2.10, what are the parameters corresponding to the known positive values u_{n} and the unknown positive values w_{n}. Hence, provide a physical meaning of the criterion function in Eq. 3.5.2.


As a numerical example of the preceding problem, given u_{1} = 3, u_{2} = 4, and u_{3} = 5, and denote x = u_{1}, y = u_{2}, and 1 − x − y = u_{3}, the criterion function can then be expressed as

Suppose that someone is going to train a GMM by the EM algorithm. Let T denote the number of patterns, M the number of mixtures, and D the feature dimension. Show that the orders of computational complexity (in terms of multiplications) for each epoch in the Estep is O(TMD + TM) and that in the Mstep is O(2TMD).

Assume that you are given the following observed data distribution:
Assume also that when EM begins, n = 0 and

Derive

Substituting X = {1, 2, 4, 8, 9} into your derivation to obtain the corresponding membership values.

Substitute the derived membership values into Eqs. 3.2.17 through 3.2.19, and compute the values of the new parameters θ_{1}.

To do more iterations, one can continue the algorithm by computing Q(θθi), which can again be substituted into Eqs. 3.2.17 through 3.2.19 to obtain θ_{2}. Perform the iterative process until it converges.


It is difficult to provide any definitive assurance on the convergence of the EM algorithm to a global optimal solution. This is especially true when the data vector space has a very high dimension. Fortunately, for many inherently offline applications, there is no pressure to produce results in realtime speed. For such applications, the adoption of a user interface to pinpoint a reasonable initial estimate could prove helpful. To facilitate a visualizationbased user interface, it is important to project from the original tspace (via a discriminant axis) to a twodimensional (or threedimensional) xspace [367, 370] (see Figure 2.7). Create a Matlab program to execute the following steps:

Project the data set onto a reduceddimensional xspace, via say PCA.

Select initial cluster centers in the xspace by user's pinpoint. Based on the userpinpointed membership, perform the EM algorithm in the xspace.

Calculate the values of AIC and MDL to select the number of clusters (see the next problem).

Trace the membership information back to the tspace, and use the membership function as the initial condition and further finetune the GMM clustering by the EM algorithm in the tspace.


One of the most important factors in data clustering is to select the proper number of clusters. Two prominent criteria for such selections are AIC and MDL [5, 314]. From the literature, find out the differences between AIC and MDL criteria. Do you have a preference and why?

Given a set of observed data X, develop Matlab code so that the estimated probability density p(x) can be represented in terms of a set of means and variances.

Use Matlab to create a threecomponent Gaussian mixture distribution with different means and variances for each Gaussian component. Ensure that there is some overlap among the distributions.

Use 2mean, 3mean, and Kmean algorithms to cluster the data.

Compute the likelihood of the true parameters (means and variances that define the three Gaussian components) and the likelihood of your estimates. Which of your estimates is closest to the true distribution in the maximumlikelihood sense?

Compute the symmetric divergence between the true Gaussian distributions and your estimates. Hint: Given two Gaussian distributions Λ_{j} and Λ_{k} with mean vectors μ_{j} and μ_{k} and covariance matrices ∑_{j} and ∑_{k}, their symmetric divergence is
where I is an identity matrix.

Repeat (a), (b), and (c) with the EM clustering algorithm.


Use Matlab to create a mixture density function with three Gaussian component densities. The prior probabilities should be as follows:

P(ω_{1}),) = 0.2, P(ω_{2}) = 0.3, and P(ω_{3}) = 0.5.

P(ω_{l}) = 0.1, P(ω_{2}) = 0.1, and P(ω_{3}) = 0.8.
Use 2mean, 3mean, and 4mean VQ algorithms. Compute the likelihood between the data distribution and your estimate. Repeat the problem with the EM clustering algorithm.
