<aside>
🎯 Goal
- We have a generative process as follows: $\theta \rightarrow Z \rightarrow X$
- Here, $X$ is observed and $Z$ is latent
- Estimate $\theta$ that maximizes the likelihood of observed data.
</aside>
- Look at an example to better understand the problem
- Let’s look at more applications and map them to the same example above
- Why is this not just a maximum likelihood estimation problem?
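One way to see the answer to the last question (a sketch in my wording, using the notation that appears in the notes below): write out the plain maximum-likelihood objective for the observed data, which requires marginalizing over the latent $z$:

$$\log p(x_{1:n} ~|~ \theta_{1:k}) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(z_i = j)\, p(x_i ~|~ \theta_j)$$

The sum over $z_i$ sits inside the log, so the objective does not split into a separate problem per coin and has no closed-form maximizer; EM gets around this by working with an expected likelihood under a distribution on $z$, which is what the rest of this section builds up.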
<aside>
☝ Dear 434 students:
I was deliberately slow in our lecture yesterday (April 30) because I believe understanding the EM problem formulation is the most crucial part.
Why? Because only if you understand the formulation well will you be able to map EM to the various real-world problems that you face in industry or grad school.
So I suggest you brainstorm about other problems where you might be able to apply EM; this exercise will give you clarity on what EM is really capable of doing.
Ok, that said … I wanted to finish yesterday’s lecture in this email by telling you at a (very) high level how the algorithm works (the math is in the notes below).
- Computing likelihood is difficult because we don’t know what values the $z$ variable is taking.
Hence, we decided to make $z$ a random variable … and give probabilities to each possible assignment of $z$.
In other words, $z$ is no longer a one-hot vector; instead, it takes each possible value with some probability → it’s a distribution.
- If I give you the value of $z$, then you know which $\theta$ to use (for example, if I tell you $z_i = 5$, then you know that the 5th coin was tossed, so you know that you should use model parameter $\theta_5$).
This means that you can calculate the likelihood $p(x_i | \theta_{z_i})$ (in the example, $p(x_i | \theta_5)$).
Now, since I am giving you each value of $z_i$ with some probability (drawn from the distribution), you can compute the expected likelihood.
Essentially this expected likelihood $= p(z_i = 1)\, p(x_i | \theta_1) + p(z_i = 2)\, p(x_i | \theta_2) + \dots + p(z_i = k)\, p(x_i | \theta_k)$
- So two questions remain:
1/ What is the distribution on $z$?
2/ We don’t know the values of any $\theta$ … so what do we use even if someone tells us $z$?
- Ok, let’s start with some arbitrary values of $\theta$’s … you can pick any value as a starting point.
We will see over time that EM will update the values of $\theta$ towards the optimal.
- And for the distribution on $z$, we will compute the posterior distribution of $z$ … which is $p(z|x)$.
Observe that this is exactly like our balls and bins problem … where we were asking, given a red ball, what is the probability that it came from bin #1 or bin #2 and so on.
Similarly, here we are asking, given a toss outcome (heads or tails), what is the probability the coin that was used was the first coin, or second coin, and so on.
- This posterior can be computed since Bayes’ rule transforms it entirely into likelihoods and priors:
For balls and bins, we have: $p(bin_1 ~|~ red~ball ) = p(red~ball ~|~ bin_1) p(bin_1) / \sum_j p(red~ball ~|~ bin_j) p(bin_j)$
For our coin toss, we have: $p(z_i = 1 ~|~ x_i ) = \frac{p(x_i ~|~ z_i =1) p(z_i=1)}{ \sum_j p(x_i ~|~ z_i=j) p(z_i = j) }$
$p(z_i = 2 ~|~ x_i ) = \frac{p(x_i ~|~ z_i =2) p(z_i=2)}{ \sum_j p(x_i ~|~ z_i=j) p(z_i = j) }$
… and so on till …
$p(z_i = k ~|~ x_i ) = \frac{p(x_i ~|~ z_i =k) p(z_i=k)}{ \sum_j p(x_i ~|~ z_i=j) p(z_i = j) }$
- Ok, we have the posterior distribution on $z$ now, denoted as $q(z)$ … and hence we can compute the expected likelihood. We are in business.
- Now, let’s maximize the expected likelihood w.r.t. $\theta$.
In plain English, this maximization is asking: what biases for each coin will give us the maximum expected likelihood?
Said even differently: what biases of the coins best agree with the observed coin-toss outcomes ($x_{1:n}$)?
Let’s say this optimal $\theta$ for the first round is denoted as $\theta^{1}$.
- Ok, recap now:
- you started with some arbitrary guess of $\theta$ … let’s call it $\theta^0$ … and you computed the posterior from it, say $q^0(z)$…
- you then used this posterior $q^0(z)$ to compute the expected likelihood function …
- and then you maximized this expected likelihood w.r.t. $\theta$ to get more informed values of $\theta^1$
- Ok then, keep going ….
- Use $\theta^1$ to compute a better posterior $q^1(z)$.
- Use $q^1(z)$ to compute a new expected likelihood.
- Maximize this expected likelihood w.r.t. $\theta$ to get even better values of $\theta^2$.
- Repeat (a runnable sketch of this full loop appears right after this list).
- Stop the repeat when the values of $\theta$ are hardly changing any more, i.e., $| \theta^{t+1} - \theta^{t} | \leq \epsilon$
- This is when EM has reached convergence.
- Last step:
- Output $\theta_{1:k}$ as the last values of $\theta$’s that you have after convergence.
- Output for each $x_i$ the posterior distribution of $z_i$ … which you know.
If your boss says you have to give a one-hot vector for $z_i$, then pick the value of $z_i$ that has maximum probability in the posterior and output that.
YOU ARE DONE.
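To make the recap concrete, here is a minimal runnable sketch of the whole loop (my own illustration, not the course's code; every name, starting value, and the choice of a Binomial likelihood with $m$ tosses per data point are assumptions). It simulates data from $\theta \rightarrow Z \rightarrow X$, then alternates the E-step (compute the posterior $q(z)$) and the M-step (update $\theta$) until $| \theta^{t+1} - \theta^{t} | \leq \epsilon$.

```python
# Minimal EM sketch for the coin-mixture story above (my illustration, not the course code).
# Assumptions: k coins with unknown biases theta_1..theta_k, each x_i is the number of
# heads in m tosses of one hidden coin z_i (the notes' single-toss case is m = 1),
# and the prior p(z_i = j) is held fixed at 1/k because the notes maximize only w.r.t. theta.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# ---- simulate data from the generative process theta -> z -> x ----
true_theta = np.array([0.2, 0.5, 0.8])    # true coin biases (hidden from EM)
k, n, m = len(true_theta), 500, 10        # number of coins, data points, tosses per point
z_true = rng.integers(0, k, size=n)       # latent coin choice for each data point
x = rng.binomial(m, true_theta[z_true])   # observed heads counts x_1..x_n

# ---- EM ----
theta = np.array([0.3, 0.4, 0.6])         # arbitrary starting guess theta^0
prior = np.full(k, 1.0 / k)               # fixed p(z_i = j); could also be re-estimated

for t in range(200):
    # E-step: posterior q(z_i = j) = p(x_i | theta_j) p(z_i = j) / sum_l p(x_i | theta_l) p(z_i = l)
    lik = binom.pmf(x[:, None], m, theta[None, :])   # p(x_i | theta_j), shape (n, k)
    q = lik * prior
    q /= q.sum(axis=1, keepdims=True)

    # M-step: maximize the expected (log-)likelihood w.r.t. theta.
    # For coin likelihoods the maximizer has a closed form: a q-weighted fraction of heads.
    new_theta = (q * x[:, None]).sum(axis=0) / (m * q.sum(axis=0))

    if np.max(np.abs(new_theta - theta)) <= 1e-6:    # |theta^{t+1} - theta^t| <= eps
        theta = new_theta
        break
    theta = new_theta

print("estimated biases:", np.round(np.sort(theta), 3))   # compare against sorted true_theta
z_hat = q.argmax(axis=1)   # hard (one-hot style) assignment for each x_i, if required
```

The M-step here uses the closed-form maximizer of the expected log-likelihood for this coin model (a responsibility-weighted fraction of heads); for other likelihood families you would plug in whatever maximizes the expected log-likelihood, possibly numerically.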
p.s. I might have incorrectly written the LOTUS equation in class. Here is the correct LOTUS equation:
- say random variable $X$ is distributed as $f_X(x)$.
- say random variable $Z = g(X)$ … where $g(.)$ is some function.
- Then, LOTUS allows us to easily calculate the expected value of $Z$ as: $E[Z] = \int g(x) f_X(x) dx$
- This is relevant in EM because you can think of $g(.)$ as the likelihood function of $z$ and $f(.)$ as the posterior on $z$ …
- thus you can easily calculate the expected likelihood.
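In our setting $z$ is discrete, so the LOTUS integral becomes a sum (my restatement, with $q$ the posterior from above):

$$E_{z_i \sim q}[\, p(x_i ~|~ \theta_{z_i}) \,] = \sum_{j=1}^{k} q(z_i = j)\, p(x_i ~|~ \theta_j)$$

which is exactly the expected-likelihood expression from earlier, with the posterior $p(z_i ~|~ x_i)$ playing the role of $f(.)$.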
</aside>
- The insight in plain language
- Set up the likelihood function
- Expectation step → create the posterior distribution
- Maximization step → maximize the expectation of the likelihood (taken over the posterior on $Z$) w.r.t. $\theta$
- The final iteration
<aside>
📚 Tutorial by Dahua Lin (MIT): here
</aside>