<aside>
🎯 Goal
- We have a generative process as follows: $\theta \rightarrow Z \rightarrow X$
- Here, $X$ is observed and $Z$ is latent
- Estimate $\theta$ that maximizes the likelihood of observed data.
</aside>
- Look at an example to better understand the problem
- Let’s look at more applications and map them to the same example above
- Why is this not just a maximum likelihood estimation problem?
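One way to see the answer to the last question (a sketch in my wording, using the notation that appears in the notes below): write out the plain maximum-likelihood objective for the observed data, which requires marginalizing over the latent $z$:

$$\log p(x_{1:n} ~|~ \theta_{1:k}) = \sum_{i=1}^{n} \log \sum_{j=1}^{k} p(z_i = j)\, p(x_i ~|~ \theta_j)$$

The sum over $z_i$ sits inside the log, so the objective does not split into a separate problem per coin and has no closed-form maximizer; EM gets around this by working with an expected likelihood under a distribution on $z$, which is what the rest of this section builds up.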
<aside>
☝ Dear 434 students:
I was deliberately slow in our lecture yesterday (April 30) because I believe understanding the EM problem formulation is the most crucial part.
Why? Because only if you understand the formulation well will you be able to map EM to the various real-world problems that you face in industry or grad school.
So I suggest you brainstorm about other problems where you might be able to apply EM; this exercise will give you clarity on what EM is really capable of doing.
Ok, that said … I wanted to finish yesterday’s lecture in this email by telling you at a (very) high level how the algorithm works (the math is in the notes below).
- Computing likelihood is difficult because we don’t know what values the $z$ variable is taking.
Hence, we decided to make $z$ a random variable … and give probabilities to each possible assignment of $z$.
In other words, $z$ is no longer a one-hot vector; instead, it takes each possible value with some probability → it’s a distribution.
- If I give you the value of $z$, then you know which $\theta$ to use (for example, if I tell you $z_i = 5$, then you know that the 5th coin was tossed, so you know that you should use model parameter $\theta_5$).
This means that you can calculate the likelihood $p(x_i | \theta_{z_i})$ (in the example, $p(x_i | \theta_5)$).
Now, since I am giving you each value of $z_i$ with some probability (drawn from the distribution), you can compute the expected likelihood.
Essentially this expected likelihood $= p(z_i = 1)\, p(x_i | \theta_1) + p(z_i = 2)\, p(x_i | \theta_2) + \dots + p(z_i = k)\, p(x_i | \theta_k)$
- So two questions remain:
1/ What is the distribution on $z$?
2/ We don’t know the values of any $\theta$ … so what do we use even if someone tells us $z$?
- Ok, let’s start with some arbitrary values of $\theta$’s … you can pick any value as a starting point.
We will see over time that EM will update the values of $\theta$ towards the optimal.
- And for the distribution on $z$, we will compute the posterior distribution of $z$ … which is $p(z|x)$.
Observe that this is exactly like our balls and bins problem … where we were asking, given a red ball, what is the probability that it came from bin #1 or bin #2 and so on.
Similarly, here we are asking, given a toss outcome (heads or tails), what is the probability the coin that was used was the first coin, or second coin, and so on.
- This posterior can be computed since Bayes’ rule transforms it entirely into likelihoods and priors:
For balls and bins, we have: $p(bin_1 ~|~ red~ball ) = p(red~ball ~|~ bin_1) p(bin_1) / \sum_j p(red~ball ~|~ bin_j) p(bin_j)$
For our coin toss, we have: $p(z_i = 1 ~|~ x_i ) = \frac{p(x_i ~|~ z_i =1) p(z_i=1)}{ \sum_j p(x_i ~|~ z_i=j) p(z_i = j) }$
$p(z_i = 2 ~|~ x_i ) = \frac{p(x_i ~|~ z_i =2) p(z_i=2)}{ \sum_j p(x_i ~|~ z_i=j) p(z_i = j) }$
… and so on till …
$p(z_i = k ~|~ x_i ) = \frac{p(x_i ~|~ z_i =k) p(z_i=k)}{ \sum_j p(x_i ~|~ z_i=j) p(z_i = j) }$
- Ok, we have the posterior distribution on $z$ now, denoted as $q(z)$ … and hence we can compute the expected likelihood. We are in business.
- Now, let’s maximize the expected likelihood w.r.t. $\theta$.
In plain English, this maximization is asking: what biases for each coin will give us the maximum expected likelihood?
Said even differently: what biases of the coins best agree with the observed coin-toss outcomes ($x_{1:n}$)?
Let’s say this optimal $\theta$ for the first round is denoted as $\theta^{1}$.
- Ok, recap now:
- you started with some arbitrary guess of $\theta$ … let’s call it $\theta^0$ … and you computed the posterior from it, say $q^0(z)$…
- you then used this posterior $q^0(z)$ to compute the expected likelihood function …
- and then you maximized this expected likelihood w.r.t. $\theta$ to get more informed values of $\theta^1$
- Ok then, keep going ….
- Use $\theta^1$ to compute a better posterior $q^1(z)$.
- Use $q^1(z)$ to compute a new expected likelihood.
- Maximize this expected likelihood w.r.t. $\theta$ to get even better values of $\theta^2$.
- Repeat (a runnable sketch of this full loop appears right after this list).
- Stop the repeat when the values of $\theta$ are hardly changing any more, i.e., $| \theta^{t+1} - \theta^{t} | \leq \epsilon$
- This is when EM has reached convergence.
- Last step:
- Output $\theta_{1:k}$ as the last values of $\theta$’s that you have after convergence.
- Output for each $x_i$ the posterior distribution of $z_i$ … which you know.
If your boss says you have to give a one-hot vector for $z_i$, then pick the value of $z_i$ that has maximum probability in the posterior and output that.
YOU ARE DONE.
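To make the recap concrete, here is a minimal runnable sketch of the whole loop (my own illustration, not the course's code; every name, starting value, and the choice of a Binomial likelihood with $m$ tosses per data point are assumptions). It simulates data from $\theta \rightarrow Z \rightarrow X$, then alternates the E-step (compute the posterior $q(z)$) and the M-step (update $\theta$) until $| \theta^{t+1} - \theta^{t} | \leq \epsilon$.

```python
# Minimal EM sketch for the coin-mixture story above (my illustration, not the course code).
# Assumptions: k coins with unknown biases theta_1..theta_k, each x_i is the number of
# heads in m tosses of one hidden coin z_i (the notes' single-toss case is m = 1),
# and the prior p(z_i = j) is held fixed at 1/k because the notes maximize only w.r.t. theta.
import numpy as np
from scipy.stats import binom

rng = np.random.default_rng(0)

# ---- simulate data from the generative process theta -> z -> x ----
true_theta = np.array([0.2, 0.5, 0.8])    # true coin biases (hidden from EM)
k, n, m = len(true_theta), 500, 10        # number of coins, data points, tosses per point
z_true = rng.integers(0, k, size=n)       # latent coin choice for each data point
x = rng.binomial(m, true_theta[z_true])   # observed heads counts x_1..x_n

# ---- EM ----
theta = np.array([0.3, 0.4, 0.6])         # arbitrary starting guess theta^0
prior = np.full(k, 1.0 / k)               # fixed p(z_i = j); could also be re-estimated

for t in range(200):
    # E-step: posterior q(z_i = j) = p(x_i | theta_j) p(z_i = j) / sum_l p(x_i | theta_l) p(z_i = l)
    lik = binom.pmf(x[:, None], m, theta[None, :])   # p(x_i | theta_j), shape (n, k)
    q = lik * prior
    q /= q.sum(axis=1, keepdims=True)

    # M-step: maximize the expected (log-)likelihood w.r.t. theta.
    # For coin likelihoods the maximizer has a closed form: a q-weighted fraction of heads.
    new_theta = (q * x[:, None]).sum(axis=0) / (m * q.sum(axis=0))

    if np.max(np.abs(new_theta - theta)) <= 1e-6:    # |theta^{t+1} - theta^t| <= eps
        theta = new_theta
        break
    theta = new_theta

print("estimated biases:", np.round(np.sort(theta), 3))   # compare against sorted true_theta
z_hat = q.argmax(axis=1)   # hard (one-hot style) assignment for each x_i, if required
```

The M-step here uses the closed-form maximizer of the expected log-likelihood for this coin model (a responsibility-weighted fraction of heads); for other likelihood families you would plug in whatever maximizes the expected log-likelihood, possibly numerically.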
p.s. I might have incorrectly written the LOTUS equation in class. Here is the correct LOTUS equation:
- say random variable $X$ is distributed as $f_X(x)$.
- say random variable $Z = g(X)$ … where $g(.)$ is some function.
- Then, LOTUS allows us to easily calculate the expected value of $Z$ as: $E[Z] = \int g(x) f_X(x) dx$
- This is relevant in EM because you can think of $g(.)$ as the likelihood function of $z$ and $f(.)$ as the posterior on $z$ …
- thus you can easily calculate the expected likelihood.
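In our setting $z$ is discrete, so the LOTUS integral becomes a sum (my restatement, with $q$ the posterior from above):

$$E_{z_i \sim q}[\, p(x_i ~|~ \theta_{z_i}) \,] = \sum_{j=1}^{k} q(z_i = j)\, p(x_i ~|~ \theta_j)$$

which is exactly the expected-likelihood expression from earlier, with the posterior $p(z_i ~|~ x_i)$ playing the role of $f(.)$.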
</aside>
- The insight in plain language
- Set up the likelihood function
- Expectation step → create the posterior distribution
- Maximization step → maximize the expectation of the likelihood (taken over the posterior on $Z$) w.r.t. $\theta$
- The final iteration
<aside>
📚 Tutorial by Dahua Lin (MIT): here
</aside>