Gibbs Sampling for the Uninitiated
(Resnik and Hardisty, 2010)
Kevin Knight’s tutorial on Bayesian methods is one of the most approachable, human pieces of writing to describe a highfalutin concept. This technical report from the University of Maryland at College Park applies that style to Gibbs sampling. It’s less afraid to introduce maths, and it still grounds the problems in our intuitions as NLP researchers.
The first idea that we’re exposed to is the difference between maximum likelihood estimation (MLE) and maximum a posteriori estimation (MAP). The latter seeks the argmax
of the posterior, which is equivalent to the argmax of the likelihood times the prior. Bayes’s rule gives this relationship.
\[
\hat{\pi}_{MLE} = \arg\max_\pi \Pr(\mathcal{X} | \pi)
\]
\[
\Pr(y | \mathcal{X}) \approx \Pr(y | \hat{\pi}_{MLE})
\]
\[
\hat{\pi}_{MAP} = \arg\max_\pi \Pr(\mathcal{X} | \pi) \Pr(\pi)
\]
\[
\Pr(y | \mathcal{X}) \approx \Pr(y | \hat{\pi}_{MAP})
\]
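To make the difference concrete, here's a toy sketch of my own (not from the TR): estimating a coin's bias \(\pi\) from a handful of flips, with a Beta prior standing in for \(\Pr(\pi)\).

import numpy as np

flips = np.array([1, 1, 1, 0, 1])  # hypothetical data: 4 heads, 1 tail
heads, n = flips.sum(), len(flips)

# MLE: argmax of the likelihood alone.
pi_mle = heads / n  # 0.8

# MAP: argmax of likelihood times prior. With a Beta(a, b) prior, the posterior
# is Beta(heads + a, n - heads + b), and its mode has a closed form.
a, b = 2.0, 2.0  # hypothetical pseudo-count hyper-parameters
pi_map = (heads + a - 1) / (n + a + b - 2)  # 5/7, pulled toward the prior's 0.5

The prior acts like a few imaginary extra flips, which is why the MAP estimate sits between the data's 0.8 and the prior's 0.5.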
Next, we’re introduced to the question of sampling. Why do we do it? Because most integrals are difficult to calculate, most probability distributions of interest are hard to compute exactly. Instead, we can get something asymptotically close to \(\Pr(\pi | \mathcal{X})\). We want, for whatever reason, to sample this using Markov chains. The idea is that the next sample will depend only on some transition probability and the previous state. (This is also how bigram language models work.)
The Gibbs sampling process looks like this:
# k is the number of variables.
# One row for the initial state plus one for each of T samples; each sample z[t] is a vector of k values.
z = np.empty((T + 1, k))
z[0, :] = z_init()
for t in range(T):
    for i in range(k):
        # Use the values that were already sampled this round
        # and the old values for variables that haven't been sampled yet.
        z[t+1, i] ~ P(Z_i | concat(z[t+1, 0:i], z[t, i+1:k]))
(There’s a minor error in the TR’s formula, and I nearly hit send on an email to Dr. Resnik. Then I double-checked, and there’s already an errata page listing this problem.)
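To convince myself the loop above actually does something, here's a minimal runnable version with the placeholder conditional filled in. This is my own toy, not the TR's example: the target is a standard bivariate normal with correlation \(\rho\), whose conditionals are \(N(\rho z_{\textrm{other}}, 1 - \rho^2)\).

import numpy as np

rng = np.random.default_rng(0)
T, rho = 5000, 0.8
z = np.empty((T + 1, 2))
z[0, :] = [0.0, 0.0]  # a reasonable starting point

for t in range(T):
    # Conditionals of the bivariate normal: Z_0 | Z_1 = y  ~  N(rho * y, 1 - rho^2),
    # and symmetrically for Z_1. Note that Z_1 is drawn using the *new* Z_0.
    z[t + 1, 0] = rng.normal(rho * z[t, 1], np.sqrt(1 - rho**2))
    z[t + 1, 1] = rng.normal(rho * z[t + 1, 0], np.sqrt(1 - rho**2))

print(np.corrcoef(z[:, 0], z[:, 1])[0, 1])  # creeps toward rho as T grows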
It feels a lot like the Gauss-Seidel method for solving linear equations. In this method, you sweep left-to-right across your array of k x-values T times, updating them one-by-one in terms of the others. “New values…are used as soon as you obtain them.” Block Gibbs, discussed later, will feel like the Jacobi method. But a nice thing about Gauss–Seidel is that it’s space-efficient: you only need one vector x. Here, you have to keep each vector because that’s the whole point—sampling. You progressively fill each element of the sample. The matrix gets filled in the same order that we read: left to right, line by line.
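If the analogy helps, here's the contrast in code (again, my own sketch): Gauss–Seidel overwrites x in place, so later updates within a sweep see the new values; Jacobi builds the whole new vector from the old one, which is the block-Gibbs flavor.

import numpy as np

def gauss_seidel_sweep(A, b, x):
    # In place: x[i] is recomputed from the *latest* values of the other entries.
    for i in range(len(x)):
        x[i] = (b[i] - A[i, :i] @ x[:i] - A[i, i+1:] @ x[i+1:]) / A[i, i]
    return x

def jacobi_sweep(A, b, x_old):
    # Out of place: every new entry is computed only from last sweep's vector.
    x_new = np.empty_like(x_old)
    for i in range(len(x_old)):
        x_new[i] = (b[i] - A[i, :i] @ x_old[:i] - A[i, i+1:] @ x_old[i+1:]) / A[i, i]
    return x_new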
The tricky part is coming up with efficient conditional probability distributions to draw from.
Gibbs Sampling for Naive Bayes
We learn from the specific to the general. So let’s look at an example. The Naive Bayes (NB) model is fairly straightforward, because it assumes all of its features are independent of each other. In this example, we’re labeling documents into one of two classes. To do this, we use Bayes’ rule to flip the script, instead looking at a generative story.
\[L_j = \arg\max_L \Pr(L | \mathbf{W}_j) = \arg\max_L \Pr(\mathbf{W}_j | L) \Pr(L)\] In our generative story, we first sample a document’s label from our prior. Then we build the document. Weird, right? I’m only slowly getting accustomed to the idea of probabilistically generating the document. In our case, we’ve taken the NB assumption of independence between the features—the individual words. So each word is just a sample from a distribution over the vocabulary. The probabilities \(\mathbf{\theta}\) of each word are specific to each label. We draw all \(R_j\) words of document \(j\) at once like this:
\[\mathbf{W}_j \sim \mathrm{Multinomial}(R_j, \mathbf{\theta}_{L_j})\]Okay, hold on. What’s this prior? The prior on \(L_j\) is a Bernoulli (i.e., weighted coin-flip) distribution with parameter \(\pi\). What about the \(\pi\), though? (Feels like turtles all the way down.) Well, we know nothing, so we’ll pick a Beta prior—whose support (i.e., values you can sample from it) is \([0, 1]\). Good, because that’s the range of values that are allowed for \(\pi\). And we’ll use the hyper-parameters for it that make it uniform. (Yeah, those hyper-parameters were pulled out of nowhere. But now we have the uniform \(Beta(1, 1)\). We stick to this instead of a plain old uniform because Beta is a conjugate prior of Bernoulli.)
And what about our thetas? Well, the Dirichlet distribution can give an uninformed prior for the multinomial distribution, so we’ll just use that. Each \(\mathbf{\theta}\) is independently sampled from that Dirichlet. (The authors also note that the Beta distribution is a specific case of the Dirichlet.)
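Writing the generative story down as code helps me keep the turtles straight. A sketch with made-up sizes (vocabulary V, N documents, lengths R); the distributions are the TR's, the names are mine.

import numpy as np

rng = np.random.default_rng(0)
V, N = 1000, 20                    # hypothetical vocabulary and corpus sizes
R = rng.integers(50, 200, size=N)  # hypothetical document lengths

pi = rng.beta(1, 1)                        # the uniform Beta(1, 1) prior on the label coin
theta = rng.dirichlet(np.ones(V), size=2)  # one word distribution per label, from the Dirichlet

L = rng.binomial(1, pi, size=N)            # each document's label: a weighted coin flip
# Each document is just R_j draws from its label's word distribution (a bag of counts).
W = np.stack([rng.multinomial(R[j], theta[L[j]]) for j in range(N)])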
Back to our document-labeling example. How do we actually sample? First, we consider the space of possibilities. Our state is the value \(\pi\), the two vectors \(\theta_0\) and \(\theta_1\), and the labels \(\mathbf{L}\). (We’ve observed the documents, so they don’t count. That’s also what makes me uncomfortable about claiming that we can generate them.)
So looking back at our pseudocode, what is z_init()? It’s first sampling from our priors for \(\pi\) and \(\theta\), and then labeling each document using the probability \(\pi\). (Sure, sample \(\theta\) last if you want. More power to you.)
Now we need to derive the joint distribution. I don’t have the MathJax patience for typing this out. Suffice to say, they take \(\pi\) out of consideration by integrating it out. Marginalizing. Bayesian stuff.
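For the flavor of what integrating out \(\pi\) buys: under the uniform \(Beta(1, 1)\) prior, the label part of document \(j\)'s conditional only needs the label counts of the other documents (this is the standard Beta–Bernoulli predictive; the TR and its errata have the exact expression they use).
\[
\Pr(L_j = x \mid \mathbf{L}_{-j}) = \frac{C_x^{(-j)} + 1}{(N - 1) + 2}
\]
Here \(C_x^{(-j)}\) is the number of documents other than \(j\) currently labeled \(x\), and \(N\) is the total number of documents.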
The authors also give a nice TL;DR of how to get the conditional distributions.
The main thing to observe is that in the crucial sampling step, the denominator is just the numerator without \(z_i^{(t)}\), the variable whose new value you’re choosing. So when you’re sampling conditional distributions in more complex models, the basic idea will be the same: you subtract out counts related to the variable you’re interested in based on its current value, compute proportions based on the remaining counts, then pick probabilistically based on the result, and finally add counts back in according to the probabilistic choice you just made.
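Here's that recipe as code: the label-resampling sub-iteration, with \(\pi\) integrated out and the \(\theta\)s treated as given. The variable names and the Beta(1, 1) pseudo-counts are my assumptions, so check the TR (and its errata) before trusting the exact offsets.

import numpy as np

def resample_label(j, L, W, theta, label_counts, rng):
    # One sub-iteration: resample document j's label given everything else.
    # 1. Subtract out the counts contributed by document j's current label.
    label_counts[L[j]] -= 1
    # 2. Compute proportions from the remaining counts (in log space for safety):
    #    label part from the collapsed Beta-Bernoulli counts, word part from
    #    sum over w of count(w in doc j) * log theta[label, w].
    log_p = np.log(label_counts + 1.0) + W[j] @ np.log(theta).T
    p = np.exp(log_p - log_p.max())
    p /= p.sum()
    # 3. Pick probabilistically based on the result.
    L[j] = rng.choice(2, p=p)
    # 4. Add counts back in according to the choice just made.
    label_counts[L[j]] += 1

After a full pass over the documents you'd redraw each \(\theta_x\) from its Dirichlet posterior given the current word counts, then start the next sweep.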
Using labeled data
Just don’t sample their labels! You already know the answer. They’ll contribute to the counts and never disappear.
Obtaining values
Your Gibbs sampler needs time to converge to its stationary distribution. To keep the arbitrary initialization from unduly influencing the samples, sometimes people use “burn-in”—they’ll discard the first \(t\) samples. But don’t do that. Start your sampler in a reasonable place, then run it for as long as you can.
Autocorrelation, or dependence between successive samples, is also bad. Unfortunately, it’s the bedrock of Gibbs sampling. Some people skirt this by keeping only every \(n^{\textrm{th}}\) sample from the sampler (thinning).
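For reference, the mechanics of both tricks are just array slicing over the stack of samples (samples, burn_in, and n here are hypothetical names), even though the advice is to skip them:

kept = samples[burn_in:]  # burn-in: drop the first burn_in samples
thinned = samples[::n]    # thinning: keep only every nth sample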
Some people use multiple chains. Don’t. Just run one. And run it as long as you can.
The bottom line:
- Understand Gibbs sampling before you use it.
- Don’t bother with multiple runs or burn-in if you can run your sampler forever.
- Each sub-iteration updates one variable, and the new value is used for all later sub-iterations.
- The conditional distributions in NLP applications can often be found using counts.