Semi-amortized variational autoencoders
(Kim, Wiseman, Miller, Sontag, and Rush, 2018) at ICML
Latent variable models such as variational autoencoders (VAEs) have been investigated for language modeling and conditional text generation, because they can handle data uncertainty and increase expressivity. VAEs in particular are optimized with an approach called amortized variational inference (AVI). There is a gap, though, between the objective an amortized inference network can actually achieve and the true log-likelihood; this amortization gap comes on top of the approximation gap. The authors (including my former mentor Sam Wiseman) proposed a learning strategy that uses higher-order gradients to combine AVI with stochastic variational inference (SVI). One consequence is that the resulting model mitigates posterior collapse, which they illustrate with saliency maps.
Background and approach
I doubt anyone could lay out the terrain here better than the first three paragraphs of the paper. When we have an intractable distribution (such as a generative latent variable model \(p_\theta(x) = \int_z p_\theta(x, z) \mathrm{d}z\) where \(z\) is continuous and unobserved), we can use a tractable substitute \(q_\lambda \in Q\) to approximate it, optimizing to select the best member of the family \(Q\). This is called variational inference.1
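Concretely, the objective being optimized is the evidence lower bound (ELBO); for a single data point,

\[
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\lambda(z)}\!\left[\log p_\theta(x, z) - \log q_\lambda(z)\right] \;=\; \mathrm{ELBO}(\lambda, \theta, x),
\]

and variational inference chooses \(\lambda\) to make this bound as tight as possible.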
Two ways to learn these distributions are stochastic variational inference (SVI) and amortized variational inference (AVI). In SVI, we learn parameters \(\lambda\) for every data point (via gradient ascent). You can get a tight fit, but optimizing for every point is expensive. In AVI, we instead use a network shared (amortized) over the dataset to predict data point–specific parameters.
Variational autoencoders are a type of deep generative model. They are learned with AVI, jointly training the generator and the parameter-predicting inference network. Unfortunately, the amortization gap (the slack between the ELBO under the best per-instance variational parameters and the ELBO the shared network actually achieves) can be large. (This is distinct from the approximation gap, which comes from the variational family's inability to fully represent the true posterior.)
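Writing \(\lambda^* = \arg\max_\lambda \mathrm{ELBO}(\lambda, \theta, x)\) for the best member of \(Q\) on a given data point, the total slack decomposes as

\[
\log p_\theta(x) - \mathrm{ELBO}(\mathrm{enc_\phi}(x), \theta, x)
= \underbrace{\log p_\theta(x) - \mathrm{ELBO}(\lambda^*, \theta, x)}_{\text{approximation gap}}
\;+\; \underbrace{\mathrm{ELBO}(\lambda^*, \theta, x) - \mathrm{ELBO}(\mathrm{enc_\phi}(x), \theta, x)}_{\text{amortization gap}}.
\]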
While there are a dozen approaches to minimize this gap, this paper presents an exciting one: semi-amortized variational autoencoders. Their approach “differentiates through SVI while training the inference network/generative model.” The model combats the posterior collapse problem in variational autoencoders for text.
Stochastic variational inference
This is the local approach: learn \(\lambda\) separately for each \(x\) (or each batch of \(x\)'s). A minimal sketch in code follows the list.
- Sample a data point \(x\) from the training data.
- Randomly initialize \(\lambda\).
- Perform gradient updates to improve \(\lambda\) with respect to the ELBO objective, some \(k\) times.
- Now use gradient updates on the generator’s parameters \(\theta\).
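Here is a minimal PyTorch sketch of that loop, assuming a diagonal-Gaussian variational family \(\lambda = (\mu, \log\sigma^2)\) with a standard normal prior and a hypothetical `decoder` object whose `log_prob(x, z)` returns \(\log p_\theta(x \mid z)\) (none of these names come from the paper's code):

```python
import torch

def elbo(decoder, x, mu, logvar):
    # Reparameterized sample z ~ q_lambda(z) = N(mu, diag(exp(logvar)))
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    # ELBO = E_q[log p(x | z)] - KL(q || N(0, I)); the KL is in closed form
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    return decoder.log_prob(x, z) - kl

def svi(decoder, x, latent_dim, k=20, lr=1.0):
    """SVI: fit lambda = (mu, logvar) for a single x with k gradient steps.
    Zero init stands in for the random init in the recipe above."""
    mu = torch.zeros(1, latent_dim, requires_grad=True)
    logvar = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.SGD([mu, logvar], lr=lr)
    for _ in range(k):
        opt.zero_grad()
        loss = -elbo(decoder, x, mu, logvar)   # maximize the ELBO
        loss.backward()
        opt.step()
    # With lambda fixed, one would then take a gradient step on the
    # generator's parameters theta against the same negative ELBO.
    return mu.detach(), logvar.detach()
```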
Amortized variational inference
This is the global approach: a shared network \(\mathrm{enc_\phi}\) predicts parameters \(\lambda\) for each point. A sketch follows the list.
- Sample a data point \(x\) from the training data.
- Set \(\lambda = \mathrm{enc_\phi}(x)\).
- Perform gradient updates to improve the generator’s parameters \(\theta\) with respect to the ELBO.
- Now use gradient updates on the encoder’s parameters \(\phi\).
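The amortized version, again as a hedged sketch with the same hypothetical `decoder.log_prob(x, z)` and an `encoder` assumed to return \((\mu, \log\sigma^2)\); here \(\theta\) and \(\phi\) are updated jointly from a single backward pass, which is the usual VAE recipe:

```python
import torch

def avi_step(encoder, decoder, x, optimizer):
    """One amortized (i.e. standard VAE) update. The shared encoder predicts
    lambda = (mu, logvar); theta and phi are updated jointly here from a
    single backward pass."""
    optimizer.zero_grad()
    mu, logvar = encoder(x)                                # lambda = enc_phi(x)
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    loss = -(decoder.log_prob(x, z) - kl)                  # negative ELBO
    loss.backward()                                        # gradients for theta and phi
    optimizer.step()
```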
Semi-amortized VAEs
This combines both approaches, using the encoder to initialize the SVI steps; see the sketch after the list.
- Sample a data point \(x\) from the training data.
- Set \(\lambda = \mathrm{enc_\phi}(x)\).
- Perform gradient updates to improve \(\lambda\) with respect to the ELBO objective, some \(k\) times.
- Now use gradient updates on the generator’s parameters \(\theta\).
- Now use gradient updates on the encoder’s parameters \(\phi\).
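Putting the two together, a sketch of one semi-amortized training step under the same assumptions as the earlier sketches (this is my reconstruction, not the paper's code; the paper's inner optimizer and clipping details are more careful than plain gradient ascent):

```python
import torch

def savae_step(encoder, decoder, x, optimizer, k=10, svi_lr=1.0):
    """One semi-amortized step: enc_phi(x) initializes lambda, k differentiable
    SVI steps refine it, and the final ELBO trains both theta and phi."""
    optimizer.zero_grad()
    mu, logvar = encoder(x)                                # lambda_0 = enc_phi(x)
    for _ in range(k):                                     # inner SVI refinement
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
        kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
        elbo = decoder.log_prob(x, z) - kl
        # create_graph=True keeps the update inside the autograd graph, so the
        # final backward pass can differentiate through the SVI trajectory.
        g_mu, g_logvar = torch.autograd.grad(elbo, (mu, logvar), create_graph=True)
        mu = mu + svi_lr * g_mu                            # plain gradient ascent
        logvar = logvar + svi_lr * g_logvar
    z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()
    kl = -0.5 * (1 + logvar - mu.pow(2) - logvar.exp()).sum()
    loss = -(decoder.log_prob(x, z) - kl)                  # final negative ELBO
    loss.backward()                                        # updates theta and phi
    optimizer.step()
```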
This requires backpropagating through gradient descent itself: \(\lambda\) now carries a history of updates that depends on \(\phi\), so training needs higher-order gradients (in practice, Hessian-vector products). The authors appeal to LeCun et al. (1993) and Pearlmutter (1994) for this machinery.
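Concretely, each inner update \(\lambda_k = \lambda_{k-1} + \alpha \nabla_\lambda \mathrm{ELBO}(\lambda_{k-1}, \theta, x)\), with \(\lambda_0 = \mathrm{enc_\phi}(x)\), has Jacobian

\[
\frac{\partial \lambda_k}{\partial \lambda_{k-1}} \;=\; I + \alpha\, \nabla^2_\lambda\, \mathrm{ELBO}(\lambda_{k-1}, \theta, x),
\]

so pushing the gradient at \(\lambda_K\) back to \(\phi\) (and \(\theta\)) means multiplying through \(K\) such Jacobians. The Hessian never has to be formed explicitly; only Hessian-vector products are needed, which is where Pearlmutter (1994) comes in.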
Experiments
Matching a Synthetic Distribution
The authors create synthetic 5-character text sequences according to a generative model whose parameters are known. They then fix some of the parameters (the LSTMs that receive inputs from two normal distributions), and the goal is to match the remaining parameters. Their approach beats both AVI and SVI, achieving a tighter (higher) lower bound on the log-likelihood of the data.
Modeling text
The same task, but on real-world data. Their generative model is an LSTM language model that is fed the variational posterior's sample at every time step; a sketch follows.
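A sketch of what such a decoder could look like (sizes and wiring are illustrative, not the paper's configuration); it also fills in the hypothetical `decoder.log_prob(x, z)` interface assumed in the earlier sketches:

```python
import torch
import torch.nn as nn

class LatentLSTMDecoder(nn.Module):
    """Illustrative decoder in the spirit of the paper's text model: an LSTM
    language model that sees the latent sample z at every time step."""
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512, latent_dim=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # z is concatenated to the word embedding at each step
        self.lstm = nn.LSTM(embed_dim + latent_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def log_prob(self, x, z):
        # x: (batch, seq_len) token ids starting with <bos>; z: (batch, latent_dim)
        emb = self.embed(x[:, :-1])                          # inputs
        z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)   # repeat z at every step
        h, _ = self.lstm(torch.cat([emb, z_rep], dim=-1))
        logp = torch.log_softmax(self.out(h), dim=-1)
        tgt = x[:, 1:]                                       # next-token targets
        return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum()
```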
Images
I was less interested in images, so I won’t summarize this part. While they don’t outperform the state of the art, their improvements are ‘orthogonal’ to the state of the art’s bag of tricks.
Discussion
To convince us that the latent variable is useful, they produce a saliency map. Based on the gradient of the likelihood with respect to the latent variable, they can infer which words rely on the variable more or less (a rough sketch of this computation follows below). The latent variable appears to be important for determining content words, sentence length, and the type of question word.
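One plausible way to compute such a map, reusing the illustrative `LatentLSTMDecoder` above and taking the gradient norm of each token's log-probability with respect to \(z\) (the paper's exact saliency definition may differ in its details):

```python
import torch

def token_saliency(decoder, x, z):
    """Per-token saliency: norm of the gradient of each token's log-probability
    with respect to the latent z, using the illustrative LatentLSTMDecoder."""
    z = z.detach().requires_grad_(True)
    emb = decoder.embed(x[:, :-1])
    z_rep = z.unsqueeze(1).expand(-1, emb.size(1), -1)
    h, _ = decoder.lstm(torch.cat([emb, z_rep], dim=-1))
    logp = torch.log_softmax(decoder.out(h), dim=-1)
    tok_logp = logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)  # (batch, T)
    saliencies = []
    for t in range(tok_logp.size(1)):
        (g,) = torch.autograd.grad(tok_logp[:, t].sum(), z, retain_graph=True)
        saliencies.append(g.norm(dim=-1))                   # one value per sequence
    return torch.stack(saliencies, dim=1)                   # (batch, T)
```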
Still, as the authors note, their model is expensive if the generative model is expensive.
Tricks/hacks
Every paper with a neural network in it seems to have fifty hidden tricks to keep the whole edifice from falling apart. This paper uses:
- Gradient (and Hessian) clipping during training to avoid exploding gradients and Hessians
- KL annealing (a scheduled alteration of the ELBO objective, gradually increasing the weight of the KL term) to mitigate posterior collapse; a minimal schedule is sketched after this list. (This may not be necessary if you're willing to modify the loss function.)
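A minimal KL-annealing schedule, for concreteness (the warmup length here is arbitrary):

```python
def kl_weight(step, warmup_steps=10000):
    """Linear KL annealing: the KL term's weight ramps from 0 to 1 over
    warmup_steps."""
    return min(1.0, step / warmup_steps)

# Inside a training loop, the annealed negative ELBO would then be:
#   loss = -(log_px_given_z - kl_weight(step) * kl)
```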
Variational inference gets its name from the calculus of variations, which is all about optimizing over sets of functions. In this case, we’ve picked a set of functions \(Q\), and the function we optimize is a lower bound on the log-likelihood of the data. ↩