Transfer Learning for Low-Resource Neural Machine Translation

(Zoph, Yuret, May, and Knight, 2016) at EMNLP

There’s an incomplete version of this paper on arXiv. Please don’t do that with your own papers.

A widespread belief about neural machine translation is that it can’t work well for low-resource languages, which is why my advisor still develops non-neural methods. Zoph et al. have a different idea for how to handle low-resource MT: piggybacking on models trained in high-resource scenarios.

The size of the gap they're up against (delta = NMT BLEU minus SBMT BLEU):

| Language | Train size | Test size | SBMT BLEU | NMT BLEU | Delta |
|----------|------------|-----------|-----------|----------|-------|
| Hausa    | 1.0M       | 11.3K     | 23.7      | 16.8     | -6.9  |
| Turkish  | 1.4M       | 11.6K     | 20.4      | 11.4     | -9.0  |
| Uzbek    | 1.8M       | 11.5K     | 17.9      | 10.7     | -7.2  |
| Urdu     | 0.2M       | 11.4K     | 17.9      | 5.2      | -12.7 |

The authors use all of the normal NMT goodies, including input feeding from Luong et al.: at each decoder time step, the attentional vector built from the top layer's hidden state is fed back into the lowest layer at the next time step, a fun little loopy pattern.
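
Here's a minimal single-step PyTorch sketch of that loop. The class name, the single-layer setup, and the dot-product attention are my own illustrative choices, not the paper's code:

```python
import torch
import torch.nn as nn

class InputFeedingDecoderStep(nn.Module):
    """One decoder time step with Luong-style input feeding (illustrative only)."""

    def __init__(self, vocab_size, emb_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # The bottom layer's input is [target embedding ; previous attentional vector].
        self.rnn = nn.LSTMCell(emb_dim + hidden_dim, hidden_dim)
        self.attn_out = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, y_prev, h_tilde_prev, state, enc_outputs):
        # y_prev: (batch,) previous target ids; h_tilde_prev: (batch, hidden)
        # state: (h, c) LSTM state; enc_outputs: (batch, src_len, hidden)
        x = torch.cat([self.embed(y_prev), h_tilde_prev], dim=-1)
        h, c = self.rnn(x, state)
        scores = torch.bmm(enc_outputs, h.unsqueeze(-1)).squeeze(-1)  # dot-product attention
        weights = torch.softmax(scores, dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        h_tilde = torch.tanh(self.attn_out(torch.cat([h, context], dim=-1)))
        return h_tilde, (h, c)  # h_tilde loops back in as input at the next step
```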

The more interesting thing that they do here is transfer learning. One example would be to share a target-language decoder (here, into English) across encoders for different source languages. This is a bit different from multitask learning because we optimize the shared decoder with one source encoder, then hope it carries over to others. This is helpful in low-resource scenarios: you can learn to decode into English in a high-resource setting, so you only have to learn your encoder weights on the limited data. (I know, I know. English chauvinism. It comes with the territory—our grant is only for translation into English.) While previous work has done this, the authors are working in a low-resource scenario. Additionally, the high- and low-resource data are from different domains—they claim this is more realistic.

Their specific process is to train a model on French–English, which they call the parent. They then continue training this already-initialized model on the new language pair, e.g. Uzbek–English; this is the child model. This amounts to a strong prior over possible models: which models are more likely? Ones close to the trained parent, rather than ones close to a random initialization. Each Uzbek word is initially mapped to a random French embedding; those embeddings get updated into sensible ones during training, while the English embeddings start out sensible already.
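
A minimal PyTorch sketch of that handoff, assuming the child keeps the parent's architecture and vocabulary sizes; `build_model`, the checkpoint path, and the index-reuse comment describe my toy setup, not the authors' exact pipeline:

```python
import torch

def make_child(child_model, parent_checkpoint_path):
    # Load the fully trained French-English parent.
    parent_state = torch.load(parent_checkpoint_path)
    # Copy every parent parameter: attention, decoder, English target embeddings,
    # and the source embedding table. Row i of that table was learned for a French
    # word, but in the child it now stands for whatever Uzbek word sits at index i,
    # so each Uzbek word effectively starts from an arbitrary French embedding.
    child_model.load_state_dict(parent_state)
    return child_model

# Usage: train the parent on French-English, then keep training on Uzbek-English
# from this initialization (build_model and the path are hypothetical):
# child = make_child(build_model(), "fr_en_parent.pt")
```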

The authors note that their model is competitive with the string-to-tree non-neural (SBMT) system when they ensemble 8 neural models together. They also use their model to re-score the SBMT system's output, comparing against re-scoring with the non-transfer baseline and with a neural language model; the transfer model outperforms both.
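
Mechanically, that re-scoring amounts to re-ranking the SBMT system's n-best list with a combined score. A toy sketch, where the linear interpolation and the weight `lam` are my assumptions rather than the paper's exact recipe:

```python
def rerank_nbest(nbest, nmt_logprob, lam=0.5):
    """Pick the best hypothesis from an SBMT n-best list.

    nbest: list of (hypothesis, sbmt_score) pairs.
    nmt_logprob: callable mapping a hypothesis to the NMT model's log-probability.
    """
    return max(nbest, key=lambda pair: (1 - lam) * pair[1] + lam * nmt_logprob(pair[0]))
```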


The authors conduct a few experiments to show the effects of various components.

  • They found (rather anecdotally, on just two language pairs) that a parent language more closely related to the child improved performance.
  • They made a synthetic "French" by randomly permuting the French vocabulary, then found that transfer into this fake French was better than transfer into Uzbek: a 50% larger gain in BLEU over the baseline.
  • Their learning curves show that the baseline models train just fine, but the transfer models generalize better on held-out data; the non-transfer models seem to settle into worse local minima.
  • They froze certain components of their model, finding that the best performance came from freezing everything except the input and output target embeddings (a minimal freezing sketch follows this list).
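
A minimal PyTorch sketch of that freezing recipe; the module names (`encoder.embed`, `decoder.embed_out`) are placeholders for wherever your model keeps its embeddings, not the authors' code:

```python
def freeze_all_but_embeddings(model):
    # Freeze every parameter of the child model...
    for p in model.parameters():
        p.requires_grad_(False)
    # ...then unfreeze only the embedding layers named in the bullet above
    # (placeholder module names for the input and target output embeddings).
    for module in (model.encoder.embed, model.decoder.embed_out):
        for p in module.parameters():
            p.requires_grad_(True)
    # Hand the optimizer only the parameters that still receive gradients.
    return [p for p in model.parameters() if p.requires_grad]

# e.g. optimizer = torch.optim.SGD(freeze_all_but_embeddings(child), lr=0.5)
```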

The bottom line: transfer works, largely because it helps the child model generalize. You can also get fine performance by freezing parts of your model, which reduces the number of gradients you need to compute.

Written on May 23, 2018