Offline Bilingual Word Vectors, Orthogonal Transformations, and the Inverted Softmax

(Smith et al., 2017) at ICLR

Like Monday’s paper, this paper doesn’t really do translation. Today’s paper evaluates automatically generated word dictionaries. How do you align two spaces of word vectors with each other? It’s been shown that this is possible with a linear transformation. This work argues that the linear transformation should also be orthogonal, which makes it more robust to noise. They construct a bilingual dictionary from strings shared by both vocabularies (sounds like Ravi and Knight’s ‘linguistic expertise’) and outperform prior work.

First off, why are we aligning these spaces? Because it’s useful for translation to have one common semantic space for the source and target languages. Usually, the mapping is learned while the word vectors themselves are being adjusted (“online”), but it can also be learned offline, after both sets of vectors are fixed. Simply by fitting a linear transformation matrix between the spaces, Mikolov et al. (2013) achieved 33% accuracy in translating English words to Spanish.
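To make the offline setup concrete, here’s a minimal sketch of that kind of least-squares fit, assuming we already have a seed dictionary giving paired rows in X (source-language vectors) and Y (their translations). The data below is synthetic, just to show the shape of the computation.

```python
import numpy as np

# Toy stand-in for a seed dictionary: n paired d-dimensional word vectors.
# X holds source-language vectors, Y the corresponding target-language vectors.
rng = np.random.default_rng(0)
n, d = 1000, 300
X = rng.normal(size=(n, d))
true_map = rng.normal(size=(d, d))
Y = X @ true_map.T + 0.01 * rng.normal(size=(n, d))  # noisy "translations"

# Mikolov-style offline fit: solve X @ B ~= Y by least squares,
# so a source vector x maps to W @ x (with W = B.T) in the target space.
B, *_ = np.linalg.lstsq(X, Y, rcond=None)
W = B.T
print(np.linalg.norm(X @ W.T - Y) / np.linalg.norm(Y))  # small relative residual
```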

The authors present a very nice analogy for how this could work.

To develop an intuition for these two approaches, we note that the similarity of two word vectors is defined by their cosine similarity. The vectors have no intrinsic meaning, it is only the angles between vectors which are meaningful. This is closely analogous to asking a cartographer to draw a map of England with no compass. The map will be correct, but she does not know which direction is north, so the angle of rotation will be random. Two maps drawn by two such cartographers will be identical, except that one will be rotated by an unknown angle with respect to the other. There are two ways the cartographers could align their maps. They could draw the maps together, thus ensuring that landmarks are placed nearby on both maps during “training”. Or they could draw their maps independently, and then compare the two afterwards; rotating one map with respect to the other until the major cities are aligned.

The big trick here is that the singular value decomposition (which everyone should learn about) is a closed-form tool for getting an exact solution to this constrained optimization problem, the orthogonal Procrustes problem of finding the best orthogonal map, much like least-squares is the tool for an unconstrained best-fit linear transformation. (Ouch, the same solution was also proposed independently around the same time; see the note at the end.) The cool thing about the SVD is that it makes it really easy to drop low-variance components by deciding that they’re noise.
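As a rough sketch of that closed-form recipe (not the authors’ code): given paired seed vectors X and Y as before, the best orthogonal map falls out of a single SVD.

```python
import numpy as np

def orthogonal_procrustes(X, Y):
    """Closed-form orthogonal W minimizing sum_i ||W x_i - y_i||^2,
    where the rows of X are source vectors and the rows of Y are their
    dictionary translations. The returned W satisfies W.T @ W = I."""
    U, _, Vt = np.linalg.svd(Y.T @ X)  # SVD of the d x d correlation matrix
    return U @ Vt                      # optimal rotation: W = U V^T
```

SciPy exposes the same computation as scipy.linalg.orthogonal_procrustes, if you’d rather not write it yourself.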


Let’s look at why the transformation \(W\) between the spaces \(X\) and \(Y\) should be orthogonal. We can see this through a similarity matrix \(S = YWX^\intercal\). Each element \(S_{ij}\) contains the dot-product similarity of the word vectors \(y_i\) and \(Wx_j\) (the transformed \(x_j\)).

We could also imagine a second similarity matrix for transforming in the other direction: \(S' = XQY^\intercal\). (1) We want the similarities to match up position by position; it shouldn’t matter which space we call \(X\) and which we call \(Y\). In notation, \(S' = S^\intercal\). Writing out the transpose, \(S^\intercal = XW^\intercal Y^\intercal\), so \(Q = W^\intercal\). (2) Using \(Q = W^\intercal\) as the reverse map, transforming a word across and back must return the original: \(x = W^\intercal Wx\). The family of matrices with the property \(W^\intercal W = I\) is called orthogonal. Orthogonal matrices preserve lengths and angles: if we normalize \(X\) and \(Y\), each entry of \(S\) is a cosine similarity, a measure of the angle between the vectors, which is exactly what we care about.
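A quick numerical sanity check of both properties, with a random orthogonal matrix standing in for a learned \(W\):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 50
W, _ = np.linalg.qr(rng.normal(size=(d, d)))  # random orthogonal matrix via QR

X = rng.normal(size=(200, d))   # rows are "source" word vectors
Y = rng.normal(size=(300, d))   # rows are "target" word vectors

S = Y @ W @ X.T                  # similarity of y_i with W x_j
S_prime = X @ W.T @ Y.T          # the reverse direction, with Q = W^T

print(np.allclose(S_prime, S.T))        # True: both directions agree
print(np.allclose(W.T @ W, np.eye(d)))  # True: W is orthogonal
x = X[0]
print(np.allclose(np.linalg.norm(W @ x), np.linalg.norm(x)))  # True: lengths preserved
```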


The other trick is what they call the “inverted softmax”, which is really a softmax normalized over the other axis of the same similarity matrix. By using it, they avoid the problem of “hubs”: target words that end up as the nearest neighbour of many different source words. Instead of asking which target word is closest to the source word, it asks which target word would translate back onto the source word best.
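Here’s a rough sketch of the idea rather than the paper’s exact formulation: S is the similarity matrix from above (target words on the rows, source words on the columns), and the inverse temperature beta, a free parameter in the paper, is hard-coded here just for illustration.

```python
import numpy as np

def translate(S, beta=10.0, inverted=True):
    """Pick a target word for each source word from similarity matrix S,
    where S[i, j] is the similarity of target word i and mapped source word j."""
    P = np.exp(beta * (S - S.max()))  # subtract the global max for numerical stability
    if inverted:
        # Inverted softmax: normalize over source words, so a "hub" target word
        # that is close to everything has its probability mass spread thin.
        P = P / P.sum(axis=1, keepdims=True)
    else:
        # Standard softmax: normalize over target words for each source word.
        P = P / P.sum(axis=0, keepdims=True)
    return P.argmax(axis=0)           # chosen target index for each source word
```

With the standard softmax, a single hub row can dominate many columns at once; normalizing along the rows spreads that hub’s mass across all the source words and penalizes it.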

Their method seems to work better than the general linear-fit method, as well as the canonical correlation analysis–based method proposed by Faruqui and Dyer. Accuracy drops off for rare words, but isn’t that pretty universal? Even so, they’re able to bump scores way up without a (true) bilingual dictionary: performance jumps from 1% or 3% to 40% or 34%.

The useful tricks:

  • Using orthogonal matrices to transform one vector space into another
  • SVD for solving the orthogonal Procrustes problem
  • Truncating the SVD, deleting directions with low singular values, so you don’t learn noise
  • The “inverted softmax” to deal with “hub” words

NB: The idea that the transformation should be orthogonal seems to have already been poked at by Hamilton et al. (2016). The authors also note that Artetxe et al. (2016) presented the same idea independently, shortly before they did. (They even recognize that all three papers are solving the same problem.)

Written on June 6, 2018