Finding Cognate Groups using Phylogenies
(Hall and Klein, 2010) at ACL
On Wednesday we talked about the comparative method, which reconstructs the phonemes and morphemes of an ancestor language from its descendants. Today's paper is a computational take on it, automatically performing Step 1: assembling lists of cognates across large numbers of languages.
Both Lyle Campbell and the first paper in this series discussed the problems in creating cognate lists: borrowings and false positives. While people have addressed the problem at the language-pair level, a multilingual approach gives the chance to catch mistakes that any single pair of languages would miss. The authors present a generative model of cognate groups over a phylogenetic tree, learned from monolingual (un-aligned) word lists.
The generative model has three parts: survival, evolution, and alignment.
For survival, they note that cognates wind up clustered among related languages rather than spread randomly across the tree. They draw G ~ geom(p), the number of cognate groups. Then, for each group g, they decide at each branch whether the word survives and is passed further down the tree; each language/group pair has its own survival probability. Remember that we're dealing with abstract words at this point. This step decides how much of the tree our cognate group covers.
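To make the survival step concrete, here is a minimal sketch of sampling which languages one cognate group reaches. This is not the authors' code: the toy tree and probabilities are invented, and I use one probability per branch rather than one per language/group pair.

```python
import random

# Toy phylogeny: each node maps to its children (leaves have none).
# The tree and the survival probabilities below are invented for illustration.
TREE = {
    "Latin": ["Iberian", "Italian"],
    "Iberian": ["Spanish", "Portuguese"],
    "Italian": [],
    "Spanish": [],
    "Portuguese": [],
}

def sample_survival(root, survival_prob, rng=random):
    """Return the set of languages one (abstract) cognate word reaches.

    Starting at the root, the word is passed down each branch independently
    with that branch's survival probability (default 0.8 here); once it dies
    on a branch, no descendant below that branch receives it.
    """
    alive = {root}
    frontier = [root]
    while frontier:
        node = frontier.pop()
        for child in TREE[node]:
            if rng.random() < survival_prob.get(child, 0.8):
                alive.add(child)
                frontier.append(child)
    return alive

# One cognate group's coverage of the tree:
print(sample_survival("Latin", {"Italian": 0.9, "Spanish": 0.7}))
```

In the full model you would first draw the number of groups G from a geometric distribution and run this once per group, with group-specific survival probabilities.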
In the evolution step, they create a model for each edge, generating the daughter word from the parent. They have a parameterized editing model, expressible as a weighted finite-state transducer; each ancestor–descendant edge has its own editing model. The probability of a specific descendant word given its ancestor is the sum over all possible derivations: that is, we marginalize out the derivation to get p(desc | ances). The authors note that an FST isn't the best model here because it can't condition on context.
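Here is a rough sketch of what "marginalize out the derivation" means, using a far simpler edit model than the paper's transducer: a handful of invented copy/substitute/insert/delete probabilities, summed over every derivation by dynamic programming rather than enumeration.

```python
from functools import lru_cache

# Toy edit probabilities; a real model would parameterize these per edge
# and per character pair. These numbers are invented for illustration.
P_COPY, P_SUB, P_INS, P_DEL, P_STOP = 0.6, 0.1, 0.1, 0.1, 0.1

def edit_prob(ancestor: str, descendant: str) -> float:
    """p(descendant | ancestor): sum over all edit derivations.

    Each derivation is a sequence of copy/substitute/insert/delete
    operations; dynamic programming sums over them without enumerating.
    """
    @lru_cache(maxsize=None)
    def f(i: int, j: int) -> float:
        # f(i, j) = probability of producing descendant[j:] from ancestor[i:]
        if i == len(ancestor) and j == len(descendant):
            return P_STOP
        total = 0.0
        if i < len(ancestor) and j < len(descendant):
            op = P_COPY if ancestor[i] == descendant[j] else P_SUB
            total += op * f(i + 1, j + 1)      # copy or substitute
        if j < len(descendant):
            total += P_INS * f(i, j + 1)        # insert descendant[j]
        if i < len(ancestor):
            total += P_DEL * f(i + 1, j)        # delete ancestor[i]
        return total

    return f(0, 0)

print(edit_prob("focus", "fuoco"))   # Latin -> Italian 'fire'
```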
Once the words have been created according to the evolution model, we need an injective mapping between words and cognate groups. Each language ℓ has an injective function π_ℓ which, viewed generatively, assigns the words of its cognate groups to positions in that language's vocabulary list. Learning this alignment is the central task of the entire paper.
To do inference, they maximize the log probability of each word taking on a position, given the parameters, the injective alignment function, and the other languages. Doing this jointly is generally intractable, so they instead improve one language at a time. For a single language this amounts to a bipartite matching (assignment) problem, which famous people (Jacobi first, later the Hungarian algorithm) showed can be solved in polynomial time.
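As a sketch of the per-language matching step, here is the assignment problem solved with a generic off-the-shelf solver (scipy's linear_sum_assignment). The score matrix is random noise standing in for the model's log probabilities, and the output plays the role of the injective alignment π_ℓ; none of this is the authors' actual inference code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Stand-in for log p(word at position j | cognate group i, other languages);
# in the real model these scores would come from the survival and evolution steps.
rng = np.random.default_rng(0)
log_scores = rng.normal(size=(5, 5))   # 5 cognate groups x 5 word positions

# Maximize total log probability = minimize its negation.
groups, positions = linear_sum_assignment(-log_scores)

# The result is an injective alignment pi_l for this language.
pi = dict(zip(groups.tolist(), positions.tolist()))
print(pi)
```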
Their learned transducer model performs very well. They even report good performance on the reconstruction task, identifying the closest Latin word to each cognate group. They report a reconstruction error of 3.8 Levenshtein points, versus 3.0 for a system with oracle cognate assignments. This isn't actually that good, is it? Not enough to call it a "small increase in error", at least. Without a random baseline, we don't know how impressive it is to be that close to the oracle. (Evaluation metrics are the hill I will die on. Stuart Shieber taught me that a result is two numbers, and my addendum is that the two numbers need to be the right ones.) If we had baselines with varying amounts of intelligence in the three stages of the model, each used to reconstruct the Latin words, I'd be happy.