# Interpolating between types and tokens by estimating power-law generators

(Goldwater, Griffiths & Johnson, 2006) at NIPS

Your statistical model is trained under the assumption that its training distribution matches the testing distribution. If you want to learn from type-level data like a dictionary or gazetteer, you can’t simply append it to your training data; that skews the distribution. In natural language, words occur with a power-law (Zipfian) distribution, and our generative models should match this. The authors achieve this by augmenting their generator with a frequency adaptor.

The solution is a two-stage language model:

1. A generator, which produces words (perhaps a uniform multinomial distribution).
2. An adaptor, which governs their frequency.

Our adaptor of choice is the Pitman–Yor process (PYP), which generalizes the Chinese restaurant process. It’s a way of seating people at tables (or adding new ones) with a preference toward tables that are already crowded. The tables are unbounded in number and capacity; each represents some label or class. The probability of being seated at a given table (put in a given class $k$) is gnarly:
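The equation appears to have been lost here; reconstructing it from the symbol definitions below (so treat the exact form as a best-effort restoration), the $i$th customer is seated according to

$$
P(z_i = k \mid \mathbf{z}_{<i}) =
\begin{cases}
\dfrac{n_k^{\mathbf{z}_{<i}} - a}{i - 1 + b} & 1 \le k \le K \\[6pt]
\dfrac{Ka + b}{i - 1 + b} & k = K + 1 \text{ (a new table)}
\end{cases}
$$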

And what’s worse: it feeds into a bigger equation for the probability of the $i$th word, since the labels of the tables are not fixed.
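Reconstructing that bigger equation in notation consistent with the seating rule (where $\ell_k$ is the label of table $k$ and $\theta_w$ is the generator’s probability of word $w$): the $i$th word is $w$ if the customer joins any existing table labeled $w$, or opens a new table whose label is then drawn from the generator,

$$
P(w_i = w \mid \mathbf{z}_{<i}, \boldsymbol{\ell}) = \sum_{k \,:\, \ell_k = w} \frac{n_k^{\mathbf{z}_{<i}} - a}{i - 1 + b} \;+\; \frac{Ka + b}{i - 1 + b}\,\theta_w
$$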

Here, $\theta$ is the parameter of our multinomial generator. $K$ is the number of tables so far. $n_k^{\mathbf{z}_{<i}}$ is the number of people at table $k$ so far. $a$ and $b$ control the power law’s intensity and the preference toward new tables, respectively.
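To make the adaptor concrete, here is a minimal sampler for the seating process, assuming a fixed multinomial generator (the helper name `pyp_sample_word` and the toy vocabulary are mine, not the paper’s):

```python
import random

def pyp_sample_word(tables, labels, a, b, theta, rng):
    """Draw the next word from a Pitman-Yor adaptor over a generator.

    tables[k] holds n_k, the number of customers at table k;
    labels[k] is the word type served there; theta maps each word
    to its generator probability. Mutates tables/labels in place.
    """
    n = sum(tables)  # customers seated so far, i.e. i - 1
    r = rng.random() * (n + b)
    # Join existing table k with weight n_k - a ...
    for k, n_k in enumerate(tables):
        r -= n_k - a
        if r < 0:
            tables[k] += 1
            return labels[k]
    # ... or open a new table (weight K*a + b) and label it
    # with a fresh draw from the generator.
    words = list(theta)
    word = rng.choices(words, weights=[theta[w] for w in words])[0]
    tables.append(1)
    labels.append(word)
    return word

rng = random.Random(0)
theta = {"walk": 0.5, "jump": 0.3, "kiss": 0.2}
tables, labels = [], []
corpus = [pyp_sample_word(tables, labels, 0.5, 1.0, theta, rng)
          for _ in range(1000)]
```

The rich-get-richer weight $n_k - a$ is what produces power-law token frequencies in `corpus` even though the generator itself is a plain multinomial.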

From this, you can get the probability of a sequence of words by summing over all category values $\mathbf{z}$ and table labelings $\boldsymbol{\ell}$.
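Schematically (my paraphrase, not the paper’s exact formula), with table labels drawn i.i.d. from the generator:

$$
P(\mathbf{w}) = \sum_{\mathbf{z}} \sum_{\boldsymbol{\ell}} P(\mathbf{z}) \left( \prod_{k=1}^{K} \theta_{\ell_k} \right) \prod_i \mathbb{1}\!\left[ w_i = \ell_{z_i} \right]
$$

where $P(\mathbf{z})$ chains the seating probabilities together.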

When $a$ approaches 1, the probability of joining an existing table drops to zero, so every token gets its own table, yielding a token-based system. Approaching 0 instead gives a type-based system. “The sum is dominated by the seating arrangement that minimizes the total number of tables…in which every word type receives a single table.”
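Both limits are easy to check by simulating the seating process and counting tables (a toy sketch; `crp_table_count` is my own helper, and word labels are ignored since only table counts matter here):

```python
import random

def crp_table_count(n_tokens, a, b, seed=0):
    """Seat n_tokens customers under PYP(a, b) and count the tables."""
    rng = random.Random(seed)
    tables = []  # tables[k] = customers at table k
    for i in range(n_tokens):
        r = rng.random() * (i + b)
        for k, n_k in enumerate(tables):
            r -= n_k - a  # weight of joining existing table k
            if r < 0:
                tables[k] += 1
                break
        else:
            tables.append(1)  # new table, weight len(tables)*a + b
    return len(tables)

# a = 1: joining a singleton table has weight 1 - 1 = 0, so every
# token opens its own table and the table count equals the token count.
print(crp_table_count(300, a=1.0, b=0.5))   # 300
# a = 0: new tables appear only via the small b term, so far fewer
# tables open; in the type-based limit each type keeps a single table.
print(crp_table_count(300, a=0.0, b=0.01))
```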

To test their model, the authors use an essentially toy task: segmenting English verbs into stem and affix. While they don’t compare against a baseline, they show that low values of $a$ helped whether evaluating on types or tokens, i.e., a preference toward types.

Written on April 9, 2020