Word Ordering Without Syntax

(Schmaltz, Rush, and Shieber, 2016) at EMNLP

This paper was part of the wave that showed how recurrent neural nets could be applied to heaps of tasks. They find that a neural language model without explicit knowledge of syntax is effective at rediscovering the word order of a shuffled sentence.


The problem is pretty straightforward. You’re given the words and punctuation of a sentence, scrambled. Instead of “Papa ate the caviar with a spoon .”, you get “. ate with the a caviar spoon Papa1—utterly unintelligible. The challenge is to rediscover the original order.

What makes this challenging is that a sentence of \(N\) words has \(N!\) possible arrangements. It’s NP-hard to evaluate each arrangement individually by brute force, to identify the highest quality.2


The authors believe that explicit syntactic information isn’t the key to reordering. Instead, they claim that the best ordering should be the most probable one according to a language model. They propose two language models: an n-gram model and an LSTM.

I assume you’re already familiar with how to produce an n-gram or LSTM language model; otherwise, take a look at JHU’s NLP course. The more interesting part of their approach is the decoding with beam search, a compromise between greedy decoding and exhaustive search that is standard in machine translation. The TLDR is that you produce hypotheses about which phrase is next, then keep only the K best before continuing.

def beam_search(
        phrases: Set[Phrase],  # Noun phrases or phrases containing one token
        K: int,  # Beam size
        g: Callable[[Set[Phrase]], Score],  # converts remaining phrases into a score
        model  # API is given in the GitHub Gist.
    M = len(phrases)  # Number of phrases may not be number of tokens.
    # Make empty beams.
    beams: List[Hypothesis] = [[] for _ in range(M + 1)]
    beams[0] = [Hypothesis([], phrases, 0.0, model.initial_state())]

    for m in range(M):  # keep consuming phrases:
        for k in range(len(beams[m])):  # for each hypothesis at this point:
            hypothesis = beams[m][k]
            (story, candidates, score, state) = astuple(hypothesis)

            for phrase in candidates:
                # Evaluate the best candidates at this point. Overgenerate, then cull
                new_score, new_state = score, state
                for word in phrase:  # Score this phrase.
                    new_score += model.score(word, new_state)
                    new_state = model.transition(word, new_state)
                j = m + len(phrase)

                new_hypothesis = Hypothesis(story + [phrase], candidates - {phrase}, new_score, new_state)

                # Extract top K from model.
                ordered = sorted(beams[j], key=lambda hyp: model.score_sequence(hyp.story) + g(hyp.candidates))
                beams[j] = ordered[:K]
    return beams  # The consumer can take the best from beams[m].

A runnable example (with a dummy model) is available here.


They use the Penn Treebank as data, and they optionally increase their vocabulary size by also using the Annotated Gigaword3 corpus, a JHU project. They’re actually incredibly lucid about their preprocessing; I’m certain that it could be reproduced rather easily with grep -w.

The authors try two task settings:

Words all provided independently of each other.
Base noun phrases (those without other noun phrases inside) are also provided.


The models are evaluated with BLEU score. In machine translation, BLEU is grudgingly accepted, but it’s a natural measure for this task—how many n-grams overlap with the reference?

When base noun phrases are included, the n-gram model beats a syntactic model by 5 BLEU, and the LSTM beats it by another 5. Omitting knowledge of base noun phrases makes for slightly smaller gains. Adding in Gigaword boosts scores by about 2 points. Larger beams (they tried widths from 11 to 512) also seem to consistently improve performance.

One last objection they refute is that syntax may be more helpful for highly distorted or longer sentences. They find instead that the LSTM outperforms the syntactic models over all lengths and distortion amounts. Further, parsing the sentences with the Yara shows that they’re syntactically better than the syntax-guided models.

Followup work by Song, Zhang, and Gildea (2018) show that incorporating the syntactic information into a neural model (the best of both worlds) improves performance further.


Syntax models aren’t necessary to learn word order from shuffled words. A neural language model is good enough. As always, more data helps. But it can be cheaply acquired because syntactic annotations are not necessary.

Fun bit: the unigram heuristic \(g\) they use for down-weighting upcoming characters.

  1. Courtesy of this bash pipeline: echo "Papa ate the caviar with a spoon ." | tr ' ' '\n' | sort -R | paste -s -d ' ' -. I’m always a fan of people learning more bash tricks. 

  2. See Knight (1999)

  3. The annotations aren’t necessary for Schmaltz et al.’s method; it’s only for the syntactic models that need it. 

Written on May 26, 2019