# Two thoughts on group theory

Each semester, I work through one or two textbooks. I’ve just picked up Dummit and Foote’s Abstract Algebra and begun chapter 1. (Weirdly, my math degree never introduced me to abstract algebra.) Two thoughts come to mind.

## Faces, vertices, edges

First, I had a different route for solving Section 1.2’s problem 10 than the two suggested in the book. It’s not a particularly involved problem, but it’s still interesting.

# Semi-amortized variational autoencoders

Latent variable models such as variational autoencoders (VAEs) have been investigated for language modeling and conditional text generation, because they can handle data uncertainty and increase expressivity. VAEs in particular are optimized with an approach called amortized variational inference (AVI). There’s a gap, though, between what AVI’s shared inference network can achieve and the best instance-specific variational posterior: the amortization gap. (This is in addition to the approximation gap.) The authors (including my former mentor Sam Wiseman) propose a learning strategy that uses higher-order gradients to combine AVI with stochastic variational inference (SVI). As a consequence, they mitigate posterior collapse, as visualized through saliency maps.

# Interpolating between types and tokens by estimating power-law generators

Your statistical model is trained under the assumption that its training distribution matches the testing distribution. If you want to learn from type-level data like a dictionary or gazetteer, it’s not as simple as appending those entries to your training data; that skews the distribution. In natural language, words occur with a power-law (Zipfian) distribution, so we should have generators that match it. The authors get one by augmenting their generator with a frequency adaptor.
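To unpack what “power-law (Zipfian)” means here, a tiny sketch (my own toy numbers, not from the paper): under Zipf’s law, the frequency of the rank-$r$ word is proportional to $1/r^s$.

```python
import numpy as np

# Toy Zipfian distribution over 10,000 word types with exponent s = 1.
s = 1.0
ranks = np.arange(1, 10_001)
probs = ranks**-s / (ranks**-s).sum()

print(probs[0] / probs[1])   # 2.0: the top word is twice as likely as the runner-up
print(probs[:100].sum())     # ~0.53: the top 100 types cover over half the tokens
```

A dictionary, by contrast, is roughly uniform over types — exactly the mismatch the paper’s frequency adaptor is meant to bridge.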

# A Distributional and Orthographic Aggregation Model for English Derivational Morphology

Derivational morphology is that awkward, messy uncle that nobody really talks about, compared to its golden-boy sibling inflectional morphology. The authors present an improved system for generating derivational forms from their base words.

# Word Ordering Without Syntax

This paper was part of the wave that showed how recurrent neural nets could be applied to heaps of tasks. They find that a neural language model without explicit knowledge of syntax is effective at rediscovering the word order of a shuffled sentence.

# Optimizing Segmentation Granularity for Neural Machine Translation

Byte pair encoding (BPE) is now a standard way to preprocess data in neural machine translation. It combats data sparsity by creating a vocabulary of subwords to use instead of words. The subword units may be more common than the words, especially in morphologically rich languages. The question remains how many subwords to use. People tend not to sweep this hyperparameter. This paper does it automatically (for the target-side vocabulary), without adding training time. It doesn’t make a huge improvement in BLEU score, but it’s a fascinating method.
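For the uninitiated, here’s a minimal sketch of how BPE builds that subword vocabulary: repeatedly merge the most frequent adjacent symbol pair. (This is a toy version of Sennrich et al.’s algorithm, using their classic example corpus; the granularity question in this paper amounts to choosing `num_merges`.)

```python
from collections import Counter

def merge_pair(word, pair):
    """Merge every adjacent occurrence of `pair` in a space-separated symbol string."""
    symbols, out, i = word.split(), [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return ' '.join(out)

def learn_bpe(vocab, num_merges):
    """Learn BPE merges from a {space-separated word: frequency} dict."""
    vocab, merges = dict(vocab), []
    for _ in range(num_merges):
        pairs = Counter()
        for word, freq in vocab.items():
            symbols = word.split()
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(w, best): f for w, f in vocab.items()}
    return merges

corpus = {'l o w </w>': 5, 'l o w e r </w>': 2,
          'n e w e s t </w>': 6, 'w i d e s t </w>': 3}
print(learn_bpe(corpus, 3))   # first merges ('e', 's'), then ('es', 't'), ...
```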

# ‘Lighter’ Can Still Be Dark: Modeling Comparative Color Descriptions

Sometimes one word on a matter isn’t enough. To describe color, we resort to compositional names: “light blue”, “cherry red”. This paper looks at a group of these: comparative names. Those are names like “duller olive” or “slightly paler”.

# An Exploration of Neural Sequence-to-Sequence Architectures for Automatic Post-Editing

Automatic post-editing (APE) exists because our translation models are fallible. (Hopefully, they’re also fallible in reliable ways.) Once your translation system has spit out a predicted translation of a sentence, you can feed both the prediction and the original, foreign sentence into your APE system to clean it up, making it feel more human. This paper presents a neural architecture for APE—it’s similar to the traditional neural MT architecture, but with dual attention—one over the source, and one over the translation output.

# Neural Arithmetic Logic Units

It’s time to restart in preparation for the new semester!

Neural networks have trouble with things as simple to us as learning the identity function $f(x) = x$. Within the numerical range they’re trained on, they do fine, but they can’t extrapolate.

# Improving Low Resource Machine Translation using Morphological Glosses

Translation models demand oodles of training text. In low-resource languages, we need tricks to circumvent the data famine. Languages with many inflected forms for each base word make the data even scarcer. This paper presents a trick for feeding the hungry by multiplying our few loaves and fish: glossing words and morphological annotations. (Also: my advisor is on this paper :D .)

# Neural Factor Graph Models for Cross-lingual Morphological Tagging

Good morphological taggers (the software that sees hablaste and recognizes that it’s in the preterite tense, second-person singular) are important because they’re used to improve later tasks. Today’s paper presents a state-of-the-art tagger that combines neural nets with conditional random fields. I’d previously looked at combining neural nets and graphical models, and I think this is a better example.

# Learning Hidden Unit Contribution for Adapting Neural Machine Translation Models

This NAACL paper seeks to perform domain adaptation while learning only a few model parameters. It does this by learning to amplify or diminish the values coming from hidden nodes (their contributions), thus the paper title: “Learning Hidden Unit Contribution”. This is useful when you have a lot of out-of-domain text to train with, but very little in-domain text.

# Conditional Time Series Forecasting with Convolutional Neural Networks

Convolutional networks are nice for sequence prediction, because the computations can be parallelized, unlike in recurrent networks. (I’ve previously given a cutesy demo of how the convolutional filter slides over the input.) Evidently these can be used for time series forecasting, and it looks a lot like neural language modeling.
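The trick that makes convolutions usable for forecasting is *causality*: output $t$ may only see inputs up to $t$, which you get by left-padding. A minimal single-channel sketch (my own, not the paper’s architecture):

```python
import numpy as np

def causal_conv1d(x, kernel):
    """y[t] = sum_i kernel[i] * x[t - i], with x treated as 0 before time 0,
    so each output depends only on present and past inputs."""
    y = np.zeros(len(x))
    for t in range(len(x)):
        for i, k in enumerate(kernel):
            if t - i >= 0:
                y[t] += k * x[t - i]
    return y

x = np.arange(6.0)
print(causal_conv1d(x, [1.0, 2.0]))        # [ 0.  1.  4.  7. 10. 13.]

# Causality check: perturbing the input at t = 3 leaves earlier outputs unchanged.
x2 = x.copy(); x2[3] += 100
print(causal_conv1d(x2, [1.0, 2.0])[:3])   # same first three values as above
```

Every output position can be computed independently — that’s the parallelism recurrent nets lack.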

# Using Aspect Extraction Approaches to Generate Review Summaries and User Profiles

I’m incredibly proud of a friend that I met in elementary school, Veronica Wharton, for her publication at NAACL this year. She and her colleagues at Airbnb have exploited maximum entropy classifiers to pick apart users’ reviews.

# Offline Bilingual Word Vectors, Orthogonal Transformations, and the Inverted Softmax

Like Monday’s paper, this paper doesn’t really do translation. Today’s evaluates automatically generated word dictionaries. How do you align two spaces of word vectors to each other? It’s been shown that it’s possible with a linear transformation. This work says that the linear transformation should also be orthogonal—making it more robust to noise. They construct a bilingual dictionary from overlapping strings (sounds like Ravi and Knight’s ‘linguistic expertise’) and outperform prior work.
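The orthogonality constraint has a tidy closed-form solution, the orthogonal Procrustes problem: the orthogonal $W$ minimizing $\|XW - Y\|$ is $UV^\top$, where $U\Sigma V^\top$ is the SVD of $X^\top Y$. A sketch on synthetic “word vectors” (the data and dimensions here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "source-language" vectors and a hidden rotation.
X = rng.normal(size=(500, 50))
Q_true, _ = np.linalg.qr(rng.normal(size=(50, 50)))
Y = X @ Q_true + 0.01 * rng.normal(size=X.shape)   # noisy "target" vectors

# Orthogonal Procrustes solution via SVD of X^T Y.
U, _, Vt = np.linalg.svd(X.T @ Y)
W = U @ Vt

print(np.allclose(W @ W.T, np.eye(50)))                # True: W is orthogonal
print(np.linalg.norm(X @ W - Y) / np.linalg.norm(Y))   # small relative residual
```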

# Parsing and Hypergraphs

My favorite things to combine: graph theory and NLP. Today’s paper takes a look at parsing from the lens of Dijkstra’s algorithm for shortest paths. The authors plainly state the key idea of this paper:

> There is intuitively very little difference between (a) combining subtrees to form a tree, (b) combining hypotheses to form a conclusion, and (c) visiting all tail nodes of a hyperarc before traversing to a head node.

They use this view to propose a parsing algorithm.

# Neural Poetry Translation

Douglas Hofstadter’s Le Ton Beau de Marot points out that machine translation is mainly driven by military and industry. There’s little room for nuance and beauty. Ghazvininejad et al. address that problem.

# Composing graphical models with neural networks for structured representations and fast inference

NB: I link to an updated version of the paper on arXiv. Wish they could retroactively correct the NIPS version.

My office the summer I worked at Harvard was right across from Ryan P. Adams’s. Cool to see that he’s making contributions at Twitter. The gist of this paper is that they’ve connected nonlinear likelihoods to latent-variable graphical models.

# Deciphering Foreign Language

Did you like those puzzle games as a kid? The ones where you had to unscramble or shift letters to find a secret message? That’s the approach that Ravi and Knight take to translation in the zero-resource setting: when no parallel text is available in the target language.

# Storks Deliver Babies (p = 0.008)

Another facetious but highly informative paper. Is there a connection between storks and babies? Yes, with a low enough p-value that people might bat an eye.

# Extreme Adaptation for Personalized Neural Machine Translation

While I was at MTMA, a quick one-night project I explored with Brian Thompson was whether we could fiddle with biases in the final softmax layer of an NMT model to perform domain adaptation, purely off word frequency. Short answer: no, or at least not how we did it.

Michel and Neubig thought of bias-meddling [and made it work (!)]. Instead of adapting to new domains, they adapted to individual speakers by learning how to update the biases correctly. This paves the way for speaker-specific MT, as well as inferring properties of speakers.

# A Generalization of Principal Component Analysis to the Exponential Family

Principal component analysis was the dimensionality reduction technique we learned in my machine learning class at SMU. The start of every pipeline was center, scale, and run PCA. Unfortunately, as it’s presented, it’s designed for continuous data. This paper extends it to other data types: integers and booleans.

# Notes from the 4th Machine Translation Marathon of the Americas

> History never repeats itself, but the kaleidoscopic combinations of the pictured present often seem to be constructed out of broken fragments of antique legends.
>
> Mark Twain

Warren Weaver gave the first introduction to machine translation, claiming that our goal was “erecting a new Tower of Anti-Babel.” He talked about a “mythical” time when people could communicate simply with one another.

# Transfer Learning for Low-Resource Neural Machine Translation

There’s an incomplete version of this paper on arXiv. Please don’t do that with your own papers.

A widespread belief about neural machine translation is that it can’t work well for low-resource languages, which is why my advisor still develops non-neural methods. Zoph et al. have a different idea for how to handle low-resource MT: piggybacking on models trained in high-resource scenarios.

| Language | Train Size | Test Size | SBMT BLEU | NMT BLEU | Delta |
|----------|-----------:|----------:|----------:|---------:|------:|
| Hausa    | 1.0M       | 11.3K     | 23.7      | 16.8     | -6.9  |
| Turkish  | 1.4M       | 11.6K     | 20.4      | 11.4     | -9.0  |
| Uzbek    | 1.8M       | 11.5K     | 17.9      | 10.7     | -7.2  |
| Urdu     | 0.2M       | 11.4K     | 17.9      | 5.2      | -12.7 |

# Recurrent neural network based language model

Neural networks, the broken record of modern NLP. This paper presents a 50% reduction in the perplexity of a language model when RNNs are used, instead of n-gram backoff. This paper was cool because it’s a rare example of both better results and faster training.

# Learning Global Features for Coreference Resolution

Wiseman et al. previously used a neural net to rerank a coreference resolution system, and now they’re augmenting that by having a recurrent neural network (RNN) learn representations of co-referent clusters straight from the text—they’ve made it end-to-end. Their claim is that cluster-level features are a challenge to concoct, so why not let a machine handle it? And in doing so, they improve on the state of the art.

# Improving Character-based Decoding Using Target-Side Morphological Information for Neural Machine Translation

Translating in “morphologically rich languages” (those with a lot of verb conjugations and noun declensions) is difficult. These authors seek to circumvent the issues by including morphological tables as inputs to their decoder.

(Figure: attention over a morphological lookup table while generating the Turkish word terbiyesizlik, which can be decomposed as terbiye-siz-lik.)

# Transition-Based Dependency Parsing with Stack Long Short-Term Memory

Today’s paper tackles a method for putting a continuous representation inside of a dependency parser.

# Learning Language Representations for Typology Prediction

Today we enter the wonderful world of linguistic typology: inventories of languages’ features. Databases of these features have been compiled, such as the World Atlas of Language Structures. The details of typology can be useful for downstream tasks.

The authors propose a model of learning these typological features from a translation system into English. The goal is to fill in gaps in a database of features. They outperform a baseline relying on geography and phylogeny.

# Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings

A good day is when an NLP paper talks about network science. (Except when they reinvent ideas.)

As an aside, I’m in a great mood today. I hope you are, too.

# Making Sense of Word Embeddings

Mmm, word sense plus graph theory. I’m in heaven. This paper proposes a way to determine word senses (is that band the bracelet or band the musical group) from normal embeddings using clustering and ego networks. Their results are comparable to the state of the art. Unfortunately, the state of the art doesn’t beat the majority-class heuristic.

# Gibbs Sampling for the Uninitiated

Kevin Knight’s tutorial on Bayesian methods is one of the most approachable, human pieces of writing to describe a highfalutin concept. This technical report from the University of Maryland at College Park applies that style to Gibbs sampling. It’s less afraid to introduce maths, and it still grounds the problems in our intuitions as NLP researchers.

# The Phrasal Lexicon

This paper is rather tongue-in-cheek, and it’s a real delight to read. It argues that phrases, not just words, represent free-standing concepts. We manipulate words and phrases to assemble new meanings. It has just enough of a point to it that it doesn’t deserve publication in SIGBOVIK, and there’s a twist ending that I won’t give away.

# Towards Neural Phrase-based Machine Translation

Ah, phrase-based machine translation. Translating spans of words, rather than individuals. It harkens back to a quainter time, before the neural craze. And now, it seems that we’re marrying it to the neural craze. On some simple experiments, the model (a venture by Microsoft Research) outperforms an NMT baseline and gives outputs that humans like, too.

We’re now up and kicking with an RSS feed! Thanks to Tongfei Chen for the tip. It’s easy to add an RSS feed to sites created with GitHub Pages (like mine) using this tutorial. You can now subscribe to this page using your favorite RSS reader. (Does anyone have a favorite? I’m on the market.)

# Cumulative cultural evolution in the laboratory: An experimental approach to the origins of structure in human language

Today’s paper relates to yesterday’s by looking at humans in the process. It talks about language as a culturally constructed artifact. Every time we produce an utterance, we convey not only the meaning behind the utterance, but also details about the artifact—the nature of the language.

# Simpler Grammar, Larger Vocabulary: How Population Size Affects Language

The title makes this paper’s message pretty clear. When the speaker pool of a language gets large, easy-to-transmit information like vocabulary grows, while harder ideas like grammar tend to simplify. The authors present a network science model to show the diffusion of these concepts. With it, they conjecture that globalization will engender a simplification of grammatical structure.

# A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning

You know what’s cool? Doing multitask learning that’s actually multitask learning, instead of arbitrarily calling something multi-task (like morphological re-inflection). This paper presents a deep model that attempts to do it all for NLP: tagging, chunking, NER, semantic roles, and on and on.

# Bayesian Learning of a Tree Substitution Grammar

We’re getting super Bayesian today. If you’re not ready, check out Kevin Knight’s tutorial on Bayesian statistics for NLP. In fact, his paper specifically talks about how to learn tree substitution grammars. (What does it mean to learn a grammar? We assume that we already have the derivation rules, and we want to learn their relative chances of being used given the context.) This paper tackles it from a more concrete angle, mixing Gibbs sampling to deal with the complexity of the space and nonparametric priors to address overfitting.

# Natural Language Does Not Emerge 'Naturally' in Multi-Agent Dialog

Did you like yesterday’s paper? Too bad. Today’s calls it out. (Not because its methods were bad, but because it’s a stretch to call what it did “natural language”. Yoav Goldberg has presented a similar critique of other work.) This paper calls out some shortcomings in approaches before yesterday’s: that passive, corpus-based conversation agents never learn grounding: the mapping between physical concepts and words. It also points out something that yesterday’s Lewis game agents couldn’t do: compositionality. Because the Lewis sender is restricted to one symbol, how could it ever compose ideas?

# Multi-Agent Cooperation and the Emergence of (Natural) Language

Remember when Google had two computers devise a crypto system that humans can’t interpret? There are other multi-agent games that can find useful coding of information. This paper presents a model of two agents who learn to describe images to each other, as well as how to turn those messages into natural language.

Spoiler: I’d hardly call what it generates “natural language”. It labels images. With single words.

# Translating Pro-Drop Languages with Reconstruction Models

Translating from pro-drop languages to non-pro-drop languages is a challenge. How do you synthesize the implied information, when your model can only think on the level of words, not grammar? If the bread is delicious, you want to ask, “Did you bake it?”—not “Are you baked?”

Today’s paper presents a two-step process for doing this: automatically add the dropped pronouns back into the source, then use a multi-task model to predict both the annotated source and the translation from the original source.

# Joint Processing and Discriminative Training for Letter-to-Phoneme Conversion

Ah, joint learning. It’s cool, isn’t it? It lets information flow both ways: the intermediate representations have to be good for later tasks. This paper jointly learns input segmentation, phoneme prediction, and sequence modeling.

# DopeLearning: A Computational Approach to Rap Lyrics Generation

The authors devise a method for ranking candidate rap lines conditioned on the previous lyrics. They report a 21% increase over human rappers in terms of rhyme density. Their method can be applied to other text synthesis problems such as customer service chatbots.

# Character Sequence Models for Colorful Words

Today we get to rip off the NMT literature for multi-modal work, this time predicting colors given their names. Woo hoo. Also, they’ve got a demo going. Type words, get colors.

# Synthetic and Natural Noise Both Break Neural Machine Translation

This paper endeavors to show the brittleness of neural machine translation systems. When typos or shuffling exist within words, performance plummets.

# Diachronic Word Embeddings Reveal Statistical Laws of Semantic Change

Today’s paper looks at how meanings evolve. The authors claim to have uncovered two laws of semantic change. The first is the “law of conformity”: the rate of semantic change scales inversely with frequency. The second is the “law of innovation”: meanings change faster in words with more senses.

# Best Practices for Scientific Computing

(Wilson et al., 2014) in PLoS biology

We spend so much time writing code, but software engineering classes are generally garbage. Plus, nobody writes tests for experimental code being run on a cluster. We just lurch from bug to bug until we get a working version.

Today’s paper, recommended by Tom Lippincott, presents a few antidotes for scientific software struggles.

# Automatic Discrimination between Cognates and Borrowings

(Ciobanu and Dinu, 2015) at ACL

It is said that “no man is an island.” Most definitely, no language is an island. They borrow from one another to extreme degrees. Today’s paper comes from two authors at the University of Bucharest, in the middle of the notorious Balkan Sprachbund. This is a natural area to study the difference between cognates—which came from a shared ancestor—and borrowings. They use orthography to discern.

# Non-Autoregressive Neural Machine Translation

(Gu et al., 2017) on arXiv; submitted to ICLR 2018

Neural machine translation is slow because of how we decode. This paper calls the popular method “auto-regressive”: each output token depends on all previous tokens. This is inherently sequential and cannot be parallelized, so this paper introduces a workaround. There’s nothing new under the sun: they reintroduce the fertility parameters from the old IBM models. They also rely on an auto-regressive teacher, so the model is trying to see farther by standing on the shoulders of giants.

# Bayesian Inference with Tears

Kevin Knight describes a very real phenomenon for a first-year grad student: everyone is throwing out expressions like “Integrate over the possible values” or “Don’t you see how this is a generative model?” and you want to quit school to move to the woods of North Carolina and bagpipe on the side of a hill. When the Bayesian craze struck, Knight had to work and avoid being left behind, so he wrote a workbook to teach others.

# A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants

(Neal and Hinton 1998) in Learning in Graphical Models, ed. Michael I. Jordan.

# How Brains are Built: Principles of Computational Neuroscience

(Granger, 2011) in Cerebrum

This paper treats the word “computational” not as something pertaining to computers, but rather as David Marr would use it. Marr identifies three levels of phenomena: the computational, the algorithmic, and the implementational. (Pylyshyn refers to these as the semantic, syntactic, and physical.) It’s outside of NLP, but I’ve meant to read it for a long time—a glimpse into the state of computational neuroscience as it attempts to adapt models from the brain to other purposes.

# Finding Cognate Groups using Phylogenies

(Hall and Klein, 2010) at ACL

On Wednesday we talked about the comparative method, which reconstructs the phonemes and morphemes of an ancestor language from its descendants. Today is a computational take on it, automatically performing Step 1: assembling lists of cognates for large numbers of languages.

# Non-distributional Word Vector Representations

(Faruqui and Dyer, 2015) at ACL

(Busy day, so I’m looking at a short paper!)

This paper is a critique of word vectors. Because word vectors are difficult to interpret, the authors present extremely sparse word vectors drawing from known semantic and syntactic resources. If your language has the resources to make these vectors, they’re inexpensive to produce. They’re clear, and they’re competitive.

# The Comparative Method and Linguistic Reconstruction

(Campbell, 1998). From Historical Linguistics: An Introduction

Today, we’re going in a different direction: a dive into the past. We’re looking into how linguists create the phylogenetic relationships between languages, as well as how they reconstruct the proto-languages that turned into our present-day tongues.

# Decoding Anagrammed Texts Written in an Unknown Language and Script

(Hauer and Kondrak, 2016) in TACL

What mysteries does the Voynich Manuscript hold? Its unknown writing system has tantalized linguists for a century. Today’s paper makes some informed judgments about the language it might be written in, as well as presenting a more general pipeline for deciphering unknown text. Their results show that Hebrew is an outlier among the languages they test, suggesting the most probable language for the text.

# From Characters to Words to in Between: Do We Capture Morphology?

(Vania and Lopez, 2017) at ACL

We know that words exist, and we know that we can break them up into smaller parts—”rereading” can be broken into the morphemes ["re-", "read", "-ing"], or into the characters ["r", "e", "r", "e", "a", "d", "i", "n", "g"]. Today’s paper explores which of these levels is the best representation for language modeling. It also shows the benefit of morphological analysis—without it, even tons more data won’t carry the day. A model trained on true morphological features outperforms eight character-based models.

# Found in Translation: Reconstructing Phylogenetic Language Trees from Translations

(Rabinovich et al., 2017) at ACL

I’m starting a new thing where I write about a paper every day, inspired by The Morning Paper. Let me know what you think.

Today’s paper addresses the question of whether the act of translation or the source language has greater effect on the resultant translation. They go so far as to claim:

> The signal of the source language is so powerful that it is retained even after two phases of translation.

Exploiting this, they recreate the phylogeny of languages using only monolingual data. This is interesting from a historical linguistics perspective, and it’s also useful for native language identification (NLI).

# Neural Machine Translation by Jointly Learning to Align and Translate

(Bahdanau et al., 2014) orally at ICLR 2015


This paper was the first to show that an end-to-end neural system for machine translation (MT) could compete with the status quo. When neural models started devouring MT, the dominant model was encoder–decoder. This reduces a sentence to a fixed-size vector (basically smart hashing), then rehydrates that vector using the decoder into a target-language sentence. The main contribution of this paper, an “attention mechanism”, lets the system focus on little portions of the source sentence as needed, bringing alignment to the neural world. Information can be distributed, rather than squeezed into monolithic chunks.
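The mechanics of attention fit in a few lines. Here’s a generic dot-product sketch (note the assumption: Bahdanau et al. actually score with a small feed-forward net, the additive variant; this simpler scoring just shows the idea):

```python
import numpy as np

def attention(query, keys, values):
    """Score each source position against the query, softmax the scores,
    and return the weighted average of the source vectors."""
    scores = keys @ query                    # one score per source position
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                 # softmax: weights sum to 1
    return weights @ values, weights

keys = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
context, w = attention(np.array([10.0, 0.0]), keys, keys)
print(w.round(3))   # mass piles onto the positions matching the query
```

The decoder gets `context` — a fresh, query-specific summary of the source at every step — instead of one fixed vector for the whole sentence.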

# Phylogenetics of Indo-European Language Families via an Algebro-Geometric Analysis of their Syntactic Structures

(Shu et al., 2017) on arXiv


Shu et al. look at the similarities and lines of descent between languages using the same techniques used for comparing phylogenies of organisms. This isn’t the first time that this similarity has been looked at, but their model lets language families be compared to each other.

# We all pick our hill to die on

And mine is evaluation metrics. Don’t be obstinate in choosing them, and don’t report only the ones that satisfy your hypotheses.

I’m picking this fight ten years too late as a project on word sense induction (WSI) has led me to a pivotal paper on evaluating WSI—a pivotal paper which irks me.

### Prelude

The work is Agirre & Soroa (2007). They performed an unsupervised evaluation of WSI as a clustering problem. My mind leaps to the popular measure normalized mutual information (NMI) or its improvement, adjusted mutual information (AMI). Granted, AMI was unveiled in 2009, so it’s fair that it wasn’t picked. But the choices they did make were weird.

The chosen measures were F-score, entropy, and purity. And the authors decided that F-score would be their golden boy. The others are merely included “for the sake of completeness.”

- Purity: how much do clusters contain elements of just one class? Higher ↔ better.
- Entropy: how mixed up are the classes and clusters? Lower ↔ better; lower suggests more agreement between classes and clusters.
- F-score: balances precision and recall. Popular in NLP. Weird here, and demonstrably flawed. Higher ↔ better.

Hoo boy.
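To make the first two measures concrete, here’s a from-scratch sketch (standard clustering-eval formulas; the toy labels are mine). Note how a degenerate one-cluster-per-item clustering aces both:

```python
import math
from collections import Counter

def purity(classes, clusters):
    """Fraction of items covered by their cluster's majority class (higher is better)."""
    groups = {}
    for c, k in zip(classes, clusters):
        groups.setdefault(k, []).append(c)
    return sum(Counter(g).most_common(1)[0][1] for g in groups.values()) / len(classes)

def cluster_entropy(classes, clusters):
    """Size-weighted entropy of the class labels inside each cluster (lower is better)."""
    groups = {}
    for c, k in zip(classes, clusters):
        groups.setdefault(k, []).append(c)
    h = 0.0
    for g in groups.values():
        for count in Counter(g).values():
            p = count / len(g)
            h -= (len(g) / len(classes)) * p * math.log2(p)
    return h

classes = ['a', 'a', 'b', 'b']
print(purity(classes, [0, 0, 1, 1]), cluster_entropy(classes, [0, 0, 1, 1]))  # 1.0 0.0
print(purity(classes, [0, 1, 2, 3]), cluster_entropy(classes, [0, 1, 2, 3]))  # also 1.0 0.0!
print(purity(classes, [0, 0, 0, 0]), cluster_entropy(classes, [0, 0, 0, 0]))  # 0.5 1.0
```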

## A meta-evaluation

Let’s look at how the scores agree with one another. We’ll use a special correlation coefficient called Spearman’s rho ($\rho$), designed to compare two sets of rankings. Like Pearson’s $r$, values close to 0 mean no signal, and values close to ±1 have high agreement. (A $\rho$ of -1 would suggest that one rank is the reverse of the other.)
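If you’ve never met Spearman’s rho, it’s just Pearson’s correlation computed on the rank vectors. A quick sketch (my own toy version, assuming no ties), reusing a few F-scores from the table below:

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors (no-ties version)."""
    rx = np.argsort(np.argsort(x))   # rank of each element
    ry = np.argsort(np.argsort(y))
    return np.corrcoef(rx, ry)[0, 1]

a = np.array([78.9, 78.7, 66.3, 66.1, 63.9])
print(spearman_rho(a, a))       # 1.0: identical rankings
print(spearman_rho(a, -a))      # -1.0: one ranking is the reverse of the other
print(spearman_rho(a, a**2))    # 1.0: rho only cares about order, not magnitude
```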

| System   | F-score | Purity | Entropy |
|----------|--------:|-------:|--------:|
| 1c1word* | 78.9    | 79.8   | 45.4    |
| UBC-AS   | 78.7    | 80.5   | 43.8    |
| upv_si   | 66.3    | 83.8   | 33.2    |
| UMND2    | 66.1    | 81.7   | 40.5    |
| I2R      | 63.9    | 84.0   | 32.8    |
| UofL     | 61.5    | 82.2   | 37.8    |
| UOY      | 56.1    | 86.1   | 27.1    |
| Random*  | 37.9    | 86.1   | 27.7    |
| 1c1inst* | 9.5     | 100.0  | 0.0     |

Systems are ordered by their F-scores in the paper, and we stick to that convention. Anything with a star beside it is a baseline.

### The meat

It also seems like a really low move to include 1c1inst, which induces pathological behavior in those measures. (Purity and entropy assess the homogeneity of a clustering, but not its parsimony; by contrast, a measure like NMI also favors low numbers of clusters.) Obviously every cluster will be pure if each contains only a single element. I know purity and entropy are flawed, but they’re still better than F-score.

```
In [44]: df
Out[44]:
   f_score  purity  entropy
0     78.9    79.8     45.4
1     78.7    80.5     43.8
2     66.3    83.8     33.2
3     66.1    81.7     40.5
4     63.9    84.0     32.8
5     61.5    82.2     37.8
6     56.1    86.1     27.1
7     37.9    86.1     27.7

In [45]: df.corr(method='spearman')
Out[45]:
          f_score    purity   entropy
f_score  1.000000 -0.874267  0.857143
purity  -0.874267  1.000000 -0.994030
entropy  0.857143 -0.994030  1.000000
```


Purity and entropy almost perfectly predict one another! F-score doesn’t agree too well with either of them, unfortunately. (Thus my initial derision toward this evaluation.) And when I include that awful 1c1inst baseline, the scores don’t change much.

```
In [47]: df2 = df.append([[9.5, 100, 0]], ignore_index=True)

In [48]: df2.corr(method='spearman')
Out[48]:
          f_score    purity   entropy
f_score  1.000000 -0.912142  0.900000
purity  -0.912142  1.000000 -0.995825
entropy  0.900000 -0.995825  1.000000
```


F-score looks like it agrees a little more. Entropy and purity agree just about as well as they already did. If those two measures are so happy, why put all your money on F-score?

#### The external

We see why when we compare these rankings to the results on the supervised tests.

| System | Rank |
|--------|-----:|
| I2R    | 1    |
| UMND2  | 2    |
| upv_si | 3    |
| MFS    | 4    |
| UBC-AS | 5    |
| UOY    | 6    |
| UofL   | 7    |

```
In [62]: df3
Out[62]:
   f_score  purity  entropy  external
1     78.7    80.5     43.8         5
2     66.3    83.8     33.2         3
3     66.1    81.7     40.5         2
4     63.9    84.0     32.8         1
5     61.5    82.2     37.8         7
6     56.1    86.1     27.1         6

In [63]: df3.corr(method='spearman')
Out[63]:
           f_score    purity   entropy  external
f_score   1.000000 -0.714286  0.714286 -0.371429
purity   -0.714286  1.000000 -1.000000 -0.028571
entropy   0.714286 -1.000000  1.000000  0.028571
external -0.371429 -0.028571  0.028571  1.000000
```


Ah, F-score: the redemption. On the small set of systems here, there was no indication of supervised performance correlating with purity or entropy; it’s all double dutch. F-score actually does correlate somewhat (ρ = -0.37 against the supervised rank, where a lower rank is better): worth mentioning, though not great.

It would be nice to evaluate the clusterings with other metrics, ones actually used outside of information retrieval. Then again, this may make us the fabled “well-meaning, but misguided scientist successfully adapting her/his experiment until the data shows the hypothesis she/he wants it to show”. It’d also be nice to bootstrap for significance (everyone loves error bars).

## Closing thoughts

I’d really like to evaluate these clusters with a known, effective measure like AMI or ARI, ones that encourage parsimony. That would mean reimplementing all of the shared task systems or asking the original authors for decades-old code. Neither is very likely. But the intrinsic evaluation measures used here, while easy to interpret and familiar, are dated and have pathological flaws. People don’t use them to benchmark clustering algorithms, so we shouldn’t use them to evaluate clusters in downstream tasks either.
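For the curious, ARI fits in a dozen lines (this is the standard Hubert-Arabie formula; the toy labels are mine). Unlike purity, it gives the pathological one-cluster-per-instance clustering no credit:

```python
from math import comb
from collections import Counter

def adjusted_rand_index(classes, clusters):
    """Hubert-Arabie adjusted Rand index: 1.0 for a perfect clustering,
    ~0.0 for a chance-level one. (Undefined if both labelings are trivial.)"""
    n = len(classes)
    index = sum(comb(v, 2) for v in Counter(zip(classes, clusters)).values())
    a = sum(comb(v, 2) for v in Counter(classes).values())
    b = sum(comb(v, 2) for v in Counter(clusters).values())
    expected = a * b / comb(n, 2)
    max_index = (a + b) / 2
    return (index - expected) / (max_index - expected)

classes = ['a'] * 4 + ['b'] * 4
print(adjusted_rand_index(classes, [0, 0, 0, 0, 1, 1, 1, 1]))  # 1.0: perfect
print(adjusted_rand_index(classes, list(range(8))))            # 0.0: singletons earn nothing
```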

EDIT: And to top it off, I found a recent paper that cites the flaws-in-F-score paper above, and then proceeds to just use F-score anyway. It cites clustering literature, but only for definitions of basic things like “network”.

## References

1. Eneko Agirre and Aitor Soroa. 2007. SemEval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, pages 7–12.

# You can't just make everything a hyperlink, Arya.

I get asked from time to time: Why do I include so [redacted] many links in pages on my blog? This blog’s about page, my StackOverflow profile, even my CV…they’re full of links to other people’s cool stuff. It’s a public service, really. I’m such a giver.

The reason is twofold:

1. It helps you, my readers.
2. It helps the people I cite.

Adding a boatload of links to a page means you, dear reader, can get to the content you want. It also makes the page easier to read.

### Connecting

If someone is flipping through my CV and has never heard of the Robert S. Hyer Society or the Harvard–Amgen Scholars Program, there’s a handy, unobtrusive link in the PDF that they can click or tap—just like on a webpage.

Including links is also a great way to help readers call bullshit. If you make a grandiose claim, you’d better back it up. I take this seriously in my CV. If I give a presentation, I link to it. If I win an award, I link to it. Forget fake news. You can verify it yourself, and I make it easy by pointing to people better-known than I am.

### Interpretability

English is notorious for this thing where stringing nouns together gives you a new, modified noun. (German does this, too, but better. They drop the spaces in between. Informally, I’d like to see English compounds like summerprogram.) Donald Knuth sees this as a problem:

> Resist the temptation to use long strings of nouns as adjectives: consider the packet switched data communication network protocol problem.

And you wouldn’t be so bold as to disagree with him, would you?

But when there’s that magic blue underline, you know the words belong in a group together. (A+ example: up above, the words “about page”.) Grouping makes a sentence easier to mentally parse, and not being sent down a garden path makes it easier to read.

This places a burden on the linker: make sure what you’re underlining fits under one grammatical tag. Earlier in this section, I underlined the entire verb phrase “sees this as a problem”. You’d raise an eyebrow if I had skipped underlining “sees”.

## Helping people I cite

It’s simple: the rich get richer. If a person or website has been helpful or important to me, a small act of repayment is attribution. (In fact, academics live and die by citation.) In linking to someone else’s resources, I make them more visible. And that’s good for them.

I don’t mean to suggest some massive viewership for this blog. No, citation helps people by the magic of Google. (Or Bing. I use Bing because Bing pays money. Is it worth the hours lost from inferior academic search results? Compare Bing vs Google.)

### PageRank

PageRank was what made Google a hit. Named after Larry Page (and not a webpage), it ranks results based on the reputation other pages give them.

There’s some basic linear algebra that goes into how the weights come about, but the gist is a story of cumulative advantage. Pages with more links to them get more reputation, and that means they (a) show up higher in lists, and (b) have more impact on pages they link to. See how in the figure, C is different from the handful of other pages that link to B because super-popular B links back to C. B is claiming that there’s something worthwhile in C, and because of B’s high reputation, the claim carries a lot of weight.
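That “basic linear algebra” can be sketched in a few lines of power iteration. This is a toy four-page web (an illustrative link structure, not the figure’s), with the usual 0.85 damping factor:

```python
import numpy as np

# Toy PageRank by power iteration. links[i] lists the pages page i links to.
links = {0: [1], 1: [2], 2: [1, 3], 3: [1]}
n, d = 4, 0.85  # d is the standard damping factor

# Build the column-stochastic transition matrix: M[dst, src] is the
# probability of hopping from src to dst.
M = np.zeros((n, n))
for src, outs in links.items():
    for dst in outs:
        M[dst, src] = 1.0 / len(outs)

# Start uniform and iterate the PageRank update until it settles.
rank = np.full(n, 1.0 / n)
for _ in range(100):
    rank = (1 - d) / n + d * M @ rank

# Page 1 collects links from three other pages, so it ends up on top.
assert rank.argmax() == 1
```

The cumulative-advantage story falls out of the iteration: a page’s rank is redistributed to the pages it links to, so in-links from popular pages are worth more than in-links from obscure ones.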

I see myself as one of the purple dots, though. If I link to E and make it more visible, then I make its reputation (and weight) greater. That’s how pointing to things and people I like helps them out. It makes them more powerful influencers. With more visibility, they can do more.

It’s a golden-rule sort of thing (or at least, a categorical imperative): link unto others as you would have them link unto you. If you’re providing helpful content, people help both you and strangers by linking.

# A subtle syntax distinction in Python

Sophomore year, I thought my professor was being archaic when he seriously discussed FORTRAN as a language, but Lawrence Livermore National Lab still uses it because the numerical routines written in it have been battle-tested for decades. Python makes things better by wrapping those routines in simple syntax, batteries included. But there are some kinks, too. I tracked down a subtle, damaging flaw in a major Python library and corrected it.

I’m a contributor to networkx, the most successful graph library for Python. We use Travis for continuous integration (CI), making sure that each submitted change passes all tests before incorporating it into the body of code. And we’d had trouble lately: one function’s result was based on random decisions, which meant it could return one of two possible values.

It took us a few days of wrongly failing builds to nail down the problem, which came from that one test. We determined that there were two right answers: two possible partitions of the graph. One split the graph into two components, and the other simply gave one large component.

Originally, the test checked this:

```python
ground_truth = set([frozenset([0, 1, 2, 3, 4, 5, 6, 7]),
                    frozenset([8, 9, 10, 11, 12, 13, 14, 15])])
assert_equal(result, ground_truth)
```


But ground_truth wasn’t necessarily correct, so we had to incorporate the other possibility. A fix was presented:

```python
assert(result in [ground_truth, set(frozenset(range(16)))])
```


With all its coins flipped the right way, the build passed. But passing wasn’t enough. Ten points to whoever already sees the problem.

Sets got strapped into Python later in its life. Frozensets are their immutable cousin. Calling set(frozenset(x)) is consequently the same as just calling set(x); it’s a typecast. Instead, we need set([frozenset(x)]), expressing a one-element set, but our syntax is looking rough. The cleanest way is some syntactic sugar introduced in Python 2.7: a set literal. The set literal syntax {foo} makes us less likely to make the mistake shown above. The easiest way, then, to express what we’re after is this:

```python
assert(result in [ground_truth, {frozenset(range(16))}])
```
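The distinction is easy to see in isolation; a minimal illustration:

```python
x = [0, 1, 2, 3]

# set(frozenset(x)) is just a typecast: the frozenset's *elements*
# become the set's elements.
flattened = set(frozenset(x))
assert flattened == {0, 1, 2, 3}

# {frozenset(x)} is a one-element set whose single member is the frozenset.
wrapped = {frozenset(x)}
assert wrapped == set([frozenset(x)])
assert len(wrapped) == 1

# The difference matters for membership tests like the one in the fix:
assert frozenset(x) in wrapped
assert frozenset(x) not in flattened
```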


When I submitted this pull request to the repository, I took out a nasty bug that had mis-marked a number of good PRs as failing. Knowing which builds properly do and don’t pass means less time manually sorting that out—the entire point of continuous integration. This means networkx can move forward in a more streamlined, complete way.

# OpenCycle Fertility Planning

Children are a long way off for me, given that I don’t know where I’ll wind up in a year. For many US families hoping to conceive, though, they’re too far off. The best fertility planning out there still isn’t good enough, so we’re applying machine learning to the problem.

We want a data-driven family planning technique that relies only on a simple, at-home measurement. The conventional guidance is called “three-over-six”: when a woman’s minimum basal body temperature (BBT) over the past three days has exceeded the maximum for the six days before, we can be reasonably sure that ovulation occurred three days ago. While BBT remains popular because results can be interpreted at home (rather than in a lab), the measurement is difficult. Movement raises BBT beyond the range of error, so a woman must be very still when measuring BBT, first thing in the morning. Naturally, there are a lot of missing data points, and there are likely also false data points from women who mis-measured or recorded only a guess.
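The three-over-six rule is easy to state in code. A hypothetical sketch (the function name and temperatures are illustrative, not from our dataset):

```python
def three_over_six(bbt):
    """Return True if the minimum BBT over the last 3 days exceeds the
    maximum over the 6 days before them, suggesting ovulation ~3 days ago.

    bbt: list of daily basal body temperatures, most recent last.
    """
    if len(bbt) < 9:
        return False  # not enough history to apply the rule
    last_three = bbt[-3:]
    prior_six = bbt[-9:-3]
    return min(last_three) > max(prior_six)

# A typical post-ovulatory shift of a few tenths of a degree Celsius:
temps = [36.4, 36.5, 36.4, 36.3, 36.5, 36.4, 36.8, 36.9, 36.8]
assert three_over_six(temps)

# A flat series never triggers the rule:
assert not three_over_six([36.4] * 9)
```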

Even if a woman does everything right, the three-over-six rule tells us that ovulation happened three days ago. Because fertility is at its peak during ovulation and declines afterward, couples seeking to have children lose three good days.

We’re working to use machine learning with an existing dataset of recorded menstrual cycle data to forecast the rise in BBT that signals ovulation. This gives more time to families who seek to have children—and limits risk for those who do not. We will explore time series models, deep learning, and various cross-validation techniques to ensure statistical validity of our results. As the data are already collected, we will not need to involve human subjects.

Because access to the data for this project is heavily restricted, I’m keeping the Github repo private for now. But anonymized graphs are definitely on the horizon. Expect lots of violin plots.

Update 2017-03-15: The repo is no longer private, so check it out. Incidentally, this work is conducted as part of the SMU Ubiquitous Computing Lab.

# Election Studies

What are the two hot topics of 2016? Celebrity deaths and elections. I’ll admit that I caught the bug, too. I’m working with two professors on dynamical (think differential equations) models of the American electorate.

A hallmark of twenty-first century American politics is that American voters have become “sorted” such that political views are now strong predictors of one another. Nowadays, one’s stance on birth control is more informative about one’s stance on gun control than would have been the case a generation ago. A Pew survey conducted in 2014 shows that among the politically active, uniformity of views within one’s party has grown. Why has this happened?

### Political Considerations

One unique feature of American politics is our two-party system, and the significant obstacles faced by any putative third party in our winner-take-all election system. Nevertheless, this system is as old as our nation itself, so other factors must be at play. Candidate mechanisms frequently encountered in the popular media include the proliferation of cable news networks with specific ideological slants, and the rise of social media, both of which can encourage an “echo chamber” effect in which one’s views are never challenged, but only reinforced. Sure, now you hear about it on the news all the time. But you didn’t when I wrote the fellowship application funding this work.

### Mathematical Considerations

These theories imply a dynamic relationship between voters, news organizations, and political party leadership, which recalls the mathematical concept of a “dynamical system”— a collection of inter-dependent variables governed by differential equations. Moreover, the increasing levels of polarization in the past two decades recall the mathematical concept of an “instability”—a self-reinforcing shift away from an equilibrium (i.e., the prior heterogeneous distribution of opinions). Finally, the transition away from heterogeneous, non-correlated opinions to a tightly-sorted set of opinions recalls the concept of a “bifurcation”—a fundamental structural change in the stability properties of a dynamical system induced by changes in an underlying system parameter.

### Methodology

As a starting place, one can picture each issue as an axis in a multidimensional vector space, and an individual’s total political position as a point in this space. Over time, the “sorting” described above should manifest itself as a coalescence of these points into two distinct clusters: one for each political party. Our first task in understanding the process will be to quantify the sorting using Principal Component Analysis, which can quantify the extent of the sorting over time, as well as identifying the dominant correlations among different opinions. Then, after a literature review of dynamical systems applied to held beliefs (like political stances), we will create models attempting to capture the sorting effect, run them using computer simulations, and compare the output to collected polling data. This process can be repeated to “tune” the models to fit the data. The hope is that the tuned model will reveal a fundamental system dynamic that sheds light on the causes of increased polarization in recent years.
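As a toy illustration of the PCA step (synthetic data, not our polling dataset): when opinions across issues become correlated along a single ideological axis, the first principal component soaks up most of the variance.

```python
import numpy as np

# Two "sorted" groups of 100 voters each, pulled in opposite directions
# along one shared ideological axis in a 5-issue opinion space.
rng = np.random.default_rng(0)
axis = np.ones(5) / np.sqrt(5)
group_a = 2.0 * axis + 0.3 * rng.standard_normal((100, 5))
group_b = -2.0 * axis + 0.3 * rng.standard_normal((100, 5))
X = np.vstack([group_a, group_b])
X -= X.mean(axis=0)  # center before PCA

# PCA via the SVD of the centered data matrix.
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)

# One component dominates: a quantitative signature of "sorting".
assert explained[0] > 0.85
```

In an unsorted electorate the opinion axes are uncorrelated, and the variance spreads nearly evenly across components; tracking that ratio over time is one way to quantify the coalescence.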

Some novel stuff we’re doing:

• A new reordering criterion for correlation matrices, based on the singular value decomposition.
• Using Jupyter notebooks instead of just scripts. “Literate programming” dates back to Knuth.

# Quickly Finding Reliable Communication Backbones

SMU’s algorithm engineering class has been facetiously called a “course in computational creativity.” This creativity was stretched into graph theory for the class’s final project: fast algorithms for graph coloring and finding communication backbones in wireless sensor networks.

My implementation of the smallest-last ordering for greedy coloring was my first contribution to Python’s networkx graph library.

With the increasing prevalence of IoT devices and Bluetooth beacons, finding reliable communication pathways is becoming a critical problem. Interference, contention, and routing each become complicated problems when wireless sensors are randomly distributed. This motivates the investigation of communication backbones which can serve as trunk lines for the communication.

Large-scale trunk lines for communicating in the USA

A model of the wireless sensor network is the random geometric graph (RGG). Nodes in the graph are connected if they are within a given radius r, which meshes with our intuitive understanding of wireless signal range. A really approachable primer on wireless sensor networks and RGGs is here.
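networkx can generate such a graph directly; a quick sketch with illustrative parameters:

```python
import math
import networkx as nx

# 50 nodes scattered uniformly in the unit square, connected whenever
# two nodes fall within radius r = 0.25 of each other.
G = nx.random_geometric_graph(50, 0.25, seed=42)

# Every edge respects the radius; positions live in the "pos" attribute.
pos = nx.get_node_attributes(G, "pos")
assert all(math.dist(pos[u], pos[v]) <= 0.25 for u, v in G.edges())
```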

A minimal (cheap) backbone communicates with as many nodes as possible, while including as few nodes as possible. A trick for finding the backbone is using the graph coloring—putting the nodes into as few buckets as possible, with no adjacent nodes in the same bucket. The buckets are independent sets. A single independent set in an RGG tries to cover as much of the space as possible, without letting the nodes communicate. Merging two independent sets lets them serve as relays for one another.

Nodes marked in red form the backbone with the greatest coverage, and black edges connect the backbone. The red edges connect a single node to all of its neighbors.

While graph coloring is an NP-complete problem, numerous heuristics have been proposed. My thesis advisor, David W. Matula, presented the smallest-last ordering. It repeatedly removes a node of least degree until the graph is empty. Along the way, you wind up with a terminal clique: every node left has the same degree, because they’re all connected to one another. The graph will need at least as many colors as the terminal clique size.

The terminal clique is discernible at the far left of the graph, where the slope is consistent.
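For reference, networkx now exposes smallest-last ordering through its greedy coloring interface; a quick sketch on a random geometric graph:

```python
import networkx as nx

# Greedy coloring driven by the smallest-last node ordering.
G = nx.random_geometric_graph(100, 0.2, seed=1)
coloring = nx.coloring.greedy_color(G, strategy="smallest_last")

# It's a proper coloring: adjacent nodes never share a color.
assert all(coloring[u] != coloring[v] for u, v in G.edges())
```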

Along the way, I learned some pretty cool tricks for speeding up the implementation:

• Fast random point generation on a sphere—avoids rejection sampling (which is not guaranteed to terminate) and transcendental functions. It produces a 25% speedup over the next best method.
• k-d trees for quickly determining the edges present in a random geometric graph. It beats the O(n^2) brute-force algorithm.
• Speeding up the smallest-last ordering by using a bucket queue data structure optimized by tracking the lower bound on degree. The naive algorithm is O(n^2); this is O(n+m). I added mine to networkx, so I anticipate seeing it show up in future students’ projects without attribution. Shrug.
• Computing faces on the sphere using the Euler characteristic, instead of by counting.
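One common way to generate uniform points on a sphere without a rejection loop is to normalize i.i.d. Gaussian samples. This is a sketch of that general idea, not necessarily the exact method used in the project (under the hood, the Gaussian sampler may still use transcendental functions):

```python
import numpy as np

def random_points_on_sphere(n, rng):
    """n uniform points on the unit sphere: rotational symmetry of the
    Gaussian means normalized samples land uniformly on the surface."""
    v = rng.standard_normal((n, 3))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

pts = random_points_on_sphere(1000, np.random.default_rng(0))
assert np.allclose(np.linalg.norm(pts, axis=1), 1.0)
```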

My full writeup for the project is here, and the code is here.