We all pick our hill to die on

And mine is evaluation metrics. Don’t be obstinate in choosing them, and don’t report only the ones that satisfy your hypotheses.

I’m picking this fight ten years too late as a project on word sense induction (WSI) has led me to a pivotal paper on evaluating WSI—a pivotal paper which irks me.


The work is Agirre & Soara (2007). They performed an unsupervised evaluation of WSI as a clustering problem. My mind leaps to the popular measure normalized mutual information (NMI) or its improvement adjusted mutual information (AMI). Granted, AMI was unveiled in 2009, so it’s fair that AMI wasn’t picked. But the choices they did make were weird.

The chosen measures were F-score, entropy, and purity. And the authors decided that F-score would be their golden boy. The others are merely included “for the sake of completeness.”

  • Purity: how much do clusters contain elements of just one class? Higher ↔ better
  • Entropy: how…mixed-up…are the classes and clusters? Lower ↔ better; suggests more agreement between classes and clusters
  • F-score: balances precision and recall. Popular in NLP. Weird here, and demonstrably flawed. Higher ↔ better.

Hoo boy.

A meta-evaluation

Let’s look at how the scores agree with one another. We’ll use a special correlation coefficient called Spearman’s rho ($\rho$), designed to compare two sets of rankings. Like Pearson’s $r$, values close to 0 mean no signal, and values close to ±1 have high agreement. (A $\rho$ of -1 would suggest that one rank is the reverse of the other.)

System F-score Purity Entropy
1c1word* 78.9 79.8 45.4
UBC-AS 78.7 80.5 43.8
ubv_si 66.3 83.8 33.2
UMND2 66.1 81.7 40.5
I2R 63.9 84.0 32.8
UofL 61.5 82.2 37.8
UOY 56.1 86.1 27.1
Random* 37.9 86.1 27.7
1c1inst* 9.5 100.0 0.0

Systems are ordered by their F-scores in the paper, and we stick to that convention. Anything with a star beside it is a baseline.

The meat

Also, it seems like a really low move to include 1c1inst, which induces the pathological behavior in those measures. (Purity and entropy are known for assessing the homogenieity, but not the parsimony of the clustering. By contrast, a measure like NMI also favors low numbers of clusters.) Obviously a cluster will be pure if it contains only a single element. I know they’re bad, but they’re better than F-score.

In [44]: df
   f_score  purity  entropy
0     78.9    79.8     45.4
1     78.7    80.5     43.8
2     66.3    83.8     33.2
3     66.1    81.7     40.5
4     63.9    84.0     32.8
5     61.5    82.2     37.8
6     56.1    86.1     27.1
7     37.9    86.1     27.7

In [45]: df.corr(method='spearman')
          f_score    purity   entropy
f_score  1.000000 -0.874267  0.857143
purity  -0.874267  1.000000 -0.994030
entropy  0.857143 -0.994030  1.000000

Purity and entropy almost perfectly predict one another! F-score doesn’t agree to well with either of them, unfortunately. (Thus my initial derision toward this evaluation.) And when I include that awful 1c1inst baseline, the scores don’t change too much.

In [47]: df2 = df.append([[9.5, 100, 0]], ignore_index=True)

In [48]: df2.corr(method='spearman')
          f_score    purity   entropy
f_score  1.000000 -0.912142  0.900000
purity  -0.912142  1.000000 -0.995825
entropy  0.900000 -0.995825  1.000000

F-score looks like it agrees a little more. Entropy and purity agree just about as well as they already did. If those two measures are so happy, why put all your money on F-score?

The external

We see why when we compare these rankings to the results on the supervised tests.

System Rank
I2R 1
upv_si 3
UofL 7
In [62]: df3
   f_score  purity  entropy  external
1     78.7    80.5     43.8         5
2     66.3    83.8     33.2         3
3     66.1    81.7     40.5         2
4     63.9    84.0     32.8         1
5     61.5    82.2     37.8         7
6     56.1    86.1     27.1         6

In [63]: df3.corr(method='spearman')
           f_score    purity   entropy  external
f_score   1.000000 -0.714286  0.714286 -0.371429
purity   -0.714286  1.000000 -1.000000 -0.028571
entropy   0.714286 -1.000000  1.000000  0.028571
external -0.371429 -0.028571  0.028571  1.000000

Ah, F-score: the redemption. On the small set of systems here, there was no indication of supervised performance correlating with purity or entropy; it’s all double dutch. F-score actually does…worth mentioning, though not great.

To evaluate the clusterings with other metrics would be nice because they’re actually used outside of just information retrieval. Then again, this may make us the fabled “well-meaning, but misguided scientist successfully adapting her/his experiment until the data shows the hypothesis she/he wants it to show”. It’d also be nice to evaluate with a significance bootstrapping process—everyone loves error bars.

Closing thoughts

I’d really like to evaluate these clusters with a known, effective measure like AMI or ARI—ones that encourage parsimony. That would mean engineering all of the shared task systems or asking the original authors for decades-old code. Neither is very likely. But the intrinsic evaluation measures used here (while easy to interpret, or familiar) are dated and have pathological flaws. People don’t use these to benchmark clustering algorithms, so they shouldn’t use them in other extrinsic cluster eval situations.

EDIT: And to top it off, I found a recent paper that cites the flaws-in-F-score paper above, and then proceeds to just use F-score anyway. It cites clustering literature, but only for definitions of basic things like “network”.


  1. Eneko Agirre and Aitor Soara. 2007. Semeval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, pages 7–12.
Written on December 6, 2017