We all pick our hill to die on
And mine is evaluation metrics. Don’t be obstinate in choosing them, and don’t report only the ones that satisfy your hypotheses.
I’m picking this fight ten years too late as a project on word sense induction (WSI) has led me to a pivotal paper on evaluating WSI—a pivotal paper which irks me.
The work is Agirre & Soara (2007). They performed an unsupervised evaluation of WSI as a clustering problem. My mind leaps to the popular measure normalized mutual information (NMI) or its improvement adjusted mutual information (AMI). Granted, AMI was unveiled in 2009, so it’s fair that AMI wasn’t picked. But the choices they did make were weird.
The chosen measures were F-score, entropy, and purity. And the authors decided that F-score would be their golden boy. The others are merely included “for the sake of completeness.”
- Purity: how much do clusters contain elements of just one class? Higher ↔ better
- Entropy: how…mixed-up…are the classes and clusters? Lower ↔ better; suggests more agreement between classes and clusters
- F-score: balances precision and recall. Popular in NLP. Weird here, and demonstrably flawed. Higher ↔ better.
Let’s look at how the scores agree with one another. We’ll use a special correlation coefficient called Spearman’s rho ($\rho$), designed to compare two sets of rankings. Like Pearson’s $r$, values close to 0 mean no signal, and values close to ±1 have high agreement. (A $\rho$ of -1 would suggest that one rank is the reverse of the other.)
Systems are ordered by their F-scores in the paper, and we stick to that convention. Anything with a star beside it is a baseline.
Also, it seems like a really low move to include 1c1inst, which induces the pathological behavior in those measures. (Purity and entropy are known for assessing the homogenieity, but not the parsimony of the clustering. By contrast, a measure like NMI also favors low numbers of clusters.) Obviously a cluster will be pure if it contains only a single element. I know they’re bad, but they’re better than F-score.
In : df Out: f_score purity entropy 0 78.9 79.8 45.4 1 78.7 80.5 43.8 2 66.3 83.8 33.2 3 66.1 81.7 40.5 4 63.9 84.0 32.8 5 61.5 82.2 37.8 6 56.1 86.1 27.1 7 37.9 86.1 27.7 In : df.corr(method='spearman') Out: f_score purity entropy f_score 1.000000 -0.874267 0.857143 purity -0.874267 1.000000 -0.994030 entropy 0.857143 -0.994030 1.000000
Purity and entropy almost perfectly predict one another! F-score doesn’t agree to well with either of them, unfortunately. (Thus my initial derision toward this evaluation.) And when I include that awful 1c1inst baseline, the scores don’t change too much.
In : df2 = df.append([[9.5, 100, 0]], ignore_index=True) In : df2.corr(method='spearman') Out: f_score purity entropy f_score 1.000000 -0.912142 0.900000 purity -0.912142 1.000000 -0.995825 entropy 0.900000 -0.995825 1.000000
F-score looks like it agrees a little more. Entropy and purity agree just about as well as they already did. If those two measures are so happy, why put all your money on F-score?
We see why when we compare these rankings to the results on the supervised tests.
In : df3 Out: f_score purity entropy external 1 78.7 80.5 43.8 5 2 66.3 83.8 33.2 3 3 66.1 81.7 40.5 2 4 63.9 84.0 32.8 1 5 61.5 82.2 37.8 7 6 56.1 86.1 27.1 6 In : df3.corr(method='spearman') Out: f_score purity entropy external f_score 1.000000 -0.714286 0.714286 -0.371429 purity -0.714286 1.000000 -1.000000 -0.028571 entropy 0.714286 -1.000000 1.000000 0.028571 external -0.371429 -0.028571 0.028571 1.000000
Ah, F-score: the redemption. On the small set of systems here, there was no indication of supervised performance correlating with purity or entropy; it’s all double dutch. F-score actually does…worth mentioning, though not great.
To evaluate the clusterings with other metrics would be nice because they’re actually used outside of just information retrieval. Then again, this may make us the fabled “well-meaning, but misguided scientist successfully adapting her/his experiment until the data shows the hypothesis she/he wants it to show”. It’d also be nice to evaluate with a significance bootstrapping process—everyone loves error bars.
I’d really like to evaluate these clusters with a known, effective measure like AMI or ARI—ones that encourage parsimony. That would mean engineering all of the shared task systems or asking the original authors for decades-old code. Neither is very likely. But the intrinsic evaluation measures used here (while easy to interpret, or familiar) are dated and have pathological flaws. People don’t use these to benchmark clustering algorithms, so they shouldn’t use them in other extrinsic cluster eval situations.
EDIT: And to top it off, I found a recent paper that cites the flaws-in-F-score paper above, and then proceeds to just use F-score anyway. It cites clustering literature, but only for definitions of basic things like “network”.
- Eneko Agirre and Aitor Soara. 2007. Semeval-2007 Task 02: Evaluating Word Sense Induction and Discrimination Systems. In Proceedings of the 4th International Workshop on Semantic Evaluations, SemEval ’07, pages 7–12.