Small differences in BLEU are meaningless

source link: https://ehudreiter.com/2020/07/28/small-differences-in-bleu-are-meaningless/

Last week we read an excellent paper in our reading group: Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics (https://www.aclweb.org/anthology/2020.acl-main.448.pdf), by Mathur, Baldwin, and Cohn. The paper makes a lot of good points, but the one that really struck me was that small differences in evaluation metrics such as BLEU are probably meaningless. This is striking because ACL and other “selective” and “prestigious” venues are happy to accept papers on the basis of small improvements in metric scores.

Only big differences in metric scores are meaningful in MT

Mathur et al. use data from WMT, a long-running annual machine translation evaluation campaign in which (amongst other things) a large set of MT systems are evaluated both by human judges and by metrics (including BLEU). Overall, metric scores usually correlate reasonably well with human evaluations in MT (with some caveats), which supports the use of metrics as proxies for human evaluation in machine translation (not in NLG!!).

Anyway, the authors look at the WMT data and point out that WMT evaluations include systems with very different quality levels. They then point out that the Pearson correlation used to compare human evaluations to metric evaluations is primarily driven by large differences and outliers. That is, as long as a metric such as BLEU can reliably distinguish MT systems which people think are excellent from MT systems which people think are dreadful, the metric will show a high Pearson correlation with human evaluations.
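To make this concrete, here is a toy illustration (mine, not from the paper; all numbers are invented) of how a couple of extreme systems can dominate a Pearson correlation even when the metric carries no signal among systems of similar quality:

```python
# Toy demonstration (invented data): Pearson correlation between a "metric"
# and human scores can look strong overall even when the metric is pure
# noise among systems of similar quality, provided it separates the
# extremes correctly.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical human scores: eight similar mid-quality systems plus
# one dreadful (10) and one excellent (90) system.
human = np.array([50, 51, 52, 53, 54, 55, 56, 57, 10, 90], dtype=float)

# Hypothetical metric: matches humans on the two extreme systems,
# but is random noise for the similar mid-quality ones.
metric = np.where((human < 20) | (human > 80),
                  human,
                  rng.uniform(20, 30, human.shape))

r_all = np.corrcoef(human, metric)[0, 1]
mid = (human >= 20) & (human <= 80)
r_mid = np.corrcoef(human[mid], metric[mid])[0, 1]

print(f"Pearson r over all systems:          {r_all:.2f}")  # high, outlier-driven
print(f"Pearson r over similar systems only: {r_mid:.2f}")  # much weaker: noise
```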

However, in academic contexts we are usually interested in small differences in quality (e.g., is a proposed model slightly better than the state of the art?), and Mathur et al. show that BLEU is **not** good at predicting the result of human evaluations when the difference in BLEU scores is small. They essentially compute how well the difference in BLEU scores between two systems predicts the difference in their human evaluations (a toy version of this analysis is sketched after the list below), and conclude that:

  • If System A has a BLEU score that is 1-2 points higher than System B (a gap typical of academic papers), then there is only a 50% chance that human evaluators will prefer System A over System B; in other words, no better than a coin flip.
  • If System A has a BLEU score that is 3-5 points higher than System B, there is a 75% chance that human evaluators will prefer A over B.
  • In order to get a 95% chance that human evaluators will prefer A over B, we need something like a 10-point improvement in BLEU (they don't state this explicitly; I am estimating it by eyeballing their graphs).
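Here is a minimal sketch of the shape of that pairwise analysis, using invented (BLEU, human) scores rather than the paper's WMT data: for every pair of systems, check whether BLEU and the human scores rank the pair the same way, then bin the agreement rate by the size of the BLEU gap.

```python
# Minimal sketch (invented scores, not the WMT data) of the pairwise
# analysis: does the sign of the BLEU difference between two systems
# predict the sign of the human-score difference, and how does that
# depend on the size of the BLEU gap?
from itertools import combinations
import numpy as np

# Hypothetical (BLEU, human) score pairs for five systems.
systems = {
    "A": (34.1, 72.0), "B": (33.2, 71.5), "C": (30.8, 70.1),
    "D": (28.0, 65.3), "E": (22.5, 58.9),
}

bins = {"gap < 2": [], "2 <= gap < 5": [], "gap >= 5": []}
for (b1, h1), (b2, h2) in combinations(systems.values(), 2):
    gap = abs(b1 - b2)
    agree = (b1 - b2) * (h1 - h2) > 0  # BLEU and humans rank the pair the same way
    key = "gap < 2" if gap < 2 else ("2 <= gap < 5" if gap < 5 else "gap >= 5")
    bins[key].append(agree)

for key, agrees in bins.items():
    if agrees:
        print(f"BLEU {key}: agreement with humans in "
              f"{100 * np.mean(agrees):.0f}% of {len(agrees)} pairs")
```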

Mathur et al. look at several other metrics as well, and find the same pattern. Across the board, a large difference in a metric score between two systems is probably meaningful (i.e., if MT system A has a much higher metric score than MT system B, human evaluators will probably rate A higher than B), but a small difference is not.

Inappropriate use of BLEU and other metrics

The reason this is a problem is that many (most?) academic papers in NLP justify the claim that a proposed model or algorithm is better than the state of the art on the basis of quite small differences in metric scores. It is very rare, at least in my experience, to see a paper which shows a 10-point improvement in BLEU over the state of the art, which (as above) seems to be what you need in order to be 95% confident that your proposed model would genuinely be seen by users as an improvement.

In short, there are contexts in machine translation where BLEU and other metrics can serve as plausible proxies for human evaluation.   However, the typical academic use of metrics (as above) is ***NOT*** one of these contexts; it is not scientifically valid to claim that a new model is better than state-of-the-art because of a small difference in metric score.

Wish list for the future

What I would love to see in the future is the following.

  1. All metrics are carefully characterised so that we know when they reliably predict human evaluations and when they do not. In particular, there should be clear guidance about how much of a difference in metric score is needed to give confidence that the systems being compared are truly different (see the sketch after this list for one existing partial tool).
  2. Researchers and paper authors only use metrics to justify claims when the criteria in (1) are met.   Reviewers reject papers which make claims that are not justified under the criteria.
  3. High-quality human evaluations are common and indeed expected for top-rank papers.  In MT, the expectation is that such human evaluations will be at least as good as those done in WMT.
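On point (1), one long-standing partial tool is paired bootstrap resampling (Koehn, 2004), which estimates whether a BLEU gap between two systems is bigger than test-set sampling noise; recent versions of sacrebleu include such paired significance tests. Below is a minimal hand-rolled sketch (the function name and data are illustrative, not from the paper). Note that this only addresses the statistical significance of the metric difference; as Mathur et al. show, even a statistically solid BLEU gap may not predict what human evaluators prefer.

```python
# Sketch of paired bootstrap resampling (Koehn, 2004) for a BLEU gap.
# sys_a, sys_b, refs are parallel lists of detokenized sentence strings.
import random
import sacrebleu  # pip install sacrebleu

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=0):
    """Fraction of bootstrap resamples of the test set in which system A
    out-scores system B on corpus BLEU. Values near 1.0 suggest the gap
    is robust to test-set sampling noise (but say nothing about humans)."""
    rng = random.Random(seed)
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        # Resample test-set sentence indices with replacement.
        sample = [rng.randrange(n) for _ in range(n)]
        a = [sys_a[i] for i in sample]
        b = [sys_b[i] for i in sample]
        r = [refs[i] for i in sample]
        if sacrebleu.corpus_bleu(a, [r]).score > sacrebleu.corpus_bleu(b, [r]).score:
            wins += 1
    return wins / n_samples

# Usage (with your own hypothesis/reference lists):
#   frac = paired_bootstrap(outputs_a, outputs_b, references)
#   print(f"A beats B in {100 * frac:.0f}% of resamples")
```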

Perhaps I am an optimist, but I do think that we are slowly moving in the above direction.   It will take time, but hopefully we will see real change over the next 5-10 years.

