

Why doesn't BLEU work for NLG?
source link: https://ehudreiter.com/2018/07/02/why-bleu-poor-for-nlg/

Ehud Reiter's Blog
Ehud's thoughts and observations about Natural Language Generation
I recently wrote a paper on a structured review of the validity of BLEU, where I brought together evidence from previously published studies on how well BLEU correlates with human evaluations. One of my main conclusions was that BLEU was much better at evaluating MT systems than NLG systems. A few people have since asked me why I thought this was the case. Below are some thoughts; these are speculations rather than proven facts!
I should add that many of the papers I surveyed made similar points, including Espinosa et al 2010, Liu et al 2016, and Reiter and Belz 2009.
Text quality
MT systems are getting better, but the output of a good MT system is still inferior to a human translation. NLG systems, in contrast, typically aim to produce texts of near-human, or even better-than-human, quality (eg, Reiter et al 2005). This is partially because there is little interest in using NLG to produce moderate quality texts, since these can be generated using templates.
BLEU is based on comparing computer-generated texts to human-written “reference” texts, and assumes that the closer the computer text is to the reference text, the better. This assumption is clearly incorrect if the computer-generated texts are *better* than the human-written reference texts! More generally, I suspect that any metric which is based on comparing computer-generated texts to human-written texts will be dubious if the computer texts are of near-human as well as better-than-human quality.
Text variability
Information can be expressed in many different ways by an NLG system. To take a very simple example, the following are all acceptable ways of describing a “purchase” event:
Yesterday John bought a book at the bookstore.
John purchased a book at the bookstore yesterday.
The bookstore sold John a book on 1 July.
(etc)
Even with this very simple message, we can vary modifier placement (“yesterday”), replace words with synonyms (“bought” vs “purchased”), change the temporal reference strategy (“yesterday” vs “1 July”), and paraphrase (“John bought” vs “The bookstore sold”). So even this simple message can be expressed in dozens of ways, and a narrative which communicates ten messages can probably be expressed in thousands (millions?) of different ways.
This is a problem for BLEU, since it effectively looks for matching ngrams in the generated and reference texts. Even if multiple reference texts are provided, they are unlikely to cover all, or even most, of the above variations.
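As a rough illustration, here is a minimal sketch of BLEU (modified n-gram precisions up to 4-grams, geometric mean, brevity penalty, no smoothing) applied to the purchase-event variants above; the lowercased tokenisation and the code itself are my own simplification, not a reference implementation.

```python
# Minimal BLEU sketch: modified n-gram precisions (up to 4-grams),
# geometric mean of the precisions, brevity penalty, no smoothing.
from collections import Counter
import math

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(hypothesis, reference, max_n=4):
    hyp, ref = hypothesis.split(), reference.split()
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp, n))
        ref_counts = Counter(ngrams(ref, n))
        # "Modified" precision: clip each hypothesis n-gram count
        # by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(overlap / max(sum(hyp_counts.values()), 1))
    if min(precisions) == 0:
        return 0.0  # one unmatched n-gram order zeroes the whole score
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

reference = "yesterday john bought a book at the bookstore"
print(bleu("john purchased a book at the bookstore yesterday", reference))  # ~0.56
print(bleu("the bookstore sold john a book on 1 july", reference))          # 0.0
```

Both hypotheses express the same message as the reference, yet the reordered synonym variant scores only about 0.56, and the paraphrase scores exactly 0 because it shares no trigram with the reference.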
An obvious question is why this isn't also an issue for MT; after all, there are many acceptable ways of translating a sentence. I don't have a good answer to this, although I wonder if BLEU's bias against rule-based systems is partially because their output is more variable than that of statistical/neural systems.
Variation to keep text interesting
In many contexts, human readers want texts to be varied; they do not want to see the same words and syntactic constructs repeated again and again. Hence varying the way information is communicated is appreciated by human readers, and increases their satisfaction; this is also standard advice to human writers. However, such variation *decreases* ratings from BLEU and other metrics, which tend to reward systems which are repetitive and use “preferred” wording and syntax 100% of the time.
I suspect this is a relatively minor issue compared to the previous ones, but I think it is interesting because it is a very clear example of a case where human preference is pretty much the opposite of BLEU’s preferences; systems that vary texts get higher human evaluation scores but lower BLEU scores.
Evolution
Being very speculative, I suspect that MT systems have evolved to have good BLEU scores, since a good BLEU score is very important for research success in MT; I mean this in the Darwinian sense that approaches which produce good BLEU scores get more publications and funding than approaches with poor BLEU scores, regardless of their respective human evaluations. This is one of the reasons why BLEU-human correlations for MT systems have increased over time. A good BLEU score has been much less important in NLG, hence there has been less “evolutionary pressure” in NLG against approaches that lead to poor BLEU scores.
Other ideas?
If readers have other suggestions as to why BLEU is poorly suited to evaluating NLG systems, please let me know (or add a comment to this blog); I’m very interested in knowing other people’s thoughts on this!