Ehud Reiter's Blog

Ehud's thoughts and observations about Natural Language Generation

Craig Thomson and I have just finished running a shared task on evaluating accuracy (i.e., finding factual mistakes) in texts produced by data-to-text neural NLG systems. The shared task will be presented at INLG 2021, and our summary paper is on arXiv, with datasets on GitHub. I think finding accuracy errors is a very important and interesting task, and I encourage other people to “have a go” using the data on the GitHub site!

Anyways, the shared task gave us insights into both the mistakes made by neural NLG systems and the mistakes that neural evaluation techniques found hard to detect. Among other things, we saw that neural NLG systems struggled with some words which have fairly clear rule-based definitions. I give some examples below.

The Task

First of all, I should explain the shared task. The goal was to find factual errors in summaries of basketball games which were produced from basketball box score data by neural NLG systems. Below is an extract from such a text, which has been manually annotated for factual errors (full details are given in a previous blog). Errors are underlined. The data for this game is available on basketball-reference.com.

The Memphis Grizzlies (5-2) defeated the Phoenix Suns (3 – 2) Monday 102-91 at the Talking Stick Resort Arena in Phoenix. The Grizzlies had a strong first half where they out-scored the Suns 59–42. Marc Gasol scored 18 points, leading the Grizzlies. Isaiah Thomas added 15 points.

This example shows different types of errors:

Incorrect numbers: For example, 59–42 should be 46-52.
Incorrect names: For example, Talking Stick Resort Arena should be US Airways Center.
Incorrect word: The Grizzlies did not out-score the Suns.
Context error: Isaiah Thomas played for the Suns, but the above contextually implies he played for the Grizzlies.

Participants in the shared task were given 60 manually annotated texts for training and development; we held back a test set of an additional 30 texts. Texts were around 300 words long on average and contained 20 errors on average, which (if I put on my “commercial” hat) is far too high for a real-world sports journalism application!

In the rest of this blog, I will focus on incorrect word errors. The others are also interesting; you can learn more about them in our paper.

Example: led

The most common incorrect word error in the training set was “led”. “Led” is interesting because it can be used in many different ways (“the team led at the half”, “player X led his team”, etc.) and also because its meaning can sometimes be fuzzy or vague. For example, when comparing two players A and B, if A scored slightly more points than B but B had many more rebounds and assists, we might say that player B “led” the team.

Because of this fuzziness, we hoped that neural NLG systems could learn how to use the word appropriately. But this was not the case: our systems made many mistakes, many of which were blatant (e.g., saying that a team “led at the half” when it was behind).
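To make the “blatant” case concrete, here is a minimal sketch of a rule that checks only the unambiguous team-level sense of “led at the half”. The function name is my own invention for illustration, and I am assuming that the corrected first-half score of 46-52 from the example above is listed as Grizzlies first, Suns second. The fuzzy player-level sense of “led” is deliberately not covered.

```python
def led_at_half(team_half_points: int, opponent_half_points: int) -> bool:
    """Return True only if the team was strictly ahead at half-time.

    Hypothetical helper: a tie does not count as leading.
    """
    return team_half_points > opponent_half_points


# First-half score from the annotated example (assumed order: Grizzlies, Suns).
grizzlies_half, suns_half = 46, 52

# A text claiming the Grizzlies led at the half would be flagged as an error.
print(led_at_half(grizzlies_half, suns_half))
print(led_at_half(suns_half, grizzlies_half))
```

A check like this only needs the two half-time totals from the box score, which is why it is frustrating to see generated texts get it wrong.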

Example: double-double

The second most common incorrect word error in the training set was “double-double”. A double-double occurs when a basketball player has ten or more (double-digits) in exactly two of the following categories: points, rebounds, assists, steals, and blocks. Note that if a player has ten or more in three of the categories, this is called a triple-double (3 statistics in double-digits) rather than a double-double.

In any case, while double-double is easy to define via rules, it seemed to be a difficult concept for our neural NLG systems to learn.
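The definition above can be sketched as a rule in a few lines. This is just an illustration under my own assumptions: the dictionary keys and function names are invented here, the stat lines are made up, and I follow the wording above literally (exactly two categories for a double-double, three for a triple-double).

```python
# The five box-score categories named in the definition.
CATEGORIES = ("points", "rebounds", "assists", "steals", "blocks")


def double_digit_count(stats: dict) -> int:
    """Count how many of the five categories are at ten or more."""
    return sum(1 for category in CATEGORIES if stats.get(category, 0) >= 10)


def is_double_double(stats: dict) -> bool:
    """Exactly two categories in double digits."""
    return double_digit_count(stats) == 2


def is_triple_double(stats: dict) -> bool:
    """Three categories in double digits."""
    return double_digit_count(stats) == 3


# Hypothetical stat lines, not taken from any real game.
player_x = {"points": 18, "rebounds": 11, "assists": 4, "steals": 1, "blocks": 2}
player_y = {"points": 12, "rebounds": 10, "assists": 10, "steals": 0, "blocks": 0}
player_z = {"points": 25, "rebounds": 7, "assists": 3, "steals": 2, "blocks": 0}

for name, line in [("X", player_x), ("Y", player_y), ("Z", player_z)]:
    print(name, is_double_double(line), is_triple_double(line))
```

Missing categories are treated as zero, which is a design choice of this sketch rather than anything implied by the data format.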

Example: only other

The above examples refer to corpus texts. If we look at the submissions to the shared task, they struggled to detect certain kinds of errors, including the use of “only other” in statements such as “The only other Net to reach double figures in points was Ben McLemore.” Note that this usage of “only other” suggests that (A) McLemore scored at least 10 points, (B) other Net players scored at least 10 points, and (C) all of these other Net players were previously mentioned in the text.

In other words, “only other” has a clear rule-based definition, but it is complex, and depends on what was previously mentioned in the text and the performance of other players as well as the performance of the player in question. This seems to be difficult for neural systems to learn.
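Conditions (A) to (C) above can be written down as a rule, which shows why the check depends on both the team's data and the discourse history. The sketch below is only my illustration: the function name, the argument layout, the player labels, and the point totals are all hypothetical, and I interpret (C) as “every other teammate who reached the threshold has already been mentioned”.

```python
def only_other_is_valid(player, team_points, mentioned, threshold=10):
    """Check a claim like 'The only other <team> player to reach double
    figures in points was <player>'.

    team_points: dict mapping each player on the team to points scored.
    mentioned:   names of teammates already mentioned earlier in the text.
    """
    # Everyone on the team who reached the threshold.
    scorers = {name for name, pts in team_points.items() if pts >= threshold}
    others = scorers - {player}
    return (
        player in scorers              # (A) the player reached double figures
        and len(others) > 0            # (B) at least one teammate did too
        and others <= set(mentioned)   # (C) all of them were already mentioned
    )


# Hypothetical team and discourse history, not real data.
team = {"Player A": 22, "Player B": 14, "Player C": 11, "Player D": 6}

print(only_other_is_valid("Player C", team, ["Player A", "Player B"]))
print(only_other_is_valid("Player C", team, ["Player A"]))
print(only_other_is_valid("Player D", team, ["Player A", "Player B", "Player C"]))
```

The second call fails because Player B reached double figures but has not been mentioned yet, so “only other” would be wrong at that point in the text.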

Lexical choice

I once wrote a blog entitled Lexical Choice Needs Machine Learning, where I argued that word (lexical) choice in NLG should in part be learnt from data. I still believe this, but the above examples suggest that current neural NLG approaches are not sufficient for lexical choice. We need better ML approaches and/or to allow some words to be defined by rules. Perhaps there is a lesson here for other NLG tasks as well.

Difficult Words for Neural NLG Systems
