
Ehud Reiter's Blog

Ehud's thoughts and observations about Natural Language Generation

For the past 5 years I have taught a course on Evaluation of AI (CS5063) to Aberdeen MSc students. People occasionally ask me about the course, so I thought I’d give a summary here.

CS5063 covers the following topics, which I elaborate on below.

  • AI engineering and quality assurance
  • Human evaluation
  • Metric (automatic) evaluation
  • Basic statistics for evaluation
  • Current research on evaluation
  • Commercial perspective on evaluation

I’d also like to say that I’m very grateful to James Forrest for his help in running practical exercises and tutorials in CS5063!

AI Engineering and Quality Assurance

I start CS5063 by talking about software engineering of AI systems, and the challenges this raises, especially in requirements analysis (since clients often have no clue what AI systems can do) and dataset design. I point out that most of the problems I have seen with AI systems are due to getting the requirements wrong, inappropriate datasets, and/or lack of robustness when the world changes (domain shift). *None* of these problems are detected by evaluation on held-out test sets, so we need to go beyond these if we want to evaluate utility in real-world settings.

I then review software quality assurance and, for their first assessment, ask the students to choose an AI system, create 50 test cases (in the software-testing sense) of varying difficulty for that system, run the test cases, and report the results. I’m very pleased with this exercise: most students take it seriously and do a good job. They usually find that the AI systems can do very well on some hard test cases while still failing a few easy test cases.
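
To make this concrete, here is a minimal sketch (in R, which we use later in the course for stats) of how the results of such an exercise could be recorded and summarised; the data frame, difficulty labels and pass/fail outcomes are made up purely for illustration.

  # Hypothetical (and much smaller) table of test-case results for an AI system.
  # Each test case has a difficulty level and a pass/fail outcome.
  results <- data.frame(
    id         = 1:6,
    difficulty = c("easy", "easy", "medium", "medium", "hard", "hard"),
    passed     = c(TRUE, FALSE, TRUE, TRUE, TRUE, FALSE)
  )

  # Pass rate broken down by difficulty level; this is where surprising
  # patterns show up (e.g. failing easy cases while passing hard ones).
  aggregate(passed ~ difficulty, data = results, FUN = mean)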

My favourite example this year was from a student who looked at Grammarly. Grammarly did a good job of finding mistakes in complex sentences which used specialist terminology. But then it failed on the following “easy” test case:

  • Input: I typed it in correctly.
  • Expected output: I typed it in correctly. (no change)
  • Grammarly’s suggested revision: I typed it incorrectly. (“in correctly” -> “incorrectly”)

In other words, Grammarly wanted to rewrite a correct sentence into an alternate form which meant the opposite of the original sentence!

Most of the systems exhibited similar behaviour. For example, several students investigated object recognition systems and found that they did very well in some difficult cases (complex scenes, suboptimal lighting) and then failed to recognise a clear photo of an object in a simple scene with good lighting. Perhaps this kind of behaviour is intrinsic to neural AI approaches?

Human Evaluation

We then look at human evaluations of AI systems. The focus is on good experimental design and techniques for doing relatively simple and straightforward human evaluations; we also discuss research ethics and ethical approval. Students do some simple human evaluations in lab sessions, for example running an experiment where they ask their classmates to compare the output of two machine translation systems on a text (and language pair) of their choice.
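
As a rough sketch of how the results of such a pairwise comparison could be analysed, the R snippet below runs a simple binomial (sign) test on preference counts; the two systems and the counts are hypothetical.

  # Hypothetical outcome of asking 30 classmates which translation they prefer:
  # 19 preferred system A and 11 preferred system B (ties excluded for simplicity).
  prefer_a <- 19
  prefer_b <- 11

  # Two-sided binomial test of whether preferences differ from a 50/50 split.
  binom.test(prefer_a, prefer_a + prefer_b, p = 0.5)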

I keep this simple because human evaluation is new to most of the students; they have all done metric evaluation in a preceding course on ML, and most will have done software engineering and testing as undergrads, but relatively few will have previously done structured evaluations with human subjects.

In their second assessment, students work in groups to run a human evaluation where they compare two AI systems of their choice (comparing virtual assistants such as Alexa and Siri on a particular use case is a popular choice). I think this is a good learning experience, and some students do it very well; however, others struggle, which shows that learning how to do even a simple human evaluation is not easy!

Unlike the test case assessment, the results of the human evaluations are usually what the students expect. Perhaps this is because they choose AI systems (and use cases) which they are familiar with, so they have good expectations about how well the systems will perform.

Basic Statistics

I try to teach the students basic statistics using R; this is combined with the human evaluation part of the course. I focus on basic tests: the t-test, chi-square, and Pearson correlation. In all honesty, I find this part of the course frustrating. Even after listening to my lectures and doing my lab exercises, many students still struggle to do basic stats.
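
For the curious, the level of R the students are expected to reach is roughly that of the sketch below; the ratings and preference counts are made-up illustrative data.

  # Made-up quality ratings (1-5) given by the same 8 judges to two systems.
  sys_a <- c(4, 5, 3, 4, 4, 5, 2, 4)
  sys_b <- c(3, 3, 4, 2, 3, 4, 3, 3)

  t.test(sys_a, sys_b, paired = TRUE)  # paired t-test: do mean ratings differ?
  cor.test(sys_a, sys_b)               # Pearson correlation between the two sets of ratings

  # Chi-square goodness-of-fit test on preference counts (prefer A / prefer B / tie).
  prefs <- c(preferA = 18, preferB = 7, tie = 5)
  chisq.test(prefs)                    # do the counts deviate from a uniform split?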

I wonder if I should drop R (which most of the students don’t know) and instead do stats in Excel (which the students do know). I am reluctant to drop R because Excel supports only a limited range of statistics, whereas R of course supports a huge variety of statistical tests; i.e., if students learn R they have much more power available to them. But on the other hand, maybe the students would learn statistical concepts better if they were not simultaneously learning a new scripting language? I don’t know … any advice from others on teaching basic stats is welcome!

Students must compute relevant statistics as part of the human evaluation assessment mentioned above.

Metric (Automatic) Evaluation

As mentioned, the students have done recall/precision/accuracy evaluations on held-out test sets in a previous course, so in this portion of CS5063 we look at more complex metrics (including NLP metrics such as BLEU) and discuss issues such as validating metrics against gold-standard human evaluations, average-case vs worst-case performance, appropriate baselines, and detecting bias.
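
As a rough illustration of what sits inside a metric like BLEU, here is a toy BLEU-1-style score (clipped unigram precision with a brevity penalty) written from scratch in R; it ignores higher-order n-grams and smoothing, so it is only a teaching sketch, not a substitute for a standard BLEU implementation.

  # Toy BLEU-1: clipped unigram precision multiplied by a brevity penalty.
  bleu1 <- function(candidate, reference) {
    cand <- strsplit(tolower(candidate), "\\s+")[[1]]
    ref  <- strsplit(tolower(reference), "\\s+")[[1]]
    # Clipped counts: each candidate word is credited at most as often
    # as it appears in the reference.
    clipped <- sum(sapply(unique(cand), function(w)
      min(sum(cand == w), sum(ref == w))))
    precision <- clipped / length(cand)
    # Brevity penalty discourages candidates much shorter than the reference.
    bp <- if (length(cand) >= length(ref)) 1 else exp(1 - length(ref) / length(cand))
    bp * precision
  }

  bleu1("the cat sat on the mat", "the cat is on the mat")  # about 0.83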

In some years I have asked students to do an assessment where they pick a published AI paper (often from IJCAI) and write a report assessing the quality of the evaluation in the paper (which was usually based on metrics). I wrote about this and gave an example in a previous blog. It’s a good assessment, and I think students learn a lot from it, especially when they assess a well-known paper and find that its evaluation is deeply flawed. However, it is a lot of work to mark this assessment because I need to understand the papers the students are analysing as well as their reports, so I’ve stopped doing it.

I think this is a really valuable exercise, and I encourage people interested in evaluation to do it! Just don’t expect me to mark it …

Current Research and Commercial Evaluation

I end CS5063 by discussing current research on evaluation at Aberdeen and also giving a commercial perspective on evaluation. The first bit (current research) changes every year; this year I discussed our research on reproducibility of human evaluations, evaluating real-world utility, and evaluating accuracy.

In the second bit (commercial perspective), I discuss things that companies care a lot about but which academics tend to ignore, such as risk, profitability, maintenance, change management (e.g. impact on workflow), and user/client experience. This section is a bit frustrating because I have great material which I cannot share with students, since it is commercially confidential, but anyway I try to get the basic issues across.

Future

When I first taught CS5063 in 2017, I co-taught it with a fellow academic, Nigel Beacham, who said we should make a public version of the course available in some fashion. We didn’t do this at the time, and Nigel moved on to teach other things, but I’ve recently begun to wonder about this again. There is growing interest in evaluation in the research community, which is great, but much (most?) of it is still focused on metrics for evaluating against held-out test sets, which is a very narrow perspective on evaluation.

In all honesty I don’t know if I have the time/energy to formally teach a public version of this course, but I could try to make some of the course material available to interested students and academics. If anyone reading this is interested, please let me know what material would be most helpful to you.

