
Using Amazon Mechanical Turk to crowdsource data on the quality of food images



In the last couple of years, we at Grubhub have actively focused on increasing the volume of images shown to our diners on restaurant menu listings. This has had a significantly positive impact on diner conversion rates for our partner restaurants; however, with an increased volume of images being shown to diners came the need to monitor and improve the quality of those images.

Assessing the quality of food images is a difficult problem because it is highly subjective. There are no clear guidelines that guarantee a good-quality food image and, without a decently sized labeled dataset, it is hard to automate the process with artificial intelligence. So we decided to first focus on determining quality scores for a dataset of images using crowdsourcing, hoping that the data could later be used to train a custom quality-rating model with a cloud service such as Google Cloud AutoML Vision. This blog post outlines how we leveraged Amazon Mechanical Turk (MTurk) to get a quality score for images on the Grubhub platform.

MTurk is a service that has been widely used to crowdsource data for machine learning experiments, and it was an obvious choice for this use case because of its seamless integration with the AWS cloud infrastructure we use at Grubhub. MTurk provides easy-to-use interfaces and APIs for posting tasks (known as Human Intelligence Tasks, or HITs), configuring them, viewing HITs in progress, and persisting results. And with a huge number of workers (~500K according to Amazon) registered on the platform, it was the easiest way to reach a massive crowd of online workers.
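
To give a concrete feel for the API, here is a minimal sketch of how a single HIT could be posted with boto3. It targets the MTurk sandbox endpoint, and the question HTML, image URL, reward, and timing values are illustrative placeholders rather than our production configuration.

```python
# A minimal sketch of posting a HIT through the MTurk API with boto3.
# It targets the sandbox endpoint; drop endpoint_url to post real, paid HITs.
# The image URL, question HTML, reward, and timing values are placeholders.
import boto3

mturk = boto3.client(
    "mturk",
    region_name="us-east-1",
    endpoint_url="https://mturk-requester-sandbox.us-east-1.amazonaws.com",
)
print("Available balance:", mturk.get_account_balance()["AvailableBalance"])

# MTurk expects the task layout wrapped in an HTMLQuestion XML envelope.
# In a real task, a small script fills in assignmentId from the URL parameters.
QUESTION_XML = """\
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <form name="mturk_form" method="post" action="https://workersandbox.mturk.com/mturk/externalSubmit">
        <input type="hidden" name="assignmentId" value="" id="assignmentId"/>
        <img src="https://example.com/food-image.jpg" width="400"/>
        <p>How would you rate the overall quality of this food photo?</p>
        <input type="radio" name="quality" value="bad"/> Bad
        <input type="radio" name="quality" value="good"/> Good
        <input type="radio" name="quality" value="great"/> Great
        <p><input type="submit"/></p>
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

response = mturk.create_hit(
    Title="Rate the quality of a food photo",
    Description="Look at one food photo and rate its overall quality.",
    Keywords="food, photo, image, rating",
    Reward="0.05",                    # per-assignment pay in USD
    MaxAssignments=3,                 # number of unique workers per image
    LifetimeInSeconds=3 * 24 * 3600,  # how long the HIT stays available
    AssignmentDurationInSeconds=300,  # time a worker has to finish one assignment
    Question=QUESTION_XML,
)
print("Created HIT:", response["HIT"]["HITId"])
```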

Using Amazon Mechanical Turk

Setting up a task on MTurk that provided accurate results was more of an empirical art in our experience, requiring some experimentation to get it right. Our first attempts used a single scale for each image, rating its overall quality as either “great”, “good”, or “bad”. We quickly put something together that we could show to workers, set a per-HIT price, and opened it up to a set of workers.

[Image: Initial experimental user interface for our HIT]

Unfortunately, the results from our first experimental task fell short of our expectations, so we looked for ways to improve them: refining the task's user interface, writing clearer instructions, paying the workers better, and filtering workers based on location, educational qualifications, and domain expertise related to food or photography.

User Experience

Having an appropriate, easy-to-use user interface (UI) helps workers complete a task efficiently and maximizes their productivity. Workers want tasks they can complete quickly in order to increase their earnings.

With every iteration, we tried to design a UI that:

  • Is easy to use and minimalistic, so workers avoid distractions, thus increasing their productivity.
  • Has easily understandable and visible instructions to complete the task, expressed with few words and ample examples.
  • Fits on smaller laptop/monitor screens to reduce time required for scrolling.

As you can see, our first version was too busy and full of distractions; the instructions were hard to read and rendered poorly on a mobile screen.

Using clear instructions

One problem with the original task was that asking workers to tag a given food image as “bad”, “good”, or “great” was too broad and gave us inconsistent results. To improve the rating instructions, we added three facets that can be used to identify a high-quality food photo:

  • Plating: How well is the food arranged?
  • Lighting: How well is the light used to bring out the food’s good side?
  • Composition: How is the shot framed?

These more granular rating criteria made it easier for MTurk workers to rate the images accurately. To find the best way to present the criteria, we ran experiments that asked workers to rate images on the three facets above in three different ways.

1) Combined instructions: We provided instructions for plating, lighting, and composition together and asked workers to rate one image on all three facets.

[Image: Experiment with rating instructions for all facets and ratings]

2) Illustrative instructions: The previous layout was still too confusing, so we removed the text and provided instructions only in the form of example images. We asked workers to rate one image on its plating, lighting, and composition.

[Image: Experiment where the differences between the three ratings were conveyed entirely through example images]

3) Singular instructions: We also tested tasks that provided instructions for a single facet and asked workers to rate an image on that facet alone. Based on the accuracy of the results, this approach worked best for our task. Below is the layout (shown in sandbox mode) that an MTurk worker was presented with for one image and one facet:

[Image: Final design of the user interface, where workers were given separate written instructions and example images for each facet individually]

As you can see, the instructions are clearer and more specific about what the worker should be looking for. That gave us better results on image quality; as a nice additional benefit, this granular data enables us to provide feedback to restaurants on how a food image could be improved.
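
As a rough illustration of how this singular-instructions design could be driven programmatically, the sketch below generates one HIT question per (image, facet) pair. The facet wording, the template, and the helper name are placeholders of our own, not Grubhub's actual copy or code; each generated question would be posted with a create_hit call like the one in the earlier sketch.

```python
# Sketch of the "singular instructions" design: one HIT per (image, facet) pair,
# each showing only that facet's question. Wording and template are placeholders.
FACETS = {
    "plating": "How well is the food arranged on the plate?",
    "lighting": "How well does the lighting bring out the food's good side?",
    "composition": "How well is the shot framed?",
}

QUESTION_TEMPLATE = """\
<HTMLQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2011-11-11/HTMLQuestion.xsd">
  <HTMLContent><![CDATA[
    <html><body>
      <form name="mturk_form" method="post" action="https://workersandbox.mturk.com/mturk/externalSubmit">
        <input type="hidden" name="assignmentId" value="" id="assignmentId"/>
        <img src="{image_url}" width="400"/>
        <p><strong>{facet_title}.</strong> {facet_question}</p>
        <input type="radio" name="rating" value="bad"/> Bad
        <input type="radio" name="rating" value="good"/> Good
        <input type="radio" name="rating" value="great"/> Great
        <p><input type="submit"/></p>
      </form>
    </body></html>
  ]]></HTMLContent>
  <FrameHeight>450</FrameHeight>
</HTMLQuestion>
"""

def facet_hits(image_url):
    """Yield a (title, question_xml) pair for each facet of a single image."""
    for facet, question in FACETS.items():
        question_xml = QUESTION_TEMPLATE.format(
            image_url=image_url,
            facet_title=facet.capitalize(),
            facet_question=question,
        )
        yield f"Rate the {facet} of a food photo", question_xml
```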

Task parameter tuning

The MTurk requester UI provides an easy way to adjust a task's parameters: its name, description, and the tags that improve its searchability; the number of unique workers each task should be completed by; and the worker qualifications and locations that help reduce human error. We ran experiments with a hand-curated, labeled dataset of images to tune these parameters for our task and ensure the best possible results.

Studying similar MTurk tasks helped us settle on fair compensation for the workers. MTurk also allows requesters to restrict tasks to workers with specific expertise, but drawing from this smaller pool of workers can lead to higher turnaround times and prices.
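
For reference, these knobs map directly onto create_hit arguments. The sketch below shows how a location filter and a minimum approval rate could be expressed; the qualification type IDs are MTurk's built-in system qualifications, while the reward, assignment count, country, and thresholds are illustrative values, not our production settings.

```python
# Sketch of the tunable create_hit parameters: pay, assignments per image, and
# worker filters. The reward, country, and thresholds below are illustrative.
HIT_PARAMS = {
    "Reward": "0.05",      # per-assignment pay, set after comparing similar tasks
    "MaxAssignments": 3,   # unique workers rating each image
    "QualificationRequirements": [
        {
            # Restrict the task to workers in a given country.
            "QualificationTypeId": "00000000000000000071",  # built-in Locale qualification
            "Comparator": "EqualTo",
            "LocaleValues": [{"Country": "US"}],
        },
        {
            # Only accept workers with a high historical approval rate.
            "QualificationTypeId": "000000000000000000L0",  # built-in approval-rate qualification
            "Comparator": "GreaterThanOrEqualTo",
            "IntegerValues": [95],
        },
    ],
}

# Passed alongside the title, description, and question:
# mturk.create_hit(Title=..., Description=..., Question=..., **HIT_PARAMS)
```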

The following aspects can also have an effect on the task turnaround time:

  • The number of workers available: The pool of workers available to complete a task shrinks when we specify qualification and region filters. Availability also depends on workers' willingness to take up the task at the offered wage.
  • The complexity of the task: Complex tasks, such as address verification or outlining and tagging objects in pictures, increase the average time a worker spends on each task.
  • User interface provided to the workers: A complicated user interface without clear instructions, or one with a faulty user experience, reduces the number of workers willing to take up the task, increasing its turnaround time.
  • Searchability of the task on the worker MTurk application: The newest tasks show up at the top of workers' search results and therefore attract more workers, reducing turnaround time. We recommend re-creating incomplete tasks well before they expire to improve their search visibility (as sketched below). Adding appropriate tags to the task also helps with search result placement.
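
A minimal sketch of that re-creation step, assuming boto3 and a hypothetical helper, rebuild_hit_kwargs, that reconstructs the original create_hit arguments (question, reward, qualifications) for a given HIT; the 12-hour cutoff is illustrative.

```python
# Sketch of re-posting HITs that are close to expiring with work still outstanding,
# so they reappear near the top of workers' search results.
from datetime import datetime, timedelta, timezone

def recreate_expiring_hits(mturk, rebuild_hit_kwargs, window_hours=12):
    """Expire soon-to-lapse HITs that still have open assignments and re-post them."""
    cutoff = datetime.now(timezone.utc) + timedelta(hours=window_hours)
    next_token = None
    while True:
        kwargs = {"MaxResults": 100}
        if next_token:
            kwargs["NextToken"] = next_token
        page = mturk.list_hits(**kwargs)
        for hit in page["HITs"]:
            open_slots = hit["NumberOfAssignmentsAvailable"]
            if open_slots > 0 and hit["Expiration"] <= cutoff:
                # Expire the old HIT immediately, then post a fresh copy
                # covering only the assignments that were never completed.
                mturk.update_expiration_for_hit(
                    HITId=hit["HITId"], ExpireAt=datetime.now(timezone.utc)
                )
                new_kwargs = rebuild_hit_kwargs(hit)  # hypothetical helper
                new_kwargs["MaxAssignments"] = open_slots
                mturk.create_hit(**new_kwargs)
        next_token = page.get("NextToken")
        if not next_token:
            break
```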

Guarding against inaccurate results

While MTurk is an easy-to-use and reliable tool for crowdsourcing data, it is worth noting that every worker will have a different perspective on the task you provide. It is almost certain that you will have instances where the same task yielded different answers from multiple workers. It’s also possible that some workers, while trying to maximize the number of tasks they work on, do not pay attention to the quality of their work.

To tackle these issues, we collected ratings from multiple workers for the same image and used the average of their scores as the image's quality rating, which reduces the effect of anomalies. MTurk also lets you obtain anonymized data on every worker who completed your tasks, which can then be used to weed out workers who repeatedly submit unsatisfactory results.
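
A minimal sketch of that aggregation, assuming each answer arrives as an (image, worker, label) record; the numeric score mapping, the 20-answer minimum, and the 0.8 deviation threshold are illustrative choices of ours, not Grubhub's production values.

```python
# Sketch of turning raw worker answers into per-image ratings and a set of
# workers whose answers consistently disagree with the consensus.
from collections import defaultdict
from statistics import mean

SCORE = {"bad": 1, "good": 2, "great": 3}

def aggregate(answers):
    """`answers` is an iterable of (image_id, worker_id, label) records."""
    by_image = defaultdict(list)
    for image_id, worker_id, label in answers:
        by_image[image_id].append((worker_id, SCORE[label]))

    # Average across workers to damp out individual anomalies.
    image_rating = {
        image_id: mean(score for _, score in votes)
        for image_id, votes in by_image.items()
    }

    # Track how far each worker tends to sit from the per-image consensus;
    # consistently large deviations flag a worker for review.
    deviations = defaultdict(list)
    for image_id, votes in by_image.items():
        for worker_id, score in votes:
            deviations[worker_id].append(abs(score - image_rating[image_id]))
    suspect_workers = {
        worker_id
        for worker_id, devs in deviations.items()
        if len(devs) >= 20 and mean(devs) > 0.8
    }
    return image_rating, suspect_workers
```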

Automating the rating calculation process

The task layout was set up through the MTurk Requester user interface, but we automated the rest of the quality-rating calculation process with daily cron jobs running on Grubhub infrastructure, like so:

  • Data collection: The data collector job runs once every day to aggregate all image assets uploaded onto the Grubhub platform the previous day and writes a dated file to an Amazon S3 bucket.
  • Task creation: The HIT creation cron job reads this file, programmatically creates tasks for MTurk workers using the MTurk API and records the metadata returned by the MTurk service in an Apache Cassandra database.
  • Rating calculation: Independent of the other two jobs, this job fetches previously completed tasks from MTurk, calculates a rating score for the image assets associated with those tasks, and persists the score (a sketch follows the diagram below). It also fetches any expired tasks and re-creates them.
[Image: Architectural overview]
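
To make the rating-calculation step more concrete, here is a rough sketch of how it could pull finished assignments and turn the answer XML back into labels. The hit_records mapping stands in for the HIT metadata the creation job stores in Cassandra, and approval, pagination, and persistence are simplified; none of this is Grubhub's exact production code.

```python
# Sketch of the rating-calculation job: fetch finished assignments for our HITs,
# extract the submitted label from the answer XML, and collect records for the
# aggregation step sketched earlier.
import xml.etree.ElementTree as ET
import boto3

mturk = boto3.client("mturk", region_name="us-east-1")

def answer_label(answer_xml):
    """Pull the submitted radio-button value out of the QuestionFormAnswers XML."""
    root = ET.fromstring(answer_xml)
    for elem in root.iter():
        if elem.tag.endswith("FreeText"):
            return elem.text
    return None

def collect_answers(hit_records):
    """`hit_records` maps HITId -> (image_id, facet), as recorded at HIT-creation time."""
    answers = []
    for hit_id, (image_id, facet) in hit_records.items():
        resp = mturk.list_assignments_for_hit(
            HITId=hit_id, AssignmentStatuses=["Submitted", "Approved"]
        )
        for assignment in resp["Assignments"]:
            label = answer_label(assignment["Answer"])
            if label:
                # Key by (image, facet) so the records plug into the earlier
                # aggregation sketch, which expects (item, worker, label) tuples.
                answers.append(((image_id, facet), assignment["WorkerId"], label))
            if assignment["AssignmentStatus"] == "Submitted":
                # Approve submitted work so workers get paid promptly.
                mturk.approve_assignment(AssignmentId=assignment["AssignmentId"])
    return answers
```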

Conclusion

Our early results from MTurk are promising. Several hundred thousand image assets have been rated on all three facets, and we found that while the workers did a great job identifying photos that deserve a “great” or “bad” quality rating, the accuracy of “good” ratings can still be improved. Examples of the ratings received can be found below:

[Image: Some examples of ratings received]

Using MTurk to crowdsource data can be difficult to get right the first time, but, as this post shows, some experimentation can yield satisfactory results. Future steps for us include using the quality scores to understand possible correlations between the quality of food images on a restaurant's Grubhub menu page and that restaurant's performance on our delivery platform, encouraging restaurants to replace “bad” images to potentially enhance their business, and exploring training an AI model on the data collected from MTurk to classify images by quality.

Do you want to learn more about opportunities with our team? Visit the Grubhub careers page.

