The importance of tracking dataset retractions and updates

Leigh Dodds Data, Data Infrastructure October 30, 2020

There are lots of recent examples of researchers collecting and releasing datasets which end up raising serious ethical and legal concerns. The IBM facial recognition dataset is just one example that springs to mind.

I read an interesting post exploring how facial recognition datasets are being widely used despite being taken down due to ethical concerns.

The post highlights how these datasets, despite being retracted, are still being widely used in research. This is in part because the original datasets are still circulating via mirrors of the original files. It is also because they have been incorporated into derived datasets which are still being distributed with the original contents intact.

The authors describe how just one dataset, the DukeMTMC dataset, was used in more than 135 papers after being retracted, 116 of those drawing on derived datasets. Some datasets have many derivatives: one example cited has been used in 14 derived datasets.

The research raises important questions about how datasets are published, mirrored, used and licensed. There’s a lot to unpack there and I look forward to reading more about the research. The concerns around open licensing are reminiscent of similar debates in the open source community that led to a set of “ethical open source licences”.

But the issue I wanted to highlight here is the difficulty of tracking the mirroring and reuse of datasets.

Change notification is a missing piece of our data infrastructure.

If it were easier to monitor important changes to datasets, then it would be easier to:

maintain mirrors of data
retract or remove data that breached laws or social and ethical norms
update derived datasets to remove or amend data
re-run analyses against datasets which have seen significant corrections or revisions
assess the impacts of poor quality or unethically shared data
proactively notify relevant communities of potential impacts relating to published data
monitor and review the reasons why datasets get retracted
…etc, etc
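
To make a couple of those items concrete, here is a minimal sketch of what consuming change notices might look like for someone maintaining mirrors or derived datasets. Everything in it is an assumption made for illustration: the `ChangeNotice` and `Holding` records, their fields, the `affected_holdings` function and the dataset identifiers are hypothetical and do not describe any existing standard or service.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class ChangeType(Enum):
    CORRECTION = "correction"
    REVISION = "revision"
    RETRACTION = "retraction"


@dataclass(frozen=True)
class ChangeNotice:
    """Hypothetical notice a publisher might issue about a dataset."""
    dataset_id: str
    change_type: ChangeType
    issued: date
    reason: str


@dataclass(frozen=True)
class Holding:
    """A mirror or derived dataset, and the source it was built from."""
    name: str
    source_id: str
    fetched: date


def affected_holdings(holdings, notices):
    """Map each holding to the notices issued for its source on or after it was fetched."""
    result = {}
    for holding in holdings:
        relevant = [
            n for n in notices
            if n.dataset_id == holding.source_id and n.issued >= holding.fetched
        ]
        if relevant:
            result[holding.name] = sorted(relevant, key=lambda n: n.issued)
    return result


if __name__ == "__main__":
    # Invented identifiers and dates, purely for illustration.
    notices = [
        ChangeNotice("faces-v1", ChangeType.RETRACTION, date(2019, 6, 1), "consent concerns"),
        ChangeNotice("stats-2020", ChangeType.CORRECTION, date(2020, 3, 1), "fixed totals"),
    ]
    holdings = [
        Holding("faces-v1-mirror", "faces-v1", date(2018, 1, 1)),
        Holding("faces-derived", "faces-v1", date(2018, 5, 1)),
        Holding("stats-copy", "stats-2020", date(2020, 4, 1)),
    ]
    for name, found in affected_holdings(holdings, notices).items():
        for n in found:
            print(f"{name}: {n.change_type.value} ({n.reason})")
```

The sketch only matches direct copies to their source. Following a retraction through several generations of derived datasets would need each derivative to record its own sources, which is exactly the kind of tracking that is hard to do today.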

The importance of these activities can be seen in other contexts.

For example, Retraction Watch is a project that monitors retractions of research papers. CrossMark helps to highlight major changes to published papers including corrections and retractions.

Principle T3: Orderly Release, of the UK Statistics Authority code of practice, explains that scheduled revisions and unscheduled corrections to statistics should be transparent, and that organisations should have a specific policy for how they are handled.

More broadly, product recalls and safety notices are standard for consumer goods. Maybe datasets should be treated similarly?

This feels like an area that warrants further research, investment and infrastructure. At some point we need to raise our sights from setting up even more portals and endlessly refining their feature sets and think more broadly about the system and ecosystem we are building.
