
[2302.10149] Poisoning Web-Scale Training Datasets is Practical

Source: https://arxiv.org/abs/2302.10149

[Submitted on 20 Feb 2023]


Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notify the maintainers of each affected dataset and recommend several low-overhead defenses.
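The split-view attack works because datasets like LAION-400M distribute only URLs, and the content behind a URL can change between annotation and download. A natural low-overhead mitigation is for the dataset to also distribute a cryptographic digest of each example recorded at annotation time, so downloaders can reject content that has since been replaced. The sketch below is illustrative only (the index format and field names are hypothetical, not the actual LAION or COYO schema):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest of the raw bytes of a downloaded example."""
    return hashlib.sha256(data).hexdigest()

def verify_example(downloaded: bytes, expected_digest: str) -> bool:
    # Recompute the hash and compare it against the digest recorded
    # when the annotator first viewed the content.
    return sha256_of(downloaded) == expected_digest

# Hypothetical dataset index: each entry pairs a URL with the digest
# captured at annotation time.
index = [
    {"url": "https://example.com/cat.jpg",
     "sha256": sha256_of(b"original image bytes")},
]

# A split-view attacker who later swaps the content behind the URL
# fails verification, because the bytes no longer match the digest.
tampered = b"attacker-substituted bytes"
print(verify_example(b"original image bytes", index[0]["sha256"]))  # True
print(verify_example(tampered, index[0]["sha256"]))                 # False
```

The check adds one hash per example at download time, which is why the paper can describe such defenses as low-overhead; it does not help against frontrunning poisoning, where the malicious content is present (and hashed) at snapshot time.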

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2302.10149 [cs.CR]
  (or arXiv:2302.10149v1 [cs.CR] for this version)
  https://doi.org/10.48550/arXiv.2302.10149
