
[2302.10149] Poisoning Web-Scale Training Datasets is Practical

Source: https://arxiv.org/abs/2302.10149

[Submitted on 20 Feb 2023]


Deep learning models are often trained on distributed, web-scale datasets crawled from the internet. In this paper, we introduce two new dataset poisoning attacks that intentionally introduce malicious examples to degrade a model's performance. Our attacks are immediately practical and could, today, poison 10 popular datasets. Our first attack, split-view poisoning, exploits the mutable nature of internet content to ensure a dataset annotator's initial view of the dataset differs from the view downloaded by subsequent clients. By exploiting specific invalid trust assumptions, we show how we could have poisoned 0.01% of the LAION-400M or COYO-700M datasets for just $60 USD. Our second attack, frontrunning poisoning, targets web-scale datasets that periodically snapshot crowd-sourced content -- such as Wikipedia -- where an attacker only needs a time-limited window to inject malicious examples. In light of both attacks, we notify the maintainers of each affected dataset and recommend several low-overhead defenses.
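The split-view attack works because datasets like LAION-400M distribute only URLs, and the content behind a URL can change between annotation and download. A natural low-overhead mitigation is for the dataset to also distribute a cryptographic digest of each example recorded at annotation time, so downloaders can reject content that has since been replaced. The sketch below is illustrative only (the index format and field names are hypothetical, not the actual LAION or COYO schema):

```python
import hashlib

def sha256_of(data: bytes) -> str:
    """Hex digest of the raw bytes of a downloaded example."""
    return hashlib.sha256(data).hexdigest()

def verify_example(downloaded: bytes, expected_digest: str) -> bool:
    # Recompute the hash and compare it against the digest recorded
    # when the annotator first viewed the content.
    return sha256_of(downloaded) == expected_digest

# Hypothetical dataset index: each entry pairs a URL with the digest
# captured at annotation time.
index = [
    {"url": "https://example.com/cat.jpg",
     "sha256": sha256_of(b"original image bytes")},
]

# A split-view attacker who later swaps the content behind the URL
# fails verification, because the bytes no longer match the digest.
tampered = b"attacker-substituted bytes"
print(verify_example(b"original image bytes", index[0]["sha256"]))  # True
print(verify_example(tampered, index[0]["sha256"]))                 # False
```

The check adds one hash per example at download time, which is why the paper can describe such defenses as low-overhead; it does not help against frontrunning poisoning, where the malicious content is present (and hashed) at snapshot time.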

Subjects: Cryptography and Security (cs.CR); Machine Learning (cs.LG)
Cite as: arXiv:2302.10149 [cs.CR]
  (or arXiv:2302.10149v1 [cs.CR] for this version)
  https://doi.org/10.48550/arXiv.2302.10149
