A Few Hundred Files to Break an LLM: The New Reality of Data Poisoning

Lara Hanyaloglu
Oct 14
2 min read

A joint study shows that inserting a few hundred malicious documents can create reliable backdoors in large language models - and larger models aren’t immune.

What the study found

A recent joint security study led by Anthropic, together with the UK AI Safety Institute and the Allen Institute, delivers a blunt takeaway: poisoning large language models can be shockingly easy. The researchers showed that by inserting only a few hundred malicious documents into a model’s training data, an attacker can create a reliable backdoor that changes the model’s behavior whenever a specific trigger appears.

How the attack works

These models learn from vast crawls of internet text: billions of pages, forums, scraped documents. If an adversary can slip crafted pages or posts into that mix - pages that contain a particular trigger phrase or pattern - those poisoned samples become part of the corpus the model learns from. Later, when the model encounters that trigger in a prompt, it can reliably emit the attacker’s desired behavior: nonsense, hidden instructions, or even exfiltration of information that should remain private.

The experiments and the surprising result

The researchers tested four model sizes - roughly 600 million, 2 billion, 7 billion and 13 billion parameters - and polluted training runs with three contamination levels: 100, 250 and 500 corrupted documents, each containing a simple trigger token or token sequence. The striking result was that model size didn’t meaningfully protect against poisoning: even 250 poisoned documents could permanently bias outputs across models. Producing and publishing a few hundred web documents is trivial compared to poisoning millions of files, so the attack surface is large and practical.

Why this is worrying

There are two unsettling implications. First, bigger models are not magically more robust to this kind of attack. Intuitively one might expect that more parameters and more training data would dilute the influence of a few bad documents, but the study finds that isn’t the case in practice. Second, the barrier to mounting such an attack is low: creating and posting a few hundred web-hosted documents or forum posts is easy, making supply-chain manipulation an attractive option for bad actors.

Possible mitigations

Mitigations exist, but none are silver bullets. Better dataset hygiene and provenance - tracking where training data came from and favoring curated, audited sources over raw web crawls - is foundational. Detection methods to find anomalous or overly influential samples, adversarial training and robust fine-tuning, careful validation with red-team attacks, and runtime monitoring for suspicious triggers are all important layers. Cryptographic provenance for datasets and stricter norms around data publishing could raise the cost of injecting poisoned content. For high-assurance deployments, human-in-the-loop verification and stricter gating for sensitive outputs remain essential.

This research reframes model safety as a supply-chain problem: model architecture matters, but so does the provenance and integrity of training data. If we want models that are safe in the real world, we must treat training data with the same rigor we apply to algorithms and infrastructure. Otherwise, a few hundred corrupt documents on the open web are enough to turn a sophisticated language model into a tool that can be manipulated for mischief.