Soft Contamination Means Benchmarks Test Shallow Generalization
If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching, which fails to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the OLMo 3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when fine-tuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.
Are benchmark gains real?
Training corpora have grown 10,000× since 2020. More data means more chances to accidentally include the test.
"Soft contamination" — the training data contains benchmark items that aren't exact string matches but are semantically equivalent. N-gram decontamination misses these entirely.
"Shallow generalization" — the model gets better at that benchmark's format and domain, but the capability doesn't actually transfer. It learned the shape of the test, not the underlying skill.
What's new here
- Scale: Searched 1% of OLMo 3's pretraining corpus (Dolma3, Dolmino) and all of the instruction-tuning / RL data (Dolci SFT, DPO, RL)
- Realistic rates: Tests 5% contamination in fine-tuning data — not an artificial stress test, just what plausibly accumulates
- Split evaluation: Train on half the benchmark, test on both halves. If unseen items improve too, that's shallow generalization
- Cross-benchmark controls: Does fine-tuning on MuSR help on TrueDetective? No. The gains are benchmark-shaped, not capability-shaped
How the authors found the duplicates
Benchmarks studied
| Benchmark | Type | Size |
|---|---|---|
| MBPP | Python coding | 257 tasks |
| CodeForces | Competitive programming | 468 problems |
| MuSR | Murder mystery reasoning | 250 problems |
| ZebraLogic | Logic grid puzzles | 1,000 puzzles |
Cosine similarity distributions
The data
MBPP: 100% semantic contamination
Every MBPP problem has at least one semantic duplicate in the training data. Zero exact duplicates — so n-gram decontamination misses all of it.
ZebraLogic: exact copies in RL data
Fine-tuning on duplicates improves benchmark scores
The authors split each benchmark in half, fine-tuned on semantic duplicates of one half, and evaluated on both halves. Performance improves on held-out items from the same benchmark, but not on related benchmarks.
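The split-and-contaminate protocol amounts to simple data plumbing. A sketch under our own naming (function name, duplicate map, and pool structure are illustration, not the authors' code):

```python
import random

def make_splits(benchmark, duplicates_of, clean_pool, contamination=0.05, seed=0):
    """Split a benchmark in half and build a fine-tuning set in which a given
    fraction of examples are semantic duplicates of the *seen* half only.
    `duplicates_of` maps a benchmark item to its duplicates found in the corpus."""
    rng = random.Random(seed)
    items = sorted(benchmark)
    rng.shuffle(items)
    seen, unseen = items[: len(items) // 2], items[len(items) // 2 :]

    n_dup = int(contamination * len(clean_pool))
    dups = [d for item in seen for d in duplicates_of.get(item, [])]
    train = clean_pool[: len(clean_pool) - n_dup] + rng.sample(dups, n_dup)
    rng.shuffle(train)
    # Fine-tune on `train`, then evaluate on both `seen` and `unseen`:
    # if `unseen` improves too, the gain is benchmark-shaped (shallow).
    return train, seen, unseen

bench = [f"q{i}" for i in range(10)]
dup_map = {q: [f"paraphrase of {q}"] for q in bench}
pool = [f"doc{i}" for i in range(100)]
train, seen, unseen = make_splits(bench, dup_map, pool)
print(len(train), len(seen), len(unseen))  # 100 5 5
```

The key control is baked into the construction: duplicates are drawn only from the seen half, so any improvement on the unseen half cannot come from memorizing those exact items.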
MuSR (Murder Mystery Reasoning)
MBPP (Python Coding)
Even 5% contamination is enough
About 4 in 10,000 training data points are semantic duplicates of a given benchmark item. That's not a lot. It's enough.
OLMo 3 — Ecologically valid fine-tuning
| Condition | Seen | Unseen | Δ vs Clean |
|---|---|---|---|
| Baseline (no FT) | 42.8% | — | — |
| Clean finetuning | 50.0% | 28.0% | — |
| 5% contaminated | 66.4% | 54.4% | +16.4 / +26.4 |
| TrueDetective (ctrl) | — | ~28% (unchanged) | — |
5% of fine-tuning data replaced with semantic duplicates. Result: +16 points on seen items, +26 on unseen. That's a massive swing from a tiny contamination rate.
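The Δ column follows directly from the table:

```python
# Reproducing the deltas in the OLMo 3 table above.
clean = {"seen": 50.0, "unseen": 28.0}
contaminated = {"seen": 66.4, "unseen": 54.4}

delta = {k: round(contaminated[k] - clean[k], 1) for k in clean}
print(delta)  # {'seen': 16.4, 'unseen': 26.4}
```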
TrueDetective (a related reasoning benchmark) stayed flat at ~28%. The model didn't get better at reasoning. It got better at this specific benchmark.
Nobody needs to contaminate on purpose. Modern training pipelines accumulate enough semantic duplicates naturally to meaningfully inflate scores.
Benchmark progress is confounded
- Decontamination has a blind spot. N-gram overlap only catches exact matches. When contamination is semantic, it passes through undetected. MBPP: 100% semantic contamination, 0% exact.
- More data = more contamination. Corpora grew 10,000× since 2020. Frontier models (Llama 4: 30T tokens) have vastly more surface area for undetected duplicates.
- Gains don't transfer. MuSR went from 28% to 87% on unseen items. TrueDetective (similar task type) didn't budge. This looks like benchmark-specific learning, not general reasoning improvement.
- RL amplifies contamination. ZebraLogic had 70%+ exact duplicates in the RL data specifically. RL on benchmark-like items is an especially efficient way to inflate scores.
- Synthetic data makes it worse. Modern pipelines deliberately create paraphrases of existing content. This increases soft contamination density beyond what occurs naturally.
Contamination scales with investigation
As more of the training data is searched, more duplicates are found. The numbers reported are likely lower bounds — only 1% of the pretraining corpus was searched (though all instruction-tuning and RL data was covered).
What this means
It's everywhere
78–100% of benchmark problems have semantic duplicates in training data. Decontamination efforts aren't catching this.
It inflates scores
Training on semantic duplicates boosts performance on both seen and unseen benchmark items. The effect is large and consistent.
It doesn't transfer
Gains stay within the benchmark. Related tasks see no improvement. This is benchmark-shaped learning, not capability.
5% is enough
Ecologically plausible contamination rates produce double-digit score inflation. No adversarial intent needed.
Progress is confounded
Benchmark gains reflect both real capability and accumulated test data. Right now, we can't tell which is which.
"The prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data."