arXiv 2602.12413

Soft Contamination Means Benchmarks Test Shallow Generalization

If LLM training data is polluted with benchmark test data, then benchmark performance gives biased estimates of out-of-distribution (OOD) generalization. Typical decontamination filters use n-gram matching, which fails to detect semantic duplicates: sentences with equivalent (or near-equivalent) content that are not close in string space. We study this soft contamination of training data by semantic duplicates. Among other experiments, we embed the OLMo 3 training corpus and find that: 1) contamination remains widespread, e.g. we find semantic duplicates for 78% of CodeForces and exact duplicates for 50% of ZebraLogic problems; 2) including semantic duplicates of benchmark data in training does improve benchmark performance; and 3) when finetuning on duplicates of benchmark datapoints, performance also improves on truly-held-out datapoints from the same benchmark. We argue that recent benchmark gains are thus confounded: the prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data and effective test data in growing training corpora.

Authors: Ari Spiesberger, Juan J. Vazquez, Nicky Pochinkov, Tomáš Gavenčiak, Peli Grietzer, Gavin Leech, Nandi Schoots
Year: 2025
Model studied: OLMo 3 Instruct
01 — The Problem

Are benchmark gains real?

Training corpora have grown 10,000× since 2020. More data means more chances to accidentally include the test.

"Soft contamination" — the training data contains benchmark items that aren't exact string matches but are semantically equivalent. N-gram decontamination misses these entirely.

"Shallow generalization" — the model gets better at that benchmark's format and domain, but the capability doesn't actually transfer. It learned the shape of the test, not the underlying skill.

What's new here

  • Scale: Searched 1% of OLMo 3's pretraining corpus (Dolma3, Dolmino) and all of the instruction-tuning / RL data (Dolci SFT, DPO, RL)
  • Realistic rates: Tests 5% contamination in fine-tuning data — not an artificial stress test, just what plausibly accumulates
  • Split evaluation: Train on half the benchmark, test on both halves. If unseen items improve too, that's shallow generalization
  • Cross-benchmark controls: Does fine-tuning on MuSR help on TrueDetective? No. The gains are benchmark-shaped, not capability-shaped
02 — Method

How the authors found the duplicates

Step 1 (Embed): Embed benchmarks + 1% of pretraining data (Dolma3/Dolmino) and all instruction-tuning data (Dolci SFT, DPO, RL) using llama-embed-nemotron-8b (#2 on MTEB).
Step 2 (Match): Compute cosine similarity between benchmark items and training documents.
Step 3 (Annotate): Use Gemini 3 Flash to classify high-similarity matches as semantic duplicates.
Step 4 (Finetune): Train OLMo 3 on duplicates (LoRA), then evaluate seen vs. unseen benchmark splits.
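
A minimal sketch of steps 1–2 (embed and match), not the authors' pipeline: it assumes embeddings come from a sentence-transformers model (a small stand-in is used in place of llama-embed-nemotron-8b), and the similarity threshold is illustrative. Whatever this returns would then go to the LLM annotator in step 3.

```python
# Sketch of the embed-and-match stages (steps 1-2). Not the authors' code: the embedding
# model is a small stand-in for llama-embed-nemotron-8b; the threshold is illustrative.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # stand-in embedder

def top_matches(benchmark_items, training_docs, k=5, threshold=0.8):
    """For each benchmark item, return its k most similar training docs above threshold."""
    bench = model.encode(benchmark_items, normalize_embeddings=True)
    train = model.encode(training_docs, normalize_embeddings=True)
    sims = bench @ train.T  # cosine similarity, since embeddings are unit-normalized
    results = []
    for i in range(len(benchmark_items)):
        order = np.argsort(-sims[i])[:k]
        results.append([(int(j), float(sims[i, j])) for j in order if sims[i, j] >= threshold])
    return results  # candidate pairs to pass to the LLM annotator (step 3)
```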

Benchmarks studied

Benchmark     Type                       Size
MBPP          Python coding              257 tasks
CodeForces    Competitive programming    468 problems
MuSR          Murder mystery reasoning   250 problems
ZebraLogic    Logic grid puzzles         1,000 puzzles

Cosine similarity distributions

Fig 6. Cosine similarity distributions in Dolma base training data: MBPP and CodeForces.
03 — Contamination in the Wild

The data

MBPP: 100% semantic contamination

Every MBPP problem has at least one semantic duplicate in the training data. Zero exact duplicates — so n-gram decontamination misses all of it.
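
A small illustration of the blind spot (not from the paper; both prompts are invented): an overlap filter on long n-grams, the usual decontamination tool, sees nothing shared between a task and its paraphrase.

```python
# Why n-gram decontamination misses semantic duplicates: a paraphrased coding task shares
# no long n-grams with the original, so an overlap filter passes it. Prompts are invented.
def ngrams(text, n=8):
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

original = "Write a python function to check whether the given number is a perfect square or not."
paraphrase = "Implement is_perfect_square(n), returning True exactly when n equals some integer times itself."

print(len(ngrams(original) & ngrams(paraphrase)))  # 0 -> the filter reports no contamination
```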

Fig 2a. Cosine similarity vs. semantic duplication rate for MBPP.
Fig 2b. Cosine similarity vs. semantic duplication rate for CodeForces.
Fig 4a. Semantic duplicate occurrence by training dataset (MBPP).
Fig 4b. Semantic duplicate occurrence by training dataset (CodeForces).

ZebraLogic: exact copies in RL data

Figure 1. Proportion of ZebraLogic problems with exact duplicates by grid size. Larger puzzles (4×4+) have 70%+ contamination.
Figure 3. Contamination by Elo rating. Difficulty doesn't matter — easy and hard problems are equally contaminated.
04 — Finetuning Results

Finetuning on duplicates improves benchmark scores

The authors split each benchmark in half, finetuned on semantic duplicates of one half, and evaluated on both. Performance improves on held-out items from the same benchmark — but not on related benchmarks.
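
A sketch of that split-and-evaluate protocol under stated assumptions: `duplicates_of`, `lora_finetune`, and `evaluate` are hypothetical callables supplied by the caller, standing in for the duplicate lookup, a standard PEFT/LoRA training loop, and each benchmark's own scoring harness; nothing here is the authors' implementation.

```python
# Sketch of the split-benchmark protocol: finetune only on training-corpus duplicates of one
# half, then score both halves plus a related control benchmark. `duplicates_of`,
# `lora_finetune`, and `evaluate` are hypothetical callables supplied by the caller.
import random

def split_protocol(benchmark_items, duplicates_of, lora_finetune, evaluate,
                   base_model, control_benchmark, seed=0):
    items = list(benchmark_items)
    random.Random(seed).shuffle(items)
    half = len(items) // 2
    seen, unseen = items[:half], items[half:]

    # Train only on duplicates of the "seen" half; the "unseen" half is never touched.
    train_docs = [doc for item in seen for doc in duplicates_of(item)]
    tuned = lora_finetune(base_model, train_docs)

    return {
        "seen": evaluate(tuned, seen),                  # items whose duplicates were trained on
        "unseen": evaluate(tuned, unseen),              # truly held-out items, same benchmark
        "control": evaluate(tuned, control_benchmark),  # related benchmark (e.g. TrueDetective)
    }
```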

MuSR (Murder Mystery Reasoning)

Accuracy after finetuning

Condition                  Seen (trained on)   Unseen (held out)
Baseline                   66.0%               28.3%
Exact duplicates           87.9%               87.3%
Semantic (Level 1)         85.8%               86.2%
Semantic (Level 3)         87.5%               87.9%
TrueDetective (control)    ~28%                ~28%

MBPP (Python Coding)

Accuracy after finetuning

Condition                  Seen (trained on)   Unseen (held out)
Baseline                   46.4%               55.3%
Exact duplicates           63.0%               48.8%
Semantic (Python)          55.1%               53.6%
HumanEval (control)        67.0%               67.0%
05 — Ecological Validity

Even 5% contamination is enough

About 4 in 10,000 training data points are semantic duplicates of a given benchmark item. That's not a lot. It's enough.

OLMo 3 — Ecologically valid finetuning

Condition                 Seen      Unseen    Δ vs clean
Baseline (no FT)          42.8%
Clean finetuning          50.0%     28.0%
5% contaminated           66.4%     54.4%     +16.4 / +26.4
TrueDetective (ctrl)      ~28% (unchanged)

5% of fine-tuning data replaced with semantic duplicates. Result: +16 points on seen items, +26 on unseen. That's a massive swing from a tiny contamination rate.
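
A minimal sketch of what the "5% contaminated" condition amounts to, assuming the clean finetuning mix and the pool of confirmed semantic duplicates are plain lists of examples; the helper name and sampling details are illustrative, not the paper's setup.

```python
# Build an "ecologically valid" finetuning mix: swap a small fraction (default 5%) of the
# clean examples for semantic duplicates of benchmark items. Illustrative sketch only.
import random

def contaminate(clean_examples, semantic_duplicates, rate=0.05, seed=0):
    rng = random.Random(seed)
    n_dupes = min(int(rate * len(clean_examples)), len(semantic_duplicates))
    kept_clean = rng.sample(clean_examples, len(clean_examples) - n_dupes)
    mix = kept_clean + rng.sample(semantic_duplicates, n_dupes)
    rng.shuffle(mix)
    return mix  # same total size as the clean mix, ~5% of it benchmark duplicates
```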

TrueDetective (a related reasoning benchmark) stayed flat at ~28%. The model didn't get better at reasoning. It got better at this specific benchmark.

Nobody needs to contaminate on purpose. Modern training pipelines accumulate enough semantic duplicates naturally to meaningfully inflate scores.

06 — What This Means

Benchmark progress is confounded

  • Decontamination has a blind spot. N-gram overlap only catches exact matches. When contamination is semantic, it passes through undetected. MBPP: 100% semantic contamination, 0% exact.
  • More data = more contamination. Corpora grew 10,000× since 2020. Frontier models (Llama 4: 30T tokens) have vastly more surface area for undetected duplicates.
  • Gains don't transfer. MuSR went from 28% to 87% on unseen items. TrueDetective (similar task type) didn't budge. This looks like benchmark-specific learning, not general reasoning improvement.
  • RL amplifies contamination. ZebraLogic had 70%+ exact duplicates in the RL data specifically. RL on benchmark-like items is an especially efficient way to inflate scores.
  • Synthetic data makes it worse. Modern pipelines deliberately create paraphrases of existing content. This increases soft contamination density beyond what occurs naturally.

Contamination scales with investigation

Fig 5a. MBPP — more matches checked = more duplicates found. The true rate is higher than any single check reveals.
Fig 5b. CodeForces — same pattern. Contamination keeps growing with deeper search.

As more of the training data is searched, more duplicates are found. The numbers reported are likely lower bounds — only 1% of the pretraining corpus was searched (though all instruction-tuning and RL data was covered).
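
A sketch of how such a curve can be computed, under the assumption that each benchmark item comes with similarity-ranked candidate matches and that `is_duplicate` stands in for the LLM-annotation step; because the curve can only rise with search depth, any finite depth gives a lower bound on the true contamination rate.

```python
# Contamination rate as a function of how many top matches per item are checked (cf. Fig 5).
# `ranked_matches` maps each benchmark item to its candidate matches, best-first;
# `is_duplicate` stands in for the LLM annotator.
def contamination_curve(ranked_matches, is_duplicate, max_k=20):
    n_items = len(ranked_matches)
    curve = []
    for k in range(1, max_k + 1):
        hits = sum(
            any(is_duplicate(item, match) for match in matches[:k])
            for item, matches in ranked_matches.items()
        )
        curve.append((k, hits / n_items))  # non-decreasing in k
    return curve
```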

07 — Conclusions

What this means

01 — It's everywhere. 78–100% of benchmark problems have semantic duplicates in training data. Decontamination efforts aren't catching this.

02 — It inflates scores. Training on semantic duplicates boosts performance on both seen and unseen benchmark items. The effect is large and consistent.

03 — It doesn't transfer. Gains stay within the benchmark. Related tasks see no improvement. This is benchmark-shaped learning, not capability.

04 — 5% is enough. Ecologically plausible contamination rates produce double-digit score inflation. No adversarial intent needed.

05 — Progress is confounded. Benchmark gains reflect both real capability and accumulated test data. Right now, we can't tell which is which.

"The prevalence of soft contamination means gains reflect both genuine capability improvements and the accumulation of test data."
