The Numbers Are Lying to You (A Little)

Part 1 of a series on AI benchmarking: why a 2-point gap on a leaderboard tells you almost nothing, and how the tests themselves stopped being trustworthy

Every few weeks, a new model drops, and within hours someone has posted a chart. MMLU, GPQA, HumanEval, whatever the flavor of the month is. Model A scores 91.2%. Model B scores 89.7%. Headlines get written. LinkedIn posts get written (guilty). Procurement decisions, in some cases, get influenced.

Here’s the uncomfortable question worth sitting with: what does a 1.5-point difference on a benchmark actually mean? Does it mean Model A is better? Marginally better? Better at the specific 14,000 multiple-choice questions in that benchmark, but not necessarily at the task you’re going to throw at it tomorrow?

This is the first part of a series looking at AI benchmarking, why we trust these numbers as much as we do, and what the research says about whether that trust is earned. Spoiler: it’s earned less than you’d think, and the reasons are more interesting than “benchmarks are flawed” (which, sure, everything is flawed). The reasons are structural, and they’re getting worse, not better, as AI systems get more capable.

The Saturation Problem

Let’s start with the simplest issue: many of the benchmarks everyone cites have stopped being useful, because the models have gotten too good at them.

MMLU (Massive Multitask Language Understanding) has been one of the default benchmarks for years. It’s a broad test, 57 subjects, multiple choice, covering everything from elementary math to professional law. For a long time, MMLU scores were a genuinely useful signal: a model scoring 70% was meaningfully more capable than one scoring 50%.

Frontier models are now scoring north of 88% on MMLU and its harder successor, MMLU-Pro. At that level, the test has a problem: it’s running out of room. If the realistic ceiling is somewhere around 95% (because some questions are ambiguous, mislabeled, or just badly written, and no model, however capable, will get those “right”), then the entire useful range of the test has compressed into a 7-point band. A 1.5-point difference inside that band could reflect a real capability gap. It could also reflect which exact mislabeled questions each model happened to get “right” by accident.
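To see how mislabeled questions alone can open up a gap like that, here’s a rough sketch. It assumes a 5% mislabel rate (chosen only to match the roughly 95% ceiling above) and two hypothetical models that are equally good on the clean questions but differ in how often they happen to match a wrong label. The function name and every number here are illustrative assumptions, not measurements of any real model or benchmark.

```python
def measured_score(clean_accuracy: float, mislabel_rate: float,
                   wrong_label_match_rate: float) -> float:
    """Score a benchmark would report when some of its answer keys are wrong.

    clean_accuracy:         accuracy on correctly labeled questions
    mislabel_rate:          fraction of questions whose answer key is wrong
    wrong_label_match_rate: how often the model's answer happens to match
                            the wrong key on those questions
    """
    clean_part = (1 - mislabel_rate) * clean_accuracy
    mislabeled_part = mislabel_rate * wrong_label_match_rate
    return clean_part + mislabeled_part


MISLABEL_RATE = 0.05  # assumption: matches the ~95% ceiling in the text

# Two hypothetical models with identical ability on clean questions.
model_a = measured_score(0.93, MISLABEL_RATE, wrong_label_match_rate=0.30)
model_b = measured_score(0.93, MISLABEL_RATE, wrong_label_match_rate=0.00)

print(f"model A: {model_a:.2%}")
print(f"model B: {model_b:.2%}")
print(f"gap: {100 * (model_a - model_b):.1f} points")
```

Under these assumptions the two models are identical on every question that has a correct answer key, yet the reported scores sit 1.5 points apart, and the whole gap comes from the 5% of questions nobody should be graded on.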

This is the equivalent of giving a university entrance exam to a room full of postdocs and then ranking them by score. Sure, there will be a ranking. It will tell you almost nothing about who the better researcher is, because the test stopped discriminating between them long before it ran out of questions.

The field’s response has been to build harder tests. Humanity’s Last Exam is one of the more interesting attempts: a benchmark specifically designed to be hard enough that current frontier models top out around 45%, while human domain experts average closer to 90%. That’s a 40+ point gap, the kind of gap that actually tells you something, because there’s clearly room for models to improve and clearly a meaningful difference between “good” and “not good” performance.

But here’s the thing about a benchmark designed to be hard today. Give it eighteen months. The same compression that happened to MMLU will start happening to Humanity’s Last Exam, or whatever replaces it. This isn’t a flaw in any particular benchmark. It’s the nature of measuring a moving target with a fixed ruler. The ruler runs out of marks.

The Contamination Problem

Saturation is the more visible issue. Contamination is the quieter, more troubling one, because it doesn’t just compress the useful range of a benchmark. It can make the benchmark actively lie about what it’s measuring.

Here’s the basic mechanic. Large language models are trained on enormous scrapes of the internet: web pages, forums, code repositories, academic papers, and yes, sometimes, benchmark datasets themselves, or close variants of them, or discussions of them. If a model has seen a benchmark’s questions (or near-identical versions) during training, its “performance” on that benchmark isn’t measuring reasoning ability anymore. It’s measuring memorization. The model isn’t solving the problem. It’s recalling the answer.

This sounds like it should be easy to fix: just check whether benchmark questions appear in training data, and remove them. In practice, it’s much harder than that, for a few reasons. Training datasets are enormous and not always fully documented, even by the people who built them. Contamination doesn’t require an exact match: a close paraphrase, a translated version, or a discussion thread analyzing the benchmark question can each leak signal into a model without showing up in a simple text-match search. And once a new benchmark is published, it tends to get discussed, analyzed, and referenced online almost immediately, which means the clock on contamination starts ticking from day one.
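Here’s a minimal sketch of what a “simple text-match search” looks like and why it misses paraphrases. It checks how many of a question’s word n-grams show up in a document. The window size of 8, the function names, and the sample question are all made up for illustration; real decontamination pipelines choose their own settings and are considerably more involved.

```python
import re


def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lowercase the text, drop punctuation, and return its set of word n-grams."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(question: str, document: str, n: int = 8) -> float:
    """Share of the question's n-grams that also appear in the document."""
    question_grams = ngrams(question, n)
    if not question_grams:
        return 0.0
    return len(question_grams & ngrams(document, n)) / len(question_grams)


# Invented example question, not taken from any real benchmark.
question = ("Which of the following best describes the role of the "
            "mitochondria in a eukaryotic cell?")
verbatim_leak = "Seen on a forum: " + question + " I think the answer is B."
paraphrase = ("In eukaryotes, what is the main job of the mitochondrion? "
              "Pick the most accurate description.")

print(overlap_ratio(question, verbatim_leak))  # 1.0: every n-gram is found
print(overlap_ratio(question, paraphrase))     # 0.0: same question, no match
```

The paraphrase asks exactly the same thing and would leak exactly the same signal, but a word-level match never sees it. Shrinking the window catches more rewordings at the cost of flagging ordinary text that just shares common phrases.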

A recent paper on this problem, “When Benchmarks Leak,” lays out the core dilemma clearly. There are two broad approaches to fixing contamination. One is to identify and strip out the contaminated questions before evaluation, but this changes the benchmark itself, and when contamination is widespread, you can end up with very little benchmark left. The other approach is to try to suppress the contaminated behavior at evaluation time, essentially nudging the model away from “remembering” toward “reasoning.” But these interventions tend to degrade performance on genuinely clean questions too, so you’re trading one kind of inaccuracy for another.

The researchers propose a method (DeconIEP) that applies small, targeted perturbations to how a model processes its input, guided by a reference model that’s less contaminated, to steer the evaluated model away from memorization shortcuts without breaking its performance on legitimate reasoning. It’s a clever approach, and it reportedly works well. But notice what it implies: by 2026, contamination is common enough, and serious enough, that researchers are building entire frameworks just to evaluate models despite it. That’s not a footnote. That’s the field admitting the ruler already has pencil marks on it before you start measuring.

The New Twist: Agents That Cheat by Searching

Here’s where it gets genuinely interesting, and where the problem evolves into something the field hasn’t fully grappled with yet.

A lot of the most capable AI systems today aren’t just language models answering from memory. They’re agents: systems that can search the web, browse documents, and gather information during the task itself. This is, generally, a good thing. It’s also created a brand new flavor of contamination that didn’t exist when benchmarks were just “ask the model a question and see what it says.”

A paper published in June 2026, “Search-Time Contamination in Deep Research Agents,” identifies exactly this problem. When you evaluate a “deep research” agent (one that searches the web as part of answering a question) on a public benchmark, that agent can sometimes find the benchmark itself online. Not just similar information: the actual question, the actual context, in some cases the actual ground-truth answer, sitting in a forum post, a leaderboard writeup, a GitHub repo discussing the benchmark.

The researchers define three escalating severities: the agent might find metadata about the benchmark (mild), find the question with surrounding context that makes it easier (moderate), or find the literal answer (severe). Across six public benchmarks, they found that this kind of search-time contamination is widespread and can inflate measured performance by up to 4 percentage points.

Four percentage points might not sound like much. But remember the saturation discussion above: when frontier models are separated by 1-2 points on many benchmarks, a 4-point inflation from an agent essentially looking up the answer key isn’t noise. It can be the entire margin between first and fifth place on a leaderboard.
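A toy leaderboard makes the point. The model names and scores below are invented purely for illustration; the only number taken from the text is the 4-point upper-end inflation.

```python
# Hypothetical leaderboard: names and scores are invented for illustration.
scores = {
    "model_a": 90.1,
    "model_b": 89.6,
    "model_c": 89.0,
    "model_d": 88.4,
    "model_e": 87.5,
}


def ranking(table: dict[str, float]) -> list[str]:
    """Model names ordered from highest to lowest score."""
    return sorted(table, key=table.get, reverse=True)


# Suppose only the last-place system benefits from search-time contamination.
inflated = dict(scores)
inflated["model_e"] += 4.0  # upper-end inflation figure quoted in the text

print(ranking(scores))    # model_e is last
print(ranking(inflated))  # model_e is first
```

In this made-up table the whole field is separated by 2.6 points, so a 4-point bump to a single entry is enough to move it from the bottom to the top without any change in what the model can actually do.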

What makes this particularly tricky is that it’s not really “cheating” in the way a human would understand cheating. The agent is doing exactly what it’s designed to do: searching for relevant information. The benchmark just wasn’t designed with the assumption that the thing being tested could go look up the test. It’s a bit like giving someone an exam, telling them they’re allowed to use the internet, and then being surprised when they find the exam’s answer key was posted on a public forum three years ago.

The researchers’ recommendations are sensible: isolated sandboxes (so agents can’t reach the open internet during evaluation), transparent search trajectories (so you can see what an agent actually looked up), and controlled benchmark access (so benchmark content isn’t just sitting on the public internet for agents to stumble onto). All reasonable. All of them also add a meaningful amount of engineering work that most benchmark evaluations, including the ones generating those leaderboard charts you see on LinkedIn, are not doing today.
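If you do have an agent’s search trajectory, even a crude audit is possible. The sketch below scans the pages an agent retrieved and reports the worst of the three severities described earlier. Everything in it is an assumption made for illustration: the log format (a list of URL and text pairs), the mapping from severities to substring checks, and the made-up benchmark “ExampleBench.” It is not the method from the paper, and plain substring matching will both miss paraphrases and over-flag short answers.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    NONE = 0
    METADATA = 1  # page mentions the benchmark itself
    QUESTION = 2  # page contains the question text
    ANSWER = 3    # page contains the question and its ground-truth answer


@dataclass
class RetrievedPage:
    url: str
    text: str


def normalize(s: str) -> str:
    """Lowercase and collapse whitespace so formatting differences don't hide a match."""
    return " ".join(s.lower().split())


def classify_page(page: RetrievedPage, benchmark: str,
                  question: str, answer: str) -> Severity:
    text = normalize(page.text)
    name = normalize(benchmark)
    if normalize(question) in text:
        # Crude check: a very short answer string can match by coincidence.
        return Severity.ANSWER if normalize(answer) in text else Severity.QUESTION
    if name in text or name in page.url.lower():
        return Severity.METADATA
    return Severity.NONE


def audit_trajectory(pages: list[RetrievedPage], benchmark: str,
                     question: str, answer: str) -> Severity:
    """Worst severity seen anywhere in a single agent run."""
    return max((classify_page(p, benchmark, question, answer) for p in pages),
               default=Severity.NONE)


# Invented benchmark item and invented pages.
QUESTION = "In what year was the fictional Zubrowka Treaty signed?"
ANSWER = "1932"
pages = [
    RetrievedPage("https://example.com/blog",
                  "A review of agent evaluation methods."),
    RetrievedPage("https://example.com/examplebench-leaderboard",
                  "ExampleBench results for this quarter."),
    RetrievedPage("https://example.com/forum/thread-17",
                  "ExampleBench item 17: In what year was the fictional "
                  "Zubrowka Treaty signed? Gold answer: 1932"),
]

print(audit_trajectory(pages[:2], "ExampleBench", QUESTION, ANSWER).name)
print(audit_trajectory(pages, "ExampleBench", QUESTION, ANSWER).name)
```

With only the first two pages the run is flagged as METADATA; add the forum thread and it becomes ANSWER. A result flagged at the top severity is one you’d want to exclude or rerun in a sandbox before it goes anywhere near a chart.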

So What Do You Actually Do With a Benchmark Score?

None of this means benchmarks are useless. It means they need to be read the way you’d read any single data point: with context, and with appropriate skepticism about precision.

A few practical filters worth applying when you see a benchmark claim:

Treat small differences as noise, not signal. If two models are within 1-3 points of each other on a saturated benchmark like MMLU, treat them as roughly equivalent on that dimension. The difference is far more likely to reflect quirks of the test than a real capability gap.

Ask how old the benchmark is. A benchmark that’s been public for two years has had two years for its content to leak into training data, get discussed online, and become searchable. A genuinely “live” or continuously refreshed benchmark is more resistant to this, though even live benchmarks aren’t immune to search-time contamination if agents can browse the open web during evaluation.

Be especially skeptical of agent benchmarks where the agent has internet access. This is the newest and least well-understood failure mode. If a benchmark result comes from a system that searches the web as part of its operation, ask whether the evaluation controlled for what that system could find about the benchmark itself.

Look for benchmarks designed with a large gap between current performance and the ceiling. Tests like Humanity’s Last Exam, where even the best models are far from saturating, currently provide more signal than tests where everyone is bunched in the high 80s and 90s. That said, expect this to be temporary. Today’s hard benchmark is next year’s saturated one.

Remember that a benchmark measures the benchmark, not your use case. Even a perfectly clean, perfectly unsaturated benchmark is still measuring performance on a specific, fixed set of tasks designed by researchers, not the messy, open-ended, context-heavy work you’re actually going to ask an AI system to do. This is a thread we’ll pick up properly in Part 3 of this series, when we look at the gap between benchmark performance and real-world deployment for AI agents specifically, and the numbers there are even more striking than anything in this article.
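The first filter in the list above is the easiest one to put numbers on. The sketch below estimates how large a gap sampling noise alone could produce, using the standard normal approximation for the difference between two proportions. It treats the two scores as independent, which is a simplification (models graded on the same questions are really a paired comparison), and the benchmark sizes are assumptions picked to span a small test set up to roughly the 14,000 questions mentioned earlier. It says nothing about label errors or contamination, which are separate problems.

```python
import math


def diff_margin(p_a: float, p_b: float, n: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for the gap between two accuracies.

    Treats each score as an independent binomial proportion over n questions.
    """
    standard_error = math.sqrt(p_a * (1 - p_a) / n + p_b * (1 - p_b) / n)
    return z * standard_error


p_a, p_b = 0.912, 0.897  # the example scores from the introduction

for n in (164, 1_000, 14_000):  # assumed benchmark sizes
    margin = diff_margin(p_a, p_b, n)
    print(f"n={n:>6}: gap = {100 * (p_a - p_b):.1f} pts, "
          f"margin = ±{100 * margin:.1f} pts")
```

On a test with a couple of hundred items, a 1.5-point gap is far inside the margin, so it’s indistinguishable from luck. On a very large question set the sampling margin shrinks below the gap, which is exactly why mislabeled questions and contamination become the main things to worry about there.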

Where This Leaves Us

The honest summary of Part 1 is this: benchmark scores are real measurements of something, but that something is narrower, noisier, and more fragile than the confident percentages on a leaderboard suggest. Saturation compresses the useful range of established tests. Contamination, in both its classic form (test data in training data) and its newer form (agents searching for the answer key during evaluation), inflates scores in ways that are genuinely difficult to detect and correct for.

None of this is a reason to ignore benchmarks. It’s a reason to treat them the way a good analyst treats any single source: useful, informative, and absolutely not the whole story.

In Part 2, we’ll look at what happens when benchmark scores stop being a measurement and start being a target, and why that distinction matters more than most people realize when a leaderboard position translates directly into funding, press coverage, and market position.

Sources:

Eriksson et al. “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation.” arXiv:2502.06559

“When Benchmarks Leak: Inference-Time Decontamination for LLMs” (DeconIEP). arXiv:2601.19334

Wang et al. “Search-Time Contamination in Deep Research Agents: Measuring Performance Inflation in Public Benchmark Evaluation.” arXiv:2606.05241