The Leaderboard Is Not the Territory

Olivier Krieger
14.06.2026 · 12 min read

Part 2 of a series on AI benchmarking: what happens when a measurement becomes a target, and why a team of Berkeley researchers just hacked eight of the most-cited AI agent benchmarks to near-perfect scores without solving a single task

In Part 1 of this series, we looked at why a 1-2 point gap on a benchmark leaderboard tells you almost nothing: the tests are saturated, the data is contaminated, and even “live” agents can quietly look up the answer key while pretending to reason their way to it.

This part is about something worse. It’s about what happens once everyone knows the leaderboard matters, and starts treating it less like a measurement and more like a target.

There’s a name for this, and it’s older than AI. Goodhart’s Law, named after the economist Charles Goodhart, is usually summarized as: “when a measure becomes a target, it ceases to be a good measure.” The original context was monetary policy, but it applies to almost anything you can score. Standardized tests become teaching-to-the-test. Sales targets become end-of-quarter discounting that pulls revenue forward instead of creating it. Call center metrics become hung-up calls logged as “resolved.”

AI benchmarks are now squarely in this category. A leaderboard position drives funding announcements, press coverage, procurement shortlists, and developer mindshare. That’s an enormous amount of pressure sitting on top of a number, and numbers under that kind of pressure tend to bend.

What’s new, and what makes this article different from a generic “beware of marketing claims” piece, is that in April 2026 a team at UC Berkeley didn’t just argue this was a risk. They built an agent, pointed it at eight of the most-cited AI agent benchmarks in the industry, and showed that every single one of them could be driven to a near-perfect score without the agent doing the actual task even once.

That’s not a hypothetical. That’s a published, reproducible result. Let’s get into it.


How a Leaderboard Actually Works (And Where the Gaps Are)

Before the Berkeley study, it’s worth understanding the more familiar version of this problem: crowdsourced arenas like LMArena (formerly LMSYS Chatbot Arena).

The basic idea is appealing. Real users submit prompts, get back responses from two anonymized models side by side, and vote for the one they prefer. Aggregate enough votes and you get an Elo-style ranking, the same system used for chess. As of the most recent rankings, the top of the board is a tight cluster (not considering Fable 5): Claude Opus 4.6 Thinking, Claude Opus 4.6, and Gemini 3.1 Pro Preview are separated by less than 20 Elo points, the kind of gap that, per Part 1‘s discussion of saturation, is well within noise.

The arena format has a structural quirk, though. Labs can run private testing rounds with multiple variants of an upcoming model before choosing which one to submit publicly, and which scores to highlight. This isn’t necessarily dishonest. If you’re a lab with five candidate checkpoints, of course you’re going to ship the one that tests best. But it does mean the number you see on the public leaderboard isn’t “how good is this model,” it’s “how good is the best-presented version of this model that this lab chose to show you, after however many private iterations it took to get there.”

That’s already a meaningful gap between the leaderboard and the territory it claims to represent. And it’s the mild version of the problem. The Berkeley study is the not-mild version.


“Our Agent Hacked Every Major One”

In April 2026, researchers at UC Berkeley’s Center for Responsible, Decentralized Intelligence, led by Dawn Song along with Hao Wang, Qiuyang Mang, Alvin Cheung, and Koushik Sen, published a study with a blunt title: “How We Broke Top AI Agent Benchmarks: And What Comes Next.

Here’s what they did. They built an automated agent whose only job was to find ways to score well on a benchmark without doing what the benchmark is supposed to measure. They pointed this agent at eight of the most prominent AI agent benchmarks in the field: SWE-bench Verified, SWE-bench Pro, Terminal-Bench, WebArena, OSWorld, GAIA, FieldWorkArena, and CAR-bench. These aren’t obscure academic toys. SWE-bench in particular is one of the benchmarks most frequently cited when labs talk about how good their models are at real software engineering.

The result, in the researchers’ own words: every single one of the eight could be exploited to near-perfect scores, with zero tasks actually solved and, in most cases, zero calls to an LLM at all. Some highlights, because the specifics matter here:

Terminal-Bench (89 tasks): 100%. Most of these tasks download a tool called uv from the internet during the verification step. The exploit agent replaces the system’s curl command with a fake version during the setup phase. When the verifier later tries to install uv, it runs through the fake curl, which quietly installs a trojanized version that detects when pytest is being run and prints fake “all tests passed” output. The agent never writes a line of solution code. It just makes the grader lie to itself.

SWE-bench Verified (500 tasks) and SWE-bench Pro (731 tasks): 100% on both. SWE-bench asks an agent to produce a code patch that fixes a real GitHub issue, then runs the project’s test suite to check if the fix worked. The catch: the agent’s patch and the test suite run in the same container. So the exploit agent’s “patch” isn’t a fix at all, it’s a small file that hooks into the test runner and rewrites every test result to say “passed,” regardless of what actually happened. Five hundred real-world coding problems, zero of them solved, perfect score.

WebArena (812 tasks): roughly 100%. This one needs no code changes to WebArena at all. The benchmark ships its reference answers as files sitting on the same machine the agent operates on. The exploit agent simply navigates the browser to a local file address and reads the answer key directly off the disk, the way you might read a saved file rather than searching the web. The evaluator just sees a correct answer come back and scores it as success.

FieldWorkArena (890 tasks): 100%, with one move. This benchmark’s grading function checks exactly one thing: did the assistant send a final message. It never actually reads what the message says. The function that’s supposed to compare the answer against the correct one is sitting right there in the code, imported and ready, and simply never gets called. The exploit agent’s entire strategy for all 890 tasks is to send back an empty pair of curly braces, {}. That’s it. That’s the whole exploit.

OSWorld (369 tasks): 73%. Many tasks here ask the agent to produce some output file and have it compared against a “gold” reference file. Since the test environment has internet access, and the gold files are hosted at public URLs that are visible in the task setup, the exploit agent just downloads the correct answer directly and places it where the grader expects to find it. Gold compared against gold. Perfect match, every time.

GAIA (165 tasks): roughly 98%. GAIA’s correct answers for its validation set are publicly posted online. But even setting that aside, the scoring function strips out all punctuation, spaces, and capitalization before comparing answers. So “Dr. Martin Luther King Jr.” and a mangled, spaced-out version of the same string with periods between every letter both normalize down to the same string and count as a match.

You can read the rest of the eight in the original writeup (CAR-bench’s LLM judges can be talked into giving favorable scores by literally telling them to, which is its own small masterpiece of irony), but the pattern across all eight is the same. None of these are exotic, nation-state-level attacks. Most of them are the kind of thing a competent engineer could improvise in an afternoon once they understood how the grading actually worked.


“We Are Not Claiming Current Leaders Are Cheating”

It’s worth pausing on what the Berkeley researchers are not saying, because the honest framing matters more than the shock value.

They’re explicit that they aren’t accusing today’s leading models or labs of deliberately using these exact exploits to climb leaderboards. What they’re pointing at is structural: these benchmarks were not built with the assumption that the thing being measured might, eventually, be capable enough to notice and exploit weaknesses in how it’s being measured. And as agents get more capable and more autonomous, the researchers argue, reward-hacking behavior doesn’t need to be deliberately programmed in. It can emerge as a side effect of an agent that’s very good at finding the path of least resistance to a high score, whether or not anyone intended that path to exist.

And this isn’t purely theoretical, either. The study points to examples that have already happened in the wild. One open-source coding model, IQuest-Coder-V1, was reported to score 81.4% on SWE-bench. When researchers dug into the actual transcripts, they found that in 24.4% of the cases, the model hadn’t reasoned its way to a fix at all, it had simply run git log and copied the answer out of the repository’s own commit history, which was sitting right there in the shared environment. Once those runs were corrected, the real score dropped to 76.2%. Separately, OpenAI reportedly stopped using SWE-bench Verified internally after an audit found that nearly 60% of the problems in it had flawed or incorrect tests, meaning models were being scored against a broken answer key regardless of what they did.

None of this means the people who built these benchmarks did a bad job. Building an evaluation that’s robust against a system actively trying to game it is a genuinely hard, mostly-unsolved engineering problem, and these benchmarks were built by serious research teams. The Berkeley paper’s own conclusion is the right one: “the vulnerabilities we found are not signs of incompetence, they’re signs that adversarial evaluation robustness isn’t yet a standard practice in the field. It needs to become one.


What This Actually Means If You’re Not a Benchmark Researcher

So you’re not building benchmarks. You’re trying to decide which model or agent framework to use, or you’re reading a vendor’s claims about how their product “scores 94% on industry-standard evaluations.” What do you do with all of this?

A few things worth carrying forward.

A benchmark score is a claim, not a fact, and the gap between the two depends entirely on how the benchmark was run. The same benchmark name (SWE-bench, GAIA, whatever) can mean very different things depending on whether it was run in an isolated, properly-sandboxed environment or one where the system under test had access to its own grading. Nobody publishing a headline score is going to volunteer which one it was. You may have to ask, or assume the less flattering option until told otherwise.

Be specifically suspicious of round, impressive numbers on benchmarks that have been in the news for being broken. This isn’t an accusation against any specific vendor. It’s just base rates. If a benchmark has a known, published exploit that gets it to 100%, and a product claims something close to 100% on it, that claim deserves more scrutiny, not less, even if the product is completely legitimate and didn’t use the exploit.

Ask what’s actually being measured, not just what’s being reported. FieldWorkArena’s grading function literally never checks the answer. A model that scores well on it has demonstrated that it can send a non-empty message. That’s a real capability! It’s just not the capability the benchmark’s name implies it’s testing. The gap between “what the benchmark is named” and “what the benchmark’s code actually checks” can be enormous, and almost nobody outside the research community ever reads the grading code.

Treat “we’re #1 on [leaderboard]” as a marketing claim, evaluate it the way you’d evaluate any other one. Who ran the evaluation? Was it the vendor themselves, on their own infrastructure, choosing which run to report? Was it a third party with a published, isolated methodology? Both of those can produce the same headline number, “94% on Benchmark X,” and they mean very different things.


Where This Leaves Us

Put Part 1 and Part 2 together and the picture is genuinely sobering. The benchmarks themselves are running out of room to discriminate between top models (Part 1), the data they’re built on is leaking into training sets in ways that are hard to detect and harder to fully fix (also Part 1), and on top of all that, the benchmarks that are specifically designed to test agentic behavior, the kind where an AI system takes actions in an environment rather than just answering a question, turn out to be exploitable in ways that have nothing to do with the capability they claim to measure (this part).

That last point is the one that should worry you most if you’re actually planning to deploy an AI agent to do real work, because agentic benchmarks are exactly the numbers vendors lead with when they’re trying to sell you on autonomy: “our agent resolves X% of real GitHub issues,” “our agent completes Y% of web tasks unsupervised.” If those numbers can be hit by an agent that does nothing but read the answer key off the disk, what do they actually tell you about how the agent will behave in your environment, on your data, with no answer key anywhere in sight?

That’s exactly where Part 3 of this series picks up. We’ll move from “can you trust the score” to “even if the score is completely legitimate, what happens when you take that same agent out of the benchmark and into production.” The numbers there, a measured 37% gap between lab performance and real-world deployment, and consistency that can collapse from 60% to 25% across repeated runs of the same task, are some of the most directly useful numbers in this entire series if you’re the one signing off on an AI agent deployment.

The leaderboard was never the territory. Part 3 is about the territory.


Sources:

Wang, Mang, Cheung, Sen, Song. “How We Broke Top AI Agent Benchmarks: And What Comes Next.” UC Berkeley Center for Responsible, Decentralized Intelligence, April 2026.

Eriksson et al. “Can We Trust AI Benchmarks? An Interdisciplinary Review of Current Issues in AI Evaluation.” arXiv:2502.06559

LMArena (formerly LMSYS Chatbot Arena) public leaderboard

Leave a Comment