Evals Are the New Bottleneck
Twenty-one thousand seven hundred and thirty agent runs. Nine models. One leaderboard sweep.
Forty thousand dollars.
That is what it cost the Holistic Agent Leaderboard team at Hugging Face to run a single standardized evaluation of nine current frontier models, as reported by Clémentine Fourrier on April 29. Some scientific machine-learning evaluations now require up to 3,840 H100-hours per pass. The training run is a Tuesday afternoon. The evaluation is the rest of the week.
For a decade the AI cost story was a training story. Whoever could afford the most compute won. That story is over. The cost of knowing whether a model works has overtaken the cost of building one.
The bottleneck moved while everyone was looking at GPUs.
Most leadership teams haven’t moved with it.
What flipped
Three forces hit the same window.
Training became a commodity. Distilled open weights, parameter-efficient fine-tuning, and a long weekend of GPU time will produce a credible model. The DeepSeek effect was the warning shot. Everything since has been the recoil.
Agents made evaluation combinatorial. An agent task isn’t one model call. It’s dozens of them stacked together. Each step might fail. Each step gets graded. CivBench, an industry agent benchmark, reported that running a single match against Claude Opus 4.6 cost them $1,200. Multiply that by candidate models, by random seeds, by every benchmark you care about. The arithmetic gets ugly fast.
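To make that arithmetic concrete, here is a back-of-the-envelope sketch. The $1,200 per-match figure is the one quoted above; every other number is an illustrative assumption, not anyone's reported spend.

```python
# Back-of-the-envelope agent-eval cost. Only the $1,200 per-match figure
# comes from the text above; the other numbers are illustrative assumptions.
cost_per_run_usd = 1_200   # one agent match against one frontier model
candidate_models = 4       # models under comparison this quarter
seeds_per_model = 3        # repeat runs to average out variance
benchmarks = 5             # agent benchmarks you actually care about

total_runs = candidate_models * seeds_per_model * benchmarks
total_cost_usd = total_runs * cost_per_run_usd

print(f"{total_runs} runs, ${total_cost_usd:,}")  # 60 runs, $72,000
```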
Judges became expensive AND unreliable. Most modern AI evaluation uses one model to grade another’s output. Every grading call costs real compute. And the grader has its own bias, its own variance, and its own bill. Philipp Schmid, a developer-advocate at Hugging Face, captured the moment in March:
“All will depend on how good is our eval.”
He’s not wrong, and he’s not alone in saying it.
Benchmarking is only part of evals
The public benchmarks people quote are saturated and leaking.
The headline benchmarks you see in vendor decks, the ones with names like MMLU and SWE-bench, are saturated: every serious model now scores above 90% on most of them. A senior writer on Hacker News declared that one of the most cited coding benchmarks "no longer measures frontier coding capabilities." The community response has been to build benchmarks that refresh weekly, hide their answer keys, or tag problems by release date so they can't be memorized. The latest of those, published May 1: the leading model passes 66.7% of its problems, and no model clears 70%.
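One of those defenses, tagging problems by release date, is simple enough to sketch. A minimal version, with hypothetical dates, IDs, and field names, assuming you know the candidate model's training cutoff:

```python
# Keep only problems published after the model's assumed training cutoff,
# so they cannot have leaked into its training data. Dates, IDs, and
# field names here are hypothetical.
from datetime import date

MODEL_CUTOFF = date(2026, 1, 31)  # assumed training-data cutoff

problems = [
    {"id": "p-101", "released": date(2025, 11, 3)},
    {"id": "p-212", "released": date(2026, 3, 14)},
]

fresh = [p for p in problems if p["released"] > MODEL_CUTOFF]
print([p["id"] for p in fresh])  # ['p-212']
```

The older items still measure something; just not the thing the leaderboard claims.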
Building a benchmark that resists contamination is itself a serious engineering job. It’s not free. It’s not optional. And it’s still only the easy part.
The hard part is grading the work that’s specific to your domain, on data your model has never seen, with a judging system whose accuracy you can defend six months from now. Anthropic’s BioMysteryBench, published April 29, is the cleanest recent example of what that work looks like. Ninety-nine bioinformatics questions written with domain experts, on real workflow data. Claude Opus 4.6 matched expert baselines on routine work. Anthropic’s unreleased “Mythos Preview” occasionally solved problems a panel of biologists could not. I covered the methodology in detail in Sunday’s deep dive. The numbers are interesting. The methodology is the story.
A domain benchmark that resists contamination, grades the work your team actually does, and stays valid past the next quarterly model release is not a side project. It’s the work. Most enterprise programs don’t have one. Most don’t know they need one.
Why it’s the bottleneck
The thing measuring your model is itself a model.
That sentence sounds like a koan. It isn’t.
The instrument has bias. The instrument has variance. We forgot to write down its calibration date.
A peer-reviewed paper accepted to ICLR 2026 formalized what practitioners already suspected: when you use one large language model to grade another, the judge systematically favors outputs from models in its own family. Two separate industry tests landed the same point in different ways. One found that an automated judge agreed with human raters only 62% of the time. Another found a single agent reporting 89% accuracy on its preferred benchmark and 8.1% on a harder one. Same software. Different judge. An elevenfold gap.
Hamel Husain, a widely followed independent AI consultant, ran a public experiment worth more attention than it got. Ask a model to rigorously review your code. In a new thread, ask the same model to address those review comments. Repeat. Husain's finding:
“This loop always goes on for a ridiculously long time, no matter the harness.”
The thing trying to grade itself can’t converge.
The visibility gap is the other half. In April, a public Reddit thread documented Claude Opus 4.6 quietly dropping its reasoning effort mid-week. No changelog. No benchmark drop. Users noticed before the instruments did. Internal evaluations miss silent quality drift. External user signal catches it. Most programs have neither side wired into the same loop.
Speed equals how fast you can know.
What to do this quarter
Five moves, in order.
Pick one workflow. Build one evaluation that covers it end-to-end. Snapshot it. Tag the version. Resist the urge to grade everything at once. A team with one trustworthy eval ships faster than a team with twelve negotiable ones.
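What "snapshot it, tag the version" can look like in practice, as a minimal sketch with hypothetical file names, case IDs, and fields:

```python
# Freeze the eval set to a file with a version tag and a content hash,
# so any later change to the cases is visible. All names are hypothetical.
import hashlib, json
from datetime import date

eval_cases = [
    {"id": "ticket-001", "input": "Customer asks for a refund...", "expected": "Route to billing."},
    {"id": "ticket-002", "input": "Customer reports a login loop...", "expected": "Escalate to auth team."},
]

blob = json.dumps(eval_cases, sort_keys=True).encode()
snapshot = {
    "version": "support-triage-eval/v1",
    "created": date.today().isoformat(),
    "sha256": hashlib.sha256(blob).hexdigest(),
    "cases": eval_cases,
}

with open("support-triage-eval.v1.json", "w") as f:
    json.dump(snapshot, f, indent=2)
```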
Treat your judging system like a regulated instrument. Tag it. Fix the prompt. Snapshot the rubric. When the judge drifts, you want to know before the score drifts.
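A sketch of the same idea for the judge itself: pin the model, the prompt version, and the rubric, and hash all three so every score can be traced to the instrument that produced it. The model name, prompt version, and rubric text below are hypothetical.

```python
# Pin the judge's configuration and derive a calibration ID from it.
# If the ID changes, scores from before and after are not comparable.
import hashlib, json

judge_config = {
    "judge_model": "judge-model-2026-01",  # pinned release, never "latest"
    "prompt_version": "grader-prompt/v3",
    "rubric": {
        "correctness": "Answer matches the reference resolution.",
        "grounding": "Every claim is supported by the provided context.",
    },
    "temperature": 0.0,
}

calibration_id = hashlib.sha256(
    json.dumps(judge_config, sort_keys=True).encode()
).hexdigest()[:12]

print(calibration_id)  # stamp this onto every score the judge emits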
Make evaluation a gate before deploys, not a post-mortem after. The same way your engineering team uses automated tests to catch broken code before it ships, your AI team should use evaluation to catch a broken model before users see it.
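As a sketch, the gate can be as blunt as a script in the deploy pipeline that fails the build when the candidate falls too far below the accepted baseline. The threshold, baseline score, and stubbed eval call are all illustrative.

```python
# Eval gate in the deploy path: non-zero exit blocks the release,
# exactly like a failing test suite. Numbers and names are illustrative.
import sys

BASELINE_SCORE = 0.87   # score of the currently deployed model on the frozen eval
MAX_REGRESSION = 0.02   # tolerated drop before the gate fails

def run_eval(model_id: str) -> float:
    """Run the frozen eval snapshot against the candidate.
    Stubbed with a fixed value here; wire it to your harness in practice."""
    return 0.84

candidate_score = run_eval("candidate-model")
if candidate_score < BASELINE_SCORE - MAX_REGRESSION:
    print(f"Eval gate FAILED: {candidate_score:.2f} vs baseline {BASELINE_SCORE:.2f}")
    sys.exit(1)
print("Eval gate passed")
```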
Pipe production signal back into the evaluation set. The regressions your users feel that don’t show up in your harness are exactly the gap to close.
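A minimal sketch of that loop: harvest interactions users flagged in production and stage them as new cases for the next tagged eval snapshot. Field names and the flag threshold are hypothetical, and the expected answers still need a human.

```python
# Turn user-flagged production failures into candidate eval cases.
# Field names and the flag threshold are hypothetical.
def harvest_regressions(production_logs, min_flags=2):
    """Keep interactions that multiple users flagged as wrong."""
    return [
        {"id": log["trace_id"], "input": log["prompt"], "expected": None}
        for log in production_logs
        if log.get("user_flags", 0) >= min_flags
    ]

new_cases = harvest_regressions([
    {"trace_id": "t-481", "prompt": "Summarize this contract clause...", "user_flags": 3},
    {"trace_id": "t-502", "prompt": "Draft a renewal email...", "user_flags": 0},
])
# Each new case needs a human-written expected answer before it joins
# the next tagged snapshot; otherwise the judge is grading against nothing.
print(len(new_cases))  # 1
```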
Borrow from the people in your building who already do this. This is where most programs miss the easy answer.
Every functioning research organization already employs people whose job is calibrating instruments, tracking drift, and defending measurements under regulatory scrutiny. Pharma has biostatisticians and biomarker validation scientists. Materials labs have metrologists. Clinical research operations run on statistical analysis plans and FDA validation pipelines. AI evaluation engineering is the same shape of problem in different clothes.
This is not an analogy I’m reaching for. AlphaFold ships with confidence intervals because structure prediction without calibrated uncertainty is unusable. Materials-science benchmarks have wrestled with contamination, distribution shift, and judge bias for years. The fact that evaluating a model is hard, slow, and instrument-dependent is news to AI. It is not news to science.
Don’t reinvent the discipline. Import it.
The line worth taking with you
The training cost story was about scale. The eval cost story is about confidence. A program that can train fast but can’t grade fast is shipping slowly. A program that grades fast but doesn’t track judge drift is shipping confidently into a fog.
If you can’t grade it, you can’t ship it. If you can’t grade it the same way next week, you’ve shipped nothing.
The bottleneck moved. Move with it.
A deeper, more technical companion to this post is coming on AIXplore shortly. It will cover the judge-variance work, contamination dynamics, and what an evaluation pipeline actually looks like inside a production AI program. This piece is the version that belongs in your leadership deck. The next one is for the engineers reading over your shoulder.


