The Model Is No Longer the Frontier
ACM convenes its first conference on agentic systems next week. Almost none of the papers touch the model. That's the story.
Every Sunday I pick one paper (or 10!) or release that’s worth your time, break it apart, and tell you why it matters. No hype. No summaries of summaries. Just the idea, explained.
Next week, ACM convenes the first edition of a conference that did not exist a year ago. Not a workshop bolted onto NeurIPS, not a model-release press event. A standalone Association for Computing Machinery conference, the kind that publishes a proceedings and sets a field’s agenda for the decade. It is called CAIS, the Conference on AI and Agentic Systems, it runs in San Jose from May 26 to 29, and its program already lists 61 peer-reviewed papers and 45 system demonstrations.
Read the call for papers and you notice what is missing. No track for scaling laws. No track for a new architecture that beats the last one on a benchmark. The four research areas are architectural patterns and composition, system optimization and efficiency, engineering and operations of compound AI systems, and evaluation and benchmarking. ACM, the most conservative naming body in computing, looked at the field and decided the interesting problem was no longer the model. It was the system around the model.
I spent the week reading the board. I ranked ten papers by how directly they touch the systems I run, decoded two of them line by line, and came out with one conclusion that I did not expect going in.
The frontier moved, and almost nobody is looking at where it went.
The Headline
Here is the whole argument in a paragraph. Models stopped being the bottleneck. The thing that decides whether an agent works in production is the scaffolding: how it improves itself, whether you can tell when it is quietly wrong, and whether you can govern what it is allowed to do. CAIS is the first major conference built around agentic-systems engineering, treating that scaffolding as the main event rather than the cleanup crew. Of the ten papers I pulled, the count that modify model weights is roughly zero. Every one is about turning a frozen model into a dependable agent. And the conference’s own vote board ranks them in almost exactly the wrong order for anyone who has to ship one.
The Concepts
Start with the inversion, because it reframes everything else.
On alphaxiv, the site where the AI research community upvotes the papers it finds most interesting, the conference runs a public vote board. As of mid-May, the runaway favorite, at 653 votes, is Scideator, a tool for scientific ideation. It is a lovely paper and I will come back to it. The papers sitting at the bottom of the vote count are an agent specification format at 9 votes, a schema for evaluating and routing requests across model gateways at 17, and a self-evolving agent memory system at 19.
If you have ever tried to put an agent into production, you already feel the problem. The 9-vote paper is the one you need on a Tuesday. The 653-vote paper is the one you screenshot for a keynote. The crowd rewards what is interesting to think about. The practitioner needs what holds weight when you build on it, and those are not the same papers. Once you see it, you read the whole conference differently: the boring papers are the roadmap, and the exciting ones are the scenery.
The vote board is a near-perfect inverse map of production relevance.
With that lens, the ten papers fall into three movements.
Movement one: getting better without retraining
For most of the deep-learning era, “make the model better at your task” meant one thing. Collect data, adjust weights, redeploy. That loop is slow, expensive, and it forgets nothing gracefully. The first cluster of CAIS papers throws it out.
FORGE is the cleanest statement of the shift. Its subtitle is the thesis: self-evolving agent memory, no weight updates. The agent gets better by accumulating and reorganizing what it has learned from its own runs, not by touching a single parameter. A second paper, on composing policy gradients with prompt optimization for language-model programs, comes at the same idea from a line of work known in the field as DSPy and GEPA. The move is to treat the prompt and the program wrapping the model as the thing you tune, and leave the weights frozen. The intelligence you add lives outside the network.
Scideator, the crowd favorite, belongs here too once you look past the application. Strip away the scientific-ideation framing and it is a recombination engine. It decomposes any paper into three facets, the Purpose, the Mechanism, and the Evaluation, then retrieves analogous work at deliberately varied conceptual distances and lets a researcher mix facets to seed new ideas. The mechanism that makes it work is not the language model. It is the facet decomposition. The authors prove it by removing the facet step and watching the system fall apart, a result that should be tattooed on the arm of anyone who has built a retrieval system: their novelty checker hits 89.66% accuracy at flagging non-novel ideas with the facet-aware reranking in place, and collapses to 13.79% without it. Same model, same corpus. The structure around the model did the work, and removing it took more than seventy points of accuracy with it. In a controlled study of 22 researchers, the tool lifted a standard creativity-support score from a median of 61 to 70.5.
Same model, same corpus. The structure around the model did the work, and removing it took more than seventy points of accuracy with it.
Three papers, one message. The improvement loop has left fine-tuning behind. It now lives in memory, in prompts, and in the structure you impose on retrieval. If your mental model of “improving an AI system” still starts with a training run, you are optimizing the layer that stopped moving.
Movement two: trusting what you cannot see fail
This is the movement I cannot stop thinking about, and it is anchored by the best paper I read all week.
The setup of “Trace-Level Analysis of Information Contamination in Multi-Agent Systems” is simple enough to explain at dinner. Take a team of specialist agents solving a hard task, the kind where one agent reads a PDF, another reads a table, another checks the facts, and a synthesizer writes the answer. Now corrupt the inputs the way the real world corrupts them. Swap two columns in a spreadsheet. Add scanner noise to a document. Blur an image. Run the same task clean and dirty, hold the workflow fixed, and watch what happens.
What happens is unsettling. The errors do not announce themselves. They propagate through the agent chain and poison the final answer without throwing a single exception. The authors name this information contamination, and across 614 paired runs they measure two things separately that everyone usually conflates: did the agent’s behavior change, and did the answer go wrong. Those two questions come apart. In 15.3% of runs the answer was silently corrupted while the execution trace barely moved. The system looked healthy and was lying to you.
Then comes the finding that should change how you operate agents tomorrow. Cost does not track correctness. Only 16.3% of the high-cost runs, the ones burning more than twice the normal tokens, actually succeeded. And 76.2% of the cheap, low-overhead runs failed silently. If you are watching token spend as a proxy for “the agent is working hard on a hard problem, probably fine,” you have the dashboard exactly backwards.
Expensive often means thrashing. Cheap often means confidently wrong.
The companion paper here is “Why Johnny Can’t Use Agents,” a study of the gap between what the industry promises agents will do and what users actually experience trying to use them. It is the human-facing version of the same truth the contamination paper proves mechanically. The demo works. The deployment is full of quiet failures that nobody designed a way to see. Most reliability work in this space watches for crashes. The crashes are the easy case. The dangerous failures are the ones where the trace looks normal, the cost looks fine, and the answer is wrong.
Movement three: governing the control surface
The third cluster is the least glamorous and the most enterprise. It is about what an agent is allowed to do, what it is running, and whether you can read what it did.
“Malice in Agentland” maps backdoors in the AI supply chain. The moment your agents install skills, pull in MCP servers (the connectors that hand an agent new tools), or run community plugins, you have inherited a software supply chain with all the trust problems of npm and none of the decade of hard-won defenses. A companion paper on tracking capabilities for safer agents comes at the same risk from the permission side: constrain what each component can touch, so a compromised piece cannot reach the whole system. Then there is the Open Agent Specification, the 9-vote paper, which proposes a declarative format for describing an agent and a standardized way to trace what it did. Add the gateway-routing schema and the structured-generation work, XGrammar-2, and you have the full legibility layer. A way to specify the agent, route its requests, force its outputs into a checkable shape, and audit the trace afterward.
None of this is exciting. All of it is what stands between a clever demo and something a regulated company will run.
This is the plumbing, and the plumbing is the frontier.
Why It Matters
If you lead a team that is building with agents, the three movements are not a literature review. They are a budget, a monitoring strategy, and a security posture. Here is what I would do with them.
Stop reading cost as a health signal. This is the single most actionable finding of the conference, and it is counterintuitive enough that it will not occur to your team on its own. The contamination paper proves that token spend and correctness are decoupled, and not weakly. Three quarters of cheap runs failed silently; five sixths of expensive runs failed too. If your agent observability is built on “alert me when spend spikes,” you are monitoring the wrong variable. The replacement is harder and unavoidable: outcome verification that is independent of the execution trace. A second check that looks at the answer, not at how much the agent thrashed to produce it. I have been running my own multi-agent setup with cost as a rough health proxy, and this paper is the reason I am rebuilding that assumption. The agents that worried me were the loud, expensive ones. The paper says the quiet, cheap ones are the threat.
Budget for memory and governance, not just inference. The whole improve-trust-govern stack costs real money and engineering time, and almost none of it is GPU. For two years the agent budget conversation has been about tokens and context windows. The CAIS papers are a forecast that the next year’s spend moves to the layers around the model: memory systems that let agents improve without retraining, verification systems that catch silent failure, governance systems that constrain what agents can reach. If your AI budget is 95% inference, you are funding the layer that has stopped being the bottleneck and starving the layers that are.
Treat skills, MCP servers, and plugins as a supply chain, because they are one. I have run a release-age cooldown on every package install across my homelab since a poisoned npm package made the rounds earlier this spring. Nothing new goes in until it has aged a week in public. “Malice in Agentland” and the capability-tracking work are the academic version of that instinct, and they generalize it. Every skill your agent loads is untrusted code with the agent’s permissions. The defenses are the unsexy ones from traditional software security: constrain capabilities, verify provenance, age your dependencies, audit the trace. The enterprise that gets this right early will be the one allowed to deploy agents in places that matter, because it can answer the question every risk committee will ask, which is what is this thing actually allowed to touch.
Read the vote board upside down. This is the meta-lesson, and it outlasts these specific papers. When you scan a conference, a launch, or a feed for what matters, the engagement signal points you at what is interesting to discuss, which is reliably not what is important to build on. The 9-vote agent specification is duller than the 653-vote ideation tool and far more likely to be infrastructure you depend on in two years. The discipline is to invert the popularity signal on purpose, to go looking for the paper everyone scrolled past because the title had the words “schema” or “specification” in it. The boring papers are where the field is moving. The exciting ones are where it has already been.
The deeper pattern under all of this is one that every technology repeats. It starts as a demo, a single impressive capability that makes people gasp. Then it becomes a system, and the interesting questions stop being “can it do the thing” and become “can I trust it, maintain it, and keep it from hurting me.” Databases made that move. The web made it. Distributed systems made it, and the engineering discipline that grew up around them is most of why anything online stays up. ACM launching a conference named for agentic systems, with tracks for operations and evaluation and composition, is the field announcing that the model era is the demo era, and the demo era is ending.
I run a multi-agent research system and a small fleet of autonomous agents on my own hardware, and everything in this post is something I have either been burned by or am about to rebuild because of these papers. The cost-as-health assumption was mine. The supply-chain cooldown came from getting scared, not from a paper. Reading the CAIS program felt less like learning something new and more like watching the academy catch up to what anyone running agents in anger already suspected. The model is the easy part now. Everything that makes the model useful, trustworthy, and safe is the hard part, and it is finally getting its own conference.
The frontier moved off the model. The papers that matter are the ones nobody voted for. And the work that decides whether any of it ships is all in the scaffolding.
The papers
All ten, grouped the way the post reads them, with their alphaxiv vote counts as of mid-May. The conference board is here. I decoded the first two line by line; the rest I read at the level the post describes.
Improve without retraining
Scideator: Human-LLM Scientific Ideation via Facet Recombination and Novelty Evaluation (653 votes) — the recombination engine; facet decomposition is the mechanism, not the model.
Composing Policy Gradients and Prompt Optimization for Language Model Programs (428) — tune the prompt and the program, freeze the weights. The DSPy / GEPA line of work.
FORGE: Self-Evolving Agent Memory With No Weight Updates (19) — agents improve by reorganizing what they learned from their own runs.
Trust what you cannot see fail
Trace-Level Analysis of Information Contamination in Multi-Agent Systems (74) — the cost-correctness decoupling; 614 paired runs, silent corruption.
Why Johnny Can’t Use Agents: Industry Aspirations vs. User Realities (151) — the human-facing version of the same gap.
Govern the control surface
Malice in Agentland: Backdoors in the AI Supply Chain (128) — your agents’ skills and plugins are an untrusted supply chain.
Tracking Capabilities for Safer Agents (91) — constrain what each component can touch.
Open Agent Specification (9) — a declarative format for an agent plus a standard way to trace what it did.
SEAR: Schema-Based Evaluation and Routing for LLM Gateways (17) — eval and routing across model gateways.
XGrammar-2: Structured Generation for Agentic LLMs (142) — force model outputs into a checkable shape.
Sunday Deep Dive is a weekly series on Run Data Run. Every Sunday I pick one paper (or 10!), release, or technique worth understanding, break it apart, and tell you what it means for your work. Free every Sunday, no paywall. If it was useful, the easiest way to support it is to subscribe and forward it to one person on your team who’d want it. If it wasn’t, tell me why. I’ll make it better.



