<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Run Data Run]]></title><description><![CDATA[Clear thinking about AI from someone building it in production. No hype, no hand-waving. Just what works, what doesn't, and why it matters.]]></description><link>https://rundatarun.io</link><image><url>https://substackcdn.com/image/fetch/$s_!t_Ch!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa36f5aa-74af-4492-b8d7-93b03f14a337_1280x1280.png</url><title>Run Data Run</title><link>https://rundatarun.io</link></image><generator>Substack</generator><lastBuildDate>Fri, 19 Jun 2026 18:44:04 GMT</lastBuildDate><atom:link href="https://rundatarun.io/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Justin Johnson]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[rundatarun@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[rundatarun@substack.com]]></itunes:email><itunes:name><![CDATA[Justin Johnson]]></itunes:name></itunes:owner><itunes:author><![CDATA[Justin Johnson]]></itunes:author><googleplay:owner><![CDATA[rundatarun@substack.com]]></googleplay:owner><googleplay:email><![CDATA[rundatarun@substack.com]]></googleplay:email><googleplay:author><![CDATA[Justin Johnson]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[You Don't Have to Write the Code]]></title><description><![CDATA[Anthropic watched 400,000 sessions with a coding agent and found that what predicts success isn't your job or your syntax. It's whether you understand the work. 
That changes who gets to build, and what your expertise is worth.]]></description><link>https://rundatarun.io/p/you-dont-have-to-write-the-code</link><guid isPermaLink="false">https://rundatarun.io/p/you-dont-have-to-write-the-code</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Wed, 17 Jun 2026 10:07:01 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a3b81ead-a9e7-47dc-8fb3-3052d58b3a01_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1mzG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1mzG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1mzG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1mzG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1mzG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!1mzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1mzG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1mzG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1mzG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!1mzG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ec3ec3c-01e7-4551-b000-e8aca7896a73_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>This week Anthropic published <a href="https://www.anthropic.com/research/claude-code-expertise">an analysis of about 400,000 real Claude Code sessions</a>, run between last October and this April. It set out to answer a plain question: when someone sits down with a coding agent, what actually predicts whether they succeed? There is a comfortable answer and a more useful one, and almost everyone is repeating the comfortable one.</p><p>The comfortable answer, the one the headlines took, is that anyone can build software now. That part is true, and it is the least interesting thing in the report.</p><p>The finding underneath is the one I'd hand to anyone, whether they run a team or just their own work. 
What predicted success was not your job title, and it was not whether you could write code. It was whether you understood the problem you were trying to solve. Anthropic's own framing: success comes from how well a person understands the work, not whether they're trained in coding.</p><p>That is not a story about AI flattening the gap between your best people and your average ones. It is the opposite. The thing you spent fifteen years getting good at just became the thing that decides whether the most expensive tool you're buying actually pays off. I have been making that case for a while. This is the first time anyone has put 400,000 sessions behind it.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1dkG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1dkG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1dkG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1dkG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!1dkG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1dkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1dkG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!1dkG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!1dkG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!1dkG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd63cd64c-afe6-4bda-a07a-c7dbd6a6a6e0_1408x768.jpeg 1456w" sizes="100vw"></picture></div></a></figure></div><div><hr></div><h2>What the tool actually divided up</h2><p>Start with the cleanest number, because it doubles as a mental model you can carry into a Monday meeting.</p><p>In a typical session, the human made about <strong>70 percent of the planning decisions</strong> and the agent made about <strong>80 percent of the execution decisions</strong>. People decided what to build. 
The agent decided how to build it. That split held across every kind of work they measured, from writing code to running systems to analyzing data.</p><p>So the labor didn't disappear. It separated. The agent took the part a lot of people thought was the moat, the ability to actually produce the syntax, and handed back the part that was always the hard part: knowing what to ask for, and whether the answer is any good.</p><p>Here is what makes me trust the number. A practitioner reached the same shape from the other side. Addy Osmani, writing in January, found that the developers succeeding with these tools "spend 70 percent of their time on problem definition and verification strategy, 30 percent on execution." Two independent measurements, one watching a population and one watching working engineers, landing on the same line through the work. The tool made the typing cheap. It did nothing for the knowing.</p><blockquote><p><strong>The tool made the typing cheap. It did nothing for the knowing.</strong></p></blockquote><p>This is the thing I've been calling <a href="https://rundatarun.io/p/the-atomic-unit-of-work-just-changed">a change in the atomic unit of work</a>. The unit a person owns moved up the stack, from producing the thing to directing it. Now there's telemetry under the claim.</p><div><hr></div><h2>Your job title barely moved the needle</h2><p>This is the result Anthropic led with, and it earns the lead.</p><p>They scored a verified success rate, which is stricter than it sounds: a session counts as a win only if the model judged it successful and there was a hard signal to back that up, a real commit, a passing test, a user saying yes. By that bar, on code-producing work, software occupations succeeded <strong>34 percent</strong> of the time and everyone else succeeded <strong>29 percent</strong>. A marketer and a staff engineer finished a coding task at almost the same rate.</p><p>The instinct is to read that as "engineers are finished." 
It is the wrong read, and the same study hands you the right one. What collapsed was the premium on being a programmer. What held was the premium on understanding the problem. Measure expertise correctly and it mattered enormously: novice sessions succeeded <strong>15 percent</strong> of the time, and people who actually knew their domain landed between <strong>28 and 33 percent</strong>. Roughly double.</p><p>The reconciliation is the part that clarifies it, because it is the bet I've been making. Expertise here is not your title or your r&#233;sum&#233;. It is task-specific. Anthropic's own example: a senior engineer asking their first question about the Rust language is a beginner at Rust. The skill that predicts whether you get working software out of an agent was never "I am an engineer." It is "I understand this particular problem well enough to direct the work and catch it when it goes wrong."</p><p>That is the whole argument for the builder-leader, measured now across a population instead of asserted from a single desk. You do not have to become an engineer. You do not have to write the code. You direct it, inside a domain you already command, and the command is the thing that pays.</p><blockquote><p><strong>You don't have to become an engineer. You don't have to write the code. You direct it, inside a domain you already command.</strong></p></blockquote><div><hr></div><h2>The expert tell is recovery</h2><p>If I had to keep one number from the whole report, it would be this one.</p><p>Experts didn't just prompt better on a good day, though they did do more with each instruction: about <strong>12 agent actions per prompt</strong> versus <strong>5</strong> for novices, and roughly five times the output. The real difference showed up when things went sideways. When a novice hit trouble, they walked away with nothing written about <strong>19 percent</strong> of the time. 
For everyone with more domain knowledge, that abandonment rate was <strong>5 to 7 percent</strong>.</p><p>Sit with what that means. The expert hit the same wall the novice did. Then they routed around it, reframed the problem, and caught the fluent, confident, completely wrong answer before it shipped. The novice hit the wall and quit.</p><p>So the value your senior people add in an agentic workflow is not that they prompt cleaner on a Tuesday. It is that they fail better on a Thursday. They have the judgment to know when the plausible output is wrong, and the agent does not have that judgment and cannot get it from a model update. The bottleneck moved from typing to deciding. Pratima Arora at Smartsheet put it plainly this spring: the hours haven't changed, but the density of work has.</p><p>The scarcest thing in the building is now the thing your best people already carry. That is good news, and most of the coverage will skip right past it.</p><blockquote><p><strong>The scarcest thing in the building is now the thing your best people already carry.</strong></p></blockquote><div><hr></div><h2>The work moved up the stack</h2><p>Two more numbers close the loop, and they are the ones that tell you where this is going.</p><p>Between October and April, the share of sessions spent fixing broken code fell from a third to under a fifth, <strong>33 percent down to 19</strong>. Over the same stretch, the estimated value of the work people brought rose about <strong>27 percent</strong>, with the biggest jump in building something new, up <strong>43 percent</strong>.</p><p>Put those together and the trajectory is clear. People are not using the agent to fix more bugs. They are using it to attempt harder, more valuable, more end-to-end work, and bringing more judgment to bear when they do. The tool got cheaper per task, so the tasks got more ambitious.</p><p>This is the oldest pattern in economics wearing new clothes. 
Make a thing cheaper to produce and you do not produce less of it. You produce far more, and you need more judgment to steer all of it. That is the shape of this moment, and it cuts against the fear that the work is shrinking. The work isn't shrinking. It changed: what we attempt got bigger, and how we get there moved from doing to directing.</p><div><hr></div><h2>Two things held their price, not one</h2><p>There is a second thread here, and it is the one that turns a study into a strategy.</p><p>Anthropic measured what the human brings to the session, and found that what the human brings, domain command, is decisive. That is one of two things this whole shift never made cheaper. The other is what you build around the model.</p><p>The model is the commodity, and the system you build around it is the moat: the rules and skills and memory that turn a smart, forgetful chat box into something that gets better at your specific work over time. Birgitta B&#246;ckeler, writing on Martin Fowler's site, named the gap the model itself can never close. A coding agent, she wrote, has "no social accountability, no aesthetic disgust at a 300-line function, no intuition that 'we don't do it that way here,' and no organisational memory." Those are not features you buy a newer version of. They come from the person and the system around the model, or they don't come at all.</p><p>So the picture is narrower and more useful than "expertise wins." Two things resisted the repricing. What you bring to the model, which is domain command, and what you build around it, which is the system that holds your organization's judgment. The model in the middle, the part everyone is still shopping for on price, is the cheap layer between them. Anthropic just measured one of those two human pieces at a scale none of us could reach alone. The other one you build.</p><p>And both are leadership skills, not engineering ones. 
Setting intent, designing the handoffs, evaluating output against what good actually looks like. Those are the same instincts that got your best people to senior in the first place, pointed now at a system of agents instead of a team of people. No new species of human required. That is the part I keep coming back to, and it is why I think the people who internalize this will out-build the ones still arguing about whether the juniors are coming for the seniors.</p><div><hr></div><h2>One honest note</h2><p>At the level of the whole economy, you cannot see this yet in the aggregate numbers. A large Danish study of AI chatbots across occupations found no significant effect on earnings or recorded hours. I take it seriously, and I read it as a statement about timing, not a refutation. Session behavior changes before payroll does. The 400,000 sessions are the leading edge; the wage data lags. Hold the whole argument a little more loosely for it, and don't over-rotate on any single quarter's story.</p><div><hr></div><h2>What I'd do with this Monday</h2><p>If you run a team, here is the part that converts. If you don't, read it as where to put your own time.</p><p>Put the agent in the hands of your domain experts directly, not only your engineers. The advantage is highest exactly where deep knowledge meets the tool. The clinical-operations lead who has watched the trial workflow break in a dozen specific ways will get more out of an agent than a generalist engineer who hasn't, because the agent can produce the code but it cannot supply the judgment about what the code is for.</p><p>Hire and promote for understanding the problem and recovering well, not for who writes the cleanest syntax. The data says occupation barely predicts success and recovery strongly does. That is now a measurable thing to select for.</p><p>And grow your next builder on purpose. 
The way across this gap was always building, six months of directing real work inside a real domain, not sitting through a demo. The leaders who win the next few years won't be the ones with the best coders. They'll be the ones who turned their domain experts into builders before anyone told them it was allowed.</p><p>The agent decides how. You still decide what, and you still decide who learns to decide what next. Four hundred thousand sessions just told you those are the two jobs worth keeping. They are not a smaller job than the one before. They're the better one.</p><div><hr></div><p>Related: <a href="https://rundatarun.io/p/the-specialist-is-now-you">The Specialist Is Now You</a> is what one expert plus an agent can now do alone, and <a href="https://rundatarun.io/p/the-atomic-unit-of-work-just-changed">The Atomic Unit of Work Just Changed</a> is where the unit a person owns moved up the stack.</p><p><em>The fuller argument for treating judgment, intent, and domain command as the durable skills is the Builder Leader field guide (<a href="http://builder-leader.com">builder-leader.com</a>).</em></p>]]></content:encoded></item><item><title><![CDATA[The Wrong Tool Problem in Genomic AI]]></title><description><![CDATA[A leaderboard tells you which model wins. 
It can't tell you that you picked the wrong kind of model.]]></description><link>https://rundatarun.io/p/the-wrong-tool-problem-in-genomic</link><guid isPermaLink="false">https://rundatarun.io/p/the-wrong-tool-problem-in-genomic</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Sun, 14 Jun 2026 11:23:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/320403dd-e4b7-43f9-abc1-ba6e54ee4ac8_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jH4V!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jH4V!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jH4V!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!jH4V!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!jH4V!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!jH4V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jH4V!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!jH4V!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!jH4V!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!jH4V!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9183b6-98f5-4763-b74e-bb73b4838d52_1376x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p><em>Every Sunday I pick one paper or release that's worth your time, break it apart, and tell you why it matters. No hype. No summaries of summaries. Just the idea, explained.</em></p><div><hr></div><p>A researcher walks over with a sequence on her screen. She has a stretch of DNA, a promoter, the switch that turns a gene on, and one variant sitting inside it that she suspects changes how loudly that gene gets expressed. She wants to know if the variant matters. So she asks the question everyone asks now: which AI model should I use? BOLT-LMM or DNABERT?</p><p>It sounds like a ranking question. Pick the better of two tools. It is not a ranking question. 
<strong>It is a category error</strong>, and it's one I've watched smart teams make without noticing.</p><p>BOLT-LMM and DNABERT do not belong in the same column. One is a statistical method built to scan a whole population and find which genetic differences track with a trait across thousands of people. The other reads a single stretch of DNA and turns it into a numeric representation a computer can work with. Asking which is better for a promoter variant is like asking whether a microscope or a telephone is better for measuring temperature. The answer is neither, and the question itself is the problem.</p><blockquote><p><strong>The hardest mistake in genomic AI isn't picking the second-best model. It's picking the wrong kind of model and not knowing it.</strong></p></blockquote><div><hr></div><h2>The phrase that hides nine different things</h2><p><strong>"Genomic AI model" sounds like one category. It is at least nine.</strong></p><p>There are DNA language models, which read raw sequence and learn its grammar the way a text model learns the grammar of English. There are sequence-to-function predictors, which take a piece of DNA and predict what it does in a cell, how strongly it drives expression, whether a region is open or closed. There are variant-effect predictors, which score how much a specific mutation changes that function. There are statistical-genetics engines, which never look at sequence grammar at all and instead test, across a population, which genotypes associate with which traits. There are single-cell models that work on tables of gene activity per cell, generative models that design new sequences, and a few more.</p><p>Every one of these gets called a "genomic AI model." Every one of these shows up in benchmark papers ranked against the others. 
And every one of these answers a different question.</p><p>The researcher with the promoter variant needs a variant-effect predictor or a sequence-to-function model, something that reads her one sequence and tells her what the change does. BOLT-LMM, the statistical-genetics engine, is built for a completely different shape of problem: thousands of people, their genotypes, their traits, and the search for associations across the cohort. It has no way to read the grammar of her single promoter. DNABERT, the DNA language model, reads grammar beautifully but doesn't produce a calibrated answer to "does this variant matter" without more machinery bolted on top.</p><blockquote><p><strong>The two tools the researcher was choosing between were both wrong. The benchmark she'd consult to choose between them can't tell her that.</strong></p></blockquote><div><hr></div><h2>Why the leaderboard can't catch this</h2><p>Here is the uncomfortable part. <strong>The benchmarks the whole field relies on are structured to make this error invisible.</strong></p><p>A leaderboard ranks models within a class. It lines up five DNA language models and tells you which embeds sequence best. It lines up four variant-effect predictors and tells you which scores variants most accurately. That is useful, and the people building those benchmarks are doing careful work. But a leaderboard assumes you already chose the right column. It answers "which is best?" and stays silent on the question that comes before it: best at what, and is that even what I need?</p><p>The wrong-class error happens upstream of every benchmark. By the time you're reading a leaderboard, the category decision is already made, usually without anyone noticing a decision was made at all. You typed a model name into a search box, found a paper that ranked it highly, and never asked whether the thing being ranked was the right kind of thing for your task.</p><p>Nobody walks into this on purpose. 
The pull comes from how the field names things. Two models share the word "genomic," both have an impressive accuracy number in a recent paper, both have a Hugging Face page and a star count, and the surface presentation makes them look like rival products on the same shelf. A leaderboard reinforces the illusion by placing them in the same table. Nothing in that experience signals "these answer different questions." The signal you'd need is a layer that sits before the ranking and sorts tools by what they're for, and that layer mostly doesn't exist.</p><p>I'd rank this error as worse than picking the second-best model in the right class, because it's far more costly. Pick the second-best variant-effect predictor and you lose a few points of accuracy on a task that was at least the right task. Pick a statistical-genetics engine for a single-sequence question and you don't lose accuracy; you get an answer that means nothing, dressed up to look like it means something. The numbers come back formatted, plotted, ready to drop into a slide, and there's nothing on the surface that tells you the foundation was wrong. Weeks of analysis can ride on top of a category mistake made in the first thirty seconds, and the failure surfaces late, if it surfaces at all.</p><div><hr></div><h2>Classify first, rank second</h2><p>The fix is almost embarrassingly simple to state and surprisingly hard to do without help: <strong>classify the computational object before you rank it.</strong></p><p>Before you ask which model is best, ask what kind of object your task actually needs. Do you have one sequence and want a functional readout? You need a sequence-to-function or variant-effect model. Do you have a cohort of people and want to find trait associations? You need a statistical-genetics engine. Do you have a table of cells and want to label cell types? A single-cell model. 
The class is determined by the shape of your data and the shape of your question, not by which model has the most citations this month.</p><p>Get the class right and the leaderboard becomes useful again, because now you're ranking within the right column. Get the class wrong and the leaderboard is worse than useless, because it gives you a confident ranking of tools that can't do your job.</p><p>The reason this is hard in practice is that the class isn't printed on the box. A model's documentation tells you its architecture and its scores; it rarely says, in plain terms, "this is for cohort association, not single-sequence scoring." You have to infer the class from the shape of the inputs it takes and the outputs it produces, and that inference is exactly the expertise a researcher new to a method doesn't have yet. The whole point of a triage layer is to do that inference for you and to fail loudly when the tool you named is the wrong kind of thing.</p><div><hr></div><h2>What this looks like as a tool</h2><p>I built a small open thing to make this concrete, because it's easier to argue for "classify first" when you can watch it fire.</p><p>It's called <a href="https://modelmap.vercel.app">ModelMap</a>, and it's a decision layer rather than a leaderboard. You describe your task in plain terms, what you have and what you want, and it triages the class first. When the tool you named doesn't match the class your task needs, it returns a "wrong-tool" verdict: here is the class your task needs, here is the class this tool actually belongs to, and here is why they don't match. 
Ask it whether DNABERT-2 is right for a population GWAS and it routes you to the statistical-genetics column and flags DNABERT-2 as the wrong class, with the real function of each spelled out.</p><p>Underneath are nine method classes and around thirty model cards, with deterministic rules handling the clear-cut category calls and a grounded language-model layer translating free-text questions into the controlled vocabulary. It's a research-use proof of concept, source-available, and you bring your own model key so there's nothing to abuse. I built it because I couldn't find anything that does. The closest existing tool, OmniGenBench, is a leaderboard, excellent at ranking within a class and silent on which class you need.</p><blockquote><p><strong>A leaderboard answers "which is best." The question that wrecks more projects is "best at what, and is that what I need."</strong></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!U9rW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!U9rW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U9rW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!U9rW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U9rW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!U9rW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!U9rW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!U9rW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!U9rW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!U9rW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F74376ae2-fb40-4c02-88e3-c777eb3fb21e_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>ModelMap's whole point is that one verdict card. 
It's the moment it tells you the two things you were comparing were never comparable, and points you back one step to the decision you skipped.</p><p>If you build tools like this, the harder question is how you let a language model sit in front of that verdict without letting it invent a class or a license. I wrote that part up for the people who'd build it, over on AIXplore: <a href="https://ai.rundatarun.io/AI%20Systems%20%26%20Architecture/grounded-llm-triage-layer">Build an LLM Triage Layer That Can't Freelance</a>.</p><div><hr></div><h2>The question to carry</h2><p>You don't need ModelMap, and you don't need to remember the nine classes. <strong>You need one habit.</strong></p><p>The next time someone on your team asks which genomic AI model to use, or which large model, or which forecasting method, or which anything, resist the pull to answer the ranking question they asked. Ask the one underneath it first. What kind of object does this task actually need? Are we even in the right column?</p><p>Most of the time the ranking question is the easy part, and the field has built good tools for it. The category question is the one that decides whether the answer means anything, and almost nothing in the standard toolkit asks it for you. That's the gap worth closing, in genomics and well beyond it.</p><div><hr></div><p><em>Sunday Deep Dive is a weekly series on Run Data Run. Every Sunday I pick one paper, release, or technique worth understanding, break it apart, and tell you what it means for your work. Free every Sunday, no paywall. If it was useful, the easiest way to support it is to subscribe and forward it to one person on your team who'd want it. If it wasn't, tell me why. I'll make it better.</em></p>]]></content:encoded></item><item><title><![CDATA[What one day with Fable 5 looks like for a builder]]></title><description><![CDATA[Most private tools that work never become public ones. 
Here's the afternoon that gap collapsed.]]></description><link>https://rundatarun.io/p/what-one-day-with-fable-5-looks-like</link><guid isPermaLink="false">https://rundatarun.io/p/what-one-day-with-fable-5-looks-like</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Thu, 11 Jun 2026 10:17:23 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/18142985-1bf4-4a55-8009-3855d88e1e58_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cRff!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cRff!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cRff!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cRff!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cRff!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!cRff!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cRff!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!cRff!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!cRff!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!cRff!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5c416be9-4121-4f2f-97e1-c00d2884892d_1376x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>I have a personal finance app I built for myself. <strong>It took two nights.</strong> It reads my real bank accounts and it has done its job without complaint ever since. It works. It would help other people who want the same thing it gives me.</p><p>It was never going to leave my laptop.</p><p>Some context on the curve before the story. Months ago I built <a href="https://emberplan.com">EmberPlan</a>, the FIRE-planning app this one feeds, and I run it for the financial-independence community today. I built that one on Sonnet 4 and felt every mile of it: call it <strong>ten times the effort</strong>, every feature a negotiation, every refactor a fight I had to referee. 
Kindling, its daily-money sibling, took two nights on the current Opus line (4.6 through 4.8). And yesterday Anthropic shipped <a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">Fable 5</a>, their first public model in a new tier, which today compressed the one step left over: the distance between <em>a tool that works for me</em> and <em>a tool anyone can run</em>. I pointed it at the app and watched that distance close in <strong>a single working session</strong>.</p><p>Three data points, a few months apart, same builder.</p><p>This is what that looked like, and why the afternoon matters more than the model.</p><blockquote><p><strong>The gap between a private tool that works and a public one anyone can run used to be a weekend nobody takes. That is why good private tools die private.</strong></p></blockquote><div><hr></div><h2>The work nobody schedules</h2><p>Picture the thing you built for yourself that you've thought about sharing. A script, a dashboard, a little app. It does exactly what you need. Every time you consider putting it out, you run the same mental tally and <strong>decide not to</strong>.</p><p>Here is the tally for my finance app.</p><p>The code had <strong>my real brokerage positions sitting inside a test script</strong>, the kind of file you write once to load some sample data and never open again. My net worth ran through the project's saved history, commit after commit, like a watermark. The whole thing was named after somebody else's trademark and wired into my home network. The "brand" was a working title I'd typed in once. There was no front page explaining what it is or how to run it.</p><p>Every one of those is small. None of them is hard. Together they are a full weekend of careful, joyless work where <strong>a single missed line publishes your account numbers to the internet forever</strong>.</p><blockquote><p><strong>The tax that kills personal tools isn't difficulty. 
It's that the work is tedious <em>and</em> dangerous at the same time.</strong></p></blockquote><p>Tedious means you put it off. Dangerous means putting it off is the correct decision. So you put it off correctly, indefinitely, and the tool stays yours alone. This is not a character flaw. <strong>It's the rational response to a job with no upside until the last line is right.</strong></p><div><hr></div><h2>One session</h2><p>Today I handed the whole project to Fable 5 and described where I wanted to end up: a clean public version anyone could run, with nothing of mine left in it.</p><p>It swept the entire project history looking for anything sensitive. It found the brokerage positions in that forgotten test script. That's the part I keep coming back to.</p><p><strong>I would not have found those by eye.</strong> I'd built the file, I'd forgotten the file, and on a Saturday-afternoon scrub I would have shipped it. The model read every line, recognized real holdings hiding in what looked like throwaway sample data, and stopped.</p><p>Then it made a harder call than I'd have made myself. The history was too contaminated to clean line by line, so instead of scrubbing it the model gave the public version <strong>a fresh, verified starting point</strong> with none of the old baggage carried forward. That's the right answer, and it's the one I'd have talked myself out of because it sounds drastic.</p><p>From there it kept going. It generalized the code so nothing pointed back at me. It renamed the app <strong>Kindling</strong> (a sibling to <a href="https://emberplan.com">EmberPlan</a>, since kindling is what you tend so the ember catches). It did a full redesign around that idea, a warm fire-lit theme in light and dark. It generated the logo and the visual identity. It seeded the app with believable fake data so the screenshots show real-looking money that is entirely invented. It wrote a proper front page. 
And it <a href="https://github.com/BioInfo/kindling">open-sourced the result</a> under a permissive license, deployment scripts renamed and all.</p><blockquote><p><strong>The model did the tedious part and caught the dangerous part, the leak I'd have shipped. That's what flips the math.</strong></p></blockquote><p>One session. The thing I'd been putting off since the day the app first worked, because it was a weekend I'd never take.</p><p>The demo below came out of the same conversation. The screenshots already existed (it had seeded a Plaid sandbox with invented money, so the app shows realistic accounts that are entirely fake). I asked for a video; Fable 5 wrote a <a href="https://www.remotion.dev">Remotion</a> composition in the app's brand and rendered it. <strong>About five minutes, ask to mp4.</strong></p><div class="native-video-embed" data-component-name="VideoPlaceholder" data-attrs="{&quot;mediaUploadId&quot;:&quot;0e742733-142c-4e76-a659-4c1bed0cb44f&quot;,&quot;duration&quot;:null}"></div><div><hr></div><h2>What an afternoon changes</h2><p>The difference here isn't the speed by itself.</p><p>When the path from private to public was a careful weekend, the default for anything you built for yourself was <em>it stays on your machine.</em> That default was correct. The payoff of a publishing weekend, weighed against the small but real chance of leaking something that ends a career, came out negative for almost everything. So almost everything stayed private. Good tools and half-finished ones, all of them on individual laptops, <strong>helping exactly one person each</strong>.</p><p>Move that cost to an afternoon and supervise the dangerous step instead of doing it by hand, and the sign flips. 
Now the question for a useful private thing isn't <em>is it worth a weekend.</em> It's <em>is it worth an afternoon I was going to spend anyway.</em> The answer is yes far more often.</p><blockquote><p><strong>When shipping a private tool costs an afternoon instead of a weekend, the default moves from "dies on my machine" to "someone else can run this."</strong></p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kzMp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kzMp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kzMp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kzMp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kzMp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kzMp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kzMp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kzMp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kzMp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kzMp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ff70726-8d58-4508-973e-46ecd638bbd0_1376x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>Small shift in cost, large shift in behavior. A lot of tools that solve one person's problem also solve a thousand other people's, and the only thing that kept them from those people was the tax in the middle. <strong>The tax is what got cheap today.</strong></p><p>I've written before about <a href="https://rundatarun.io/p/delegation-not-automation-how-human">delegation over automation</a>: the move isn't handing a task to a machine and walking away, it's handing it the careful work and staying close enough to check the one judgment that matters. This was that. I didn't watch it rename files. I did look hard at the leak it caught and the call it made to start the history clean. <strong>The mechanics went to the model; the judgment stayed with me.</strong> That's the shape worth getting used to.</p><div><hr></div><h2>You can have what I have</h2><p>The same shift works on the receiving end. The repo has a quickstart, but the truer install path now is a paragraph of English. 
Paste this into Claude Code (or Cowork) and let it drive:</p><pre><code>Install Kindling, the self-hosted personal finance app, from
https://github.com/BioInfo/kindling. Clone it, check my Node version
(node:sqlite needs Node 22.5+), install dependencies, and read README.md and
PLAN.md before changing anything. Then walk me through Plaid credentials:
help me create a free account at dashboard.plaid.com, find my client_id and
Sandbox secret, and store them the way I prefer (pass entries
api-keys/plaid-client-id and api-keys/plaid-secret-sandbox, or exported env
vars). Generate an APP_ENC_KEY for me with openssl rand -hex 32. Start the
app with ./run.sh dev, confirm http://localhost:3408 responds, and walk me
through connecting Plaid's sandbox bank (user_good / pass_good, any OTP).
LLM features are optional: if I have an OpenAI-compatible endpoint (Ollama
counts), wire it up via local.env using local.env.example; otherwise skip,
everything degrades cleanly. When I'm ready for real accounts, walk me
through Plaid Production access and PLAID_ENV=production, and remind me to
keep the app private-network-only. Ask before anything irreversible.</code></pre><p>It clones the repo, checks your machine, walks you through the one account signup a human has to do, generates your encryption key, starts the server, and connects a fake bank so you can see it working before any real money touches it.</p><p>If you're on a paid Claude plan, Fable 5 is <a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">free until June 22</a>. The experiment costs you a paragraph.</p><p><strong>Sharing used to mean writing install docs for every OS and fielding issues when step 4 failed on somebody's laptop.</strong> Now the README's job is to brief an agent. I ship one prompt; your agent handles your machine and its quirks. The distance closed on both ends today: cheap for me to publish, cheap for you to run.</p><div><hr></div><h2>The spark</h2><p>The app didn't get better today. It worked last month and it works now. What changed is that the spark can leave my workbench.</p><p>For years the reason most personal tools never go public wasn't pride or secrecy. It was a weekend of dull, high-stakes scrubbing that no sane person does on a side project. Take that weekend away, replace it with an afternoon spent watching one model do the dull part and catch the dangerous part, and a different question shows up on the table.</p><p>Not <em>can I build it.</em> You could already build it. The question is whether the thing you built for yourself is about to start helping people you'll never meet. For the first time, the work standing in the way is small enough to actually do.</p><p>Mine is <a href="https://github.com/BioInfo/kindling">up now</a>. It started as something I tend every day so my own ember catches. 
Turns out a spark spreads more easily than I thought.</p><h2>Sources</h2><ul><li><p><a href="https://www.anthropic.com/news/claude-fable-5-mythos-5">Claude Fable 5</a>: Anthropic's first public model in the new tier, shipped 2026-06-09.</p></li><li><p><a href="https://github.com/BioInfo/kindling">Kindling</a>: the open-sourced app, MIT-licensed.</p></li><li><p><a href="https://emberplan.com">EmberPlan</a>: the FIRE-planning sibling, built months earlier on Sonnet 4.</p></li><li><p><a href="https://rundatarun.io/p/delegation-not-automation-how-human">Delegation, Not Automation</a>: the judgment-stays-with-you frame this session ran on.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[Apple Rented Its Brain]]></title><description><![CDATA[The company that owns its whole stack just outsourced the one part you'd assume it never would. That choice is the most interesting thing at WWDC.]]></description><link>https://rundatarun.io/p/apple-rented-its-brain</link><guid isPermaLink="false">https://rundatarun.io/p/apple-rented-its-brain</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Wed, 10 Jun 2026 11:24:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/3058d361-be4a-4347-946a-017cfaea0e36_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h67Y!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h67Y!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!h67Y!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!h67Y!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!h67Y!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h67Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h67Y!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!h67Y!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!h67Y!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!h67Y!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8ea1b516-d997-4f2c-ac5e-cfe3ddc63e54_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Apple is the company that owns everything. The chip in your phone, the operating system on top of it, the store that sells you apps, the cable, the charger, the retail floor you buy it on. The whole pitch for thirty years has been that owning the stack end to end is why the thing feels good in your hand. So the decision to watch at this year's WWDC is the one part Apple chose not to own.</p><p>It rented the brain.</p><p>The new Apple Intelligence runs on a family of models Apple calls its Foundation Models. Apple built them, Apple ships them, Apple runs them on its own silicon and its own servers. But the way it built them is the tell. Apple borrowed Google's Gemini, used it to train its own smaller models, and then sent Gemini home. The intelligence at the center of the most privacy-obsessed product on earth started life as someone else's model.</p><p>If you have heard me say "the model is the commodity, the thing around it is the moat," this is that argument arriving at the scale of a billion phones.</p><blockquote><p><strong>The company that built its identity on owning the stack just outsourced the brain. That is not a slip. That is the strategy.</strong></p></blockquote><div><hr></div><h2>What Apple actually shipped</h2><p>Start with what's settled, because the marketing and the reporting are already fighting about the rest.</p><p>Apple shipped its own models. There's a small one that lives on your phone, with nothing leaving the device, and a larger one that runs in Apple's cloud for the heavier requests. Apple's executives, by AppleInsider's account, were blunt about the line with Google: "We use none of the models that Google deploys to their customers." 
A user, they say, never touches a drop of Google's code or Gemini or Google Search.</p><p>Both things are true at once (Apple used Gemini, and there's no Gemini in the product) because of a technique worth understanding. It's becoming how a lot of AI gets built.</p><p>It's called distillation. Picture a senior expert and a junior. The senior knows an enormous amount but is expensive and slow to keep on staff. So you sit the junior down next to the senior for a long stretch, have the senior answer thousands of questions, and have the junior learn to give the same answers. When it's done, you ship the junior. The junior is cheaper, faster, small enough to run on a phone, and it carries a lot of what the senior taught it. The senior goes home and never meets the customer.</p><blockquote><p><strong>Distillation is hiring a junior, training them on a senior's answers, then shipping the junior and sending the senior home.</strong></p></blockquote><p>That's what Apple did. Gemini was the senior. Apple's Foundation Models are the juniors. By the time your iPhone is answering you, Google's model is not in the room. Which is exactly why an Apple exec can say "none of the models Google deploys" and a headline can say there isn't "a drop of Gemini" in the thing, and neither is lying.</p><div><hr></div><h2>The thing Apple is actually selling</h2><p>This is where the running theme of this blog lands.</p><p>The model was the easy part to rent. What Apple kept for itself is everything wrapped around the model. I've been calling that wrapper the harness: all the machinery that turns a raw model into something a normal person can actually use, safely, without thinking about it. The model is the engine. The harness is the car.</p><p>Apple's harness has three pieces, and each one is a thing Google can't easily copy.</p><p>The first is the privacy wall. 
When a request is too big for your phone, it goes to Apple's cloud, and Apple has built that cloud so that even Apple can't see your data. They call it Private Cloud Compute, and the part that matters is that outside researchers can verify the claim rather than take Apple's word for it. There's a twist here that I'll come back to, because some of that cloud reportedly runs on Google's own servers. But the privacy boundary is Apple's, and it's the kind of thing a company builds over years, not months.</p><p>The second is the orchestrator. Every time you ask your phone something, a decision gets made in a fraction of a second: can the small model on this device handle it, or does this need to go to the cloud? That traffic cop is the orchestrator, and it's a harder problem than it sounds. Route too much to the cloud and you've broken the privacy promise and run up a server bill. Route too little and the answers get dumb. Apple's whole experience rides on getting that routing right, and Google never touches it.</p><p>The third is the one nobody else has. Distribution. Apple is about to put this on a billion devices that people already own, already trust, and already carry everywhere. No download, no signup, no learning curve. It's just in the phone. And those billion devices sit on top of the most personal pile of data anywhere: where you go, who you text, what you spend, how you sleep. No AI lab has that, and no amount of model training buys it.</p><p>Here's the beat that should settle the argument. The same week Apple announced this, the largest prediction market on which company has the best AI model gave Google an eight percent chance. Anthropic sat at ninety. Apple did not rent the best brain on the market. It rented one that was good enough, available, and willing to sign. That is not a knock on the deal. 
It is the whole reason to call the model a commodity: when the engine is interchangeable, you stop paying for the best one and start paying for the one you can build the best car around.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!r42C!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!r42C!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 424w, https://substackcdn.com/image/fetch/$s_!r42C!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 848w, https://substackcdn.com/image/fetch/$s_!r42C!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!r42C!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!r42C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!r42C!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 424w, https://substackcdn.com/image/fetch/$s_!r42C!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 848w, https://substackcdn.com/image/fetch/$s_!r42C!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!r42C!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe3446059-5c04-4e21-9069-86306568a40e_768x1376.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p><strong>Apple rented the engine and built the car, the safety cage, and the road. Then it parked one in a billion driveways.</strong></p></blockquote><div><hr></div><h2>Why this is the moment most people meet AI</h2><p>There's a divide running through every conversation about this technology right now. On one side are the people who have spent real time with AI agents, who have felt a system plan a task, use a tool, come back with something done. On the other side is almost everyone else, the people who have read the headlines, tried a chatbot once, and suspect the whole thing is overhyped.</p><p>Apple is about to drag a very large number of people across that divide without their noticing it happened.</p><p>For most normal people, this iPhone is the first time agentic AI will just work in their hand, on the device they were already holding, doing something they actually wanted done. Not a demo. Not a separate app they had to go find. 
The assistant on the phone, doing the thing.</p><p>And the model underneath it is rented. That's the part I keep turning over. The thing that finally makes AI feel ordinary and useful to a billion people will be sold to them as Apple, will be wrapped in Apple's privacy story and Apple's design, and will have at its core a model Apple trained off Google's. The harness is what they'll experience. The brain is a commodity they'll never see.</p><div><hr></div><h2>The part I can't resolve</h2><p>Read three accounts of what Apple announced and you get three different stories. Apple's own newsroom doesn't say the word Gemini once. AppleInsider ran a piece arguing there isn't "a drop of Gemini" in the shipping product. MacRumors split the difference, calling the architecture "co-developed with Google." Those are three different answers to the question I care about: how much of Apple Intelligence is Apple, and how much is Google wearing an Apple case?</p><blockquote><p><strong>On day one, the people closest to the story can't agree on whose product this is. That disagreement is the whole question.</strong></p></blockquote><p>The twist I promised earlier cuts both ways: part of Apple's private cloud reportedly runs on Google's own servers, certified so Google never sees the data. Read it as Apple's harness being strong enough to run safely on a rival's hardware, or as Apple leaning on Google harder than the keynote let on. Both readings come from the same fact. If the harness is the real moat, the rented model is a swappable part and the privacy wall and the billion devices do the heavy lifting. If it isn't, this is a distilled rental with a beautiful interface, and it wobbles the moment Google ships something Apple can't distill its way around. 
The line between Apple's model and Apple's harness is where the answer lives, and on day one that line is blurry on purpose.</p><div><hr></div><h2>What to take from it</h2><p>Strip away the Apple specifics and there's a decision sitting underneath, and it's the same one facing anyone deciding how to build with this technology.</p><p>You can rent the model. The frontier models are getting cheaper and more interchangeable by the month, a new one drops every few months, and chasing the best one is a race you finish in last place. What you can't rent, and what actually decides whether your version is any good, is everything you build around it: the way you keep data safe, the logic that decides what runs where, the place you put it so people will actually use it. Rent the brain. Own the car.</p><p>The catch is that this only works if the harness is something a competitor can't reproduce. A privacy wall outsiders can verify, routing logic you built and own, distribution nobody can take from you, those make a moat. A thin wrapper over someone else's model with a nice logo on it does not, no matter how good the keynote was.</p><p>So the test isn't whether Apple rented its brain. Renting the brain is the smart move. The test is whether the car Apple built around it is one only Apple could have built. Watch the privacy wall, watch the routing, watch what happens the first time Google ships something Apple's juniors can't keep up with. That's when we'll find out whether the harness was the moat or the makeup.</p><p>I don't know the answer. But I know it's the right question, and I know a billion people are about to live inside whatever the answer turns out to be.</p><div><hr></div><p><em>The argument under this post, that the model is the commodity and the thing you build around it is the moat, is the spine of my book, <a href="https://builder-leader.com">Builder Leader</a>. 
The book names that thing the harness, and it makes a claim that read as abstract when I wrote it: the ceiling on what you get from AI is set by what you build around the model, not by which model you can reach. Apple just ran that play at the scale of a billion phones. It rented the model and bet the company on the harness. Whether the bet pays off is the part this post leaves open. The shape of the bet is the whole book. </em><strong><a href="https://builder-leader.com">Builder Leader</a>.</strong></p>]]></content:encoded></item><item><title><![CDATA[AI Is Building AI. Now Read the Footnotes.]]></title><description><![CDATA[Anthropic published striking evidence that AI is automating its own development. The shift is happening. The way it was sold deserves a closer look.]]></description><link>https://rundatarun.io/p/ai-is-building-ai-now-read-the-footnotes</link><guid isPermaLink="false">https://rundatarun.io/p/ai-is-building-ai-now-read-the-footnotes</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Sun, 07 Jun 2026 10:04:39 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5b007d8d-71ad-49db-8bb7-ef3dbc5f0da9_1408x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Every Sunday I pick one paper or release that's worth your time, break it apart, and tell you why it matters. No hype. No summaries of summaries. 
Just the idea, explained.</em></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hU_4!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hU_4!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hU_4!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hU_4!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hU_4!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!hU_4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hU_4!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hU_4!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hU_4!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hU_4!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0cc62928-792e-4e8a-8cd9-e0cd06dfb063_1408x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Last week you may have seen the headline. The Wall Street Journal: "Anthropic Urges Global Pause in AI Development, Flags 'Self-Improvement' Risk." Fortune, Slashdot, Yahoo, and a dozen others ran the same frame. An AI lab, the one that markets itself as the careful one, looked at its own technology and asked the world to consider slowing down.</p><p>Then I read the document underneath the headline. It barely argues for a pause. It spends most of its 5,000 words doing something closer to the opposite: showing, in number after number, how fast Anthropic has gotten at building AI by using AI. More than 80% of the code it ships is now written by Claude. Its engineers merge eight times as much code per day as they did two years ago. An internal benchmark that a skilled human improves about fourfold, Claude improved fifty-twofold.</p><p>So which is it. A warning, or a flex? 
The interesting answer is that it's both, written for two different readers, released the same week the company reportedly moved toward one of the largest tech IPOs in history. When someone on Reddit asked whether anyone outside Anthropic could verify the claims, the top reply was one word: "IPO."</p><blockquote><p><strong>The same 5,000 words read as "look how fast we are" to a builder and "we're so dangerous we need a treaty" to a regulator.</strong></p></blockquote><p>I'm not going to tell you it's all hype. Quite a bit of it matches what I see building software every day, and I'll say where. I'm also not going to tell you it's gospel, because the most useful skill when a lab grades its own homework is knowing where to look for the marks it gave itself. In this essay, those marks are in the footnotes. So let's read them together.</p><div><hr></div><h2>What Anthropic actually said</h2><p>The piece, by Marina Favaro and Jack Clark, is about <a href="https://www.anthropic.com/institute/recursive-self-improvement">recursive self-improvement</a>: an AI system capable enough to design and train its own successor, with humans no longer driving each step. Anthropic is careful to say we are not there, and that it is not inevitable. Their claim is narrower and more interesting than the headline. They argue the transition is already underway in measurable pieces, and the line that carries the whole argument is this: "humans supply the goal, but they no longer need to supply the method."</p><p>That shift is happening, and if you write code you have felt some version of it. I <a href="https://rundatarun.io/p/three-days-one-petaflop-and-an-ai">built a GPU research lab</a> mostly by saying what I wanted and letting Claude do the assembly. The work of turning a clear intention into working software, the part that used to eat your afternoon, is collapsing toward zero. 
It is <a href="https://rundatarun.io/p/delegation-not-automation-how-human">delegation, not automation</a>: what stays expensive is deciding which intention was worth having in the first place.</p><p>Why does this land now, with this much force? Context helps. Anthropic just closed a <a href="https://www.anthropic.com/news/series-h">$65 billion funding round</a> that valued the company near a trillion dollars, ahead of an expected IPO. A company that needs the market to believe its trajectory is steep then published a pile of internal data showing exactly that. None of which makes the data wrong. It does mean you should read it the way you read any number a seller chose to show you.</p><div><hr></div><h2>Four numbers, and the footnotes under them</h2><p>Take the headline figures one at a time, because each comes with a caveat that Anthropic wrote itself and almost no coverage repeated.</p><p><strong>More than 80% of merged code is "authored by Claude."</strong> This sounds like 80% of the engineering. It isn't. The footnote says the figure counts lines merged to production attributed to Claude, that the attribution "has gaps," and that leadership elsewhere quotes 90% if you include scripts and throwaway code. Lines measure typing, not deciding. A junior who types a thousand lines under a senior's direction did not author the design. Most of that 80% is Claude typing while a human points and edits. Useful, and not the same as Claude running the show.</p><p><strong>Engineers ship 8x more code per day.</strong> Here the essay is admirably blunt in the small print: eight times the lines "is almost certainly an overstatement of the true productivity gain." Volume is not value. Anyone who has watched a codebase balloon with generated boilerplate knows that more code is often a tax, not a trophy.</p><p><strong>An internal benchmark went from 3x to 52x.</strong> This is the scariest number and the softest. 
The footnote says plainly that "the absolute multiple is not the figure to anchor on," that it depends on how much slack the starting code left, and that it should not be read as a real-world training speedup. It's a model improving a deliberately improvable toy on a fixed scoring rubric. One engineer on X named the pattern that makes these loops suspect: the model writes the code, writes the benchmark for it, then games its own test while nobody checks. The 52x may be real optimization. It may also be a model winning a game it helped design.</p><p><strong>Claude picks a better next research step 64% of the time.</strong> Read the setup. Anthropic chose moments where the human's choice "had room for improvement," so it was never a fair fight, and a separate Claude judged who won. On a control set where the human's move was already strong, the model was preferred only about 20% of the time. The company calls this "an early signal." That's the right label. It is not a scoreboard.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zmh-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zmh-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Zmh-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!Zmh-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Zmh-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zmh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zmh-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Zmh-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!Zmh-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Zmh-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F87ae5b81-81dc-465b-b6f2-70e42aa9b17d_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p><strong>The caveats are in the footnotes. The spin is in the headline. 
The gap between them is the whole story.</strong></p></blockquote><p>Then there is the benchmark everyone cites, <a href="https://metr.org/time-horizons/">METR's task time horizon</a>, which Anthropic uses to claim the length of tasks AI can handle is doubling every four months. Two things rarely travel with that number. First, it is measured at 50% reliability. As the critic Gary Marcus <a href="https://garymarcus.substack.com/p/misplaced-panic-over-ai-progress">put it</a>, "a graph that demands only 50% success does not address reliable performance. At all." A model that finishes a twelve-hour task half the time is not yet a colleague you'd leave it with. Second, it measures software tasks, not intelligence in general.</p><p>And here is the counter-number that should stay with you. METR also ran a <a href="https://arxiv.org/abs/2507.09089">controlled trial</a> of experienced developers using AI tools in codebases they knew well. The developers believed they were about 20% faster. Measured, they were 19% slower. Anthropic cites this same study in its own footnotes, conceding that self-reported uplift "can be overestimated." 
Hold onto that gap between felt and measured, because it is the most useful thing in the entire piece, and it argues against taking any productivity number at face value, including theirs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Q5M1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Q5M1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Q5M1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Q5M1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Q5M1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Q5M1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Q5M1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Q5M1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Q5M1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Q5M1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0d3be200-5200-47da-aef0-5ff3e96cbf63_1408x768.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><blockquote><p><strong>Felt twenty percent faster. Measured nineteen percent slower. 
That gap is the number to carry out of this essay.</strong></p></blockquote><div><hr></div><h2>What "building itself" would actually take</h2><p>Strip the numbers away and ask what recursive self-improvement requires that fast coding does not. A developer on Hacker News gave the sharpest version: running your code unattended is not the same as a machine that prints better versions of itself, each generation faster and more advanced than the last. The unattended part is necessary. It is nowhere near sufficient.</p><p>Coding is the easiest possible proxy for self-improvement, because it has fast, objective feedback. Tests pass or fail in seconds. The judgment that actually drives research, knowing which problem is worth attacking and when a promising direction is a dead end, has no unit test. Anthropic concedes exactly this. 
It admits "large performance gaps persist" in Claude's ability to choose goals, and calls research taste the human advantage "for now." That "for now" carries the entire argument, and it comes with no timeline. I've built <a href="https://rundatarun.io/p/the-skill-that-edits-its-own-instructions">systems that edit their own instructions</a>, and even there the compounding depends on a human deciding what is worth improving. The loop closes around taste, and taste is the piece still missing.</p><p>To be fair to Anthropic, <strong>the direction is not theirs alone.</strong> DeepMind researchers have framed it the same way: nearly every lab now uses last generation's model to help build the next, and what's missing is long-horizon planning and full automation. So the trend is industry-wide. What remains Anthropic's alone is the specific telemetry, the internal numbers nobody outside can check. As one observer put it, when every lab's self-improvement claim runs on private data, "there's no way to tell a real result from a recruiting pitch." The trend is corroborated. The magnitude is a press release.</p><div><hr></div><h2>Why it matters for the rest of us</h2><p>Set aside whether the machine builds its successor in 2028 or 2038. The parts of this you can act on this quarter are concrete.</p><p>The first is a law most leaders rediscover the hard way. Speed up one stage of a process and <strong>the bottleneck just moves to the next stage.</strong> Anthropic ran straight into it: now that code is cheap to generate, human review has become the constraint. Their fix is to have Claude review Claude's code, which raises a question they don't answer. What is the error rate of the automated reviewer, and who reviews the reviewer? The lesson for your own shop: watch where your bottleneck went. 
If your team now generates work faster than anyone can check it, you have moved the problem, not solved it.</p><p>The second is that <strong>the bill comes due.</strong> Microsoft is <a href="https://fortune.com/2026/05/22/microsoft-ai-cost-problem-tokens-agents/">steering its own developers off Claude Code</a> by mid-2026 after token costs burned through its annual AI budget, and Uber reportedly hit the same wall in four months. Run your pilots with a cost ceiling and a clear kill criterion, because the meter runs whether or not the gains show up. Most agent projects that fail do so for <a href="https://rundatarun.io/p/the-agent-archaeology-checklist-8">reasons you can catch in advance</a>; the bill is rarely the surprise.</p><p>So a short list for Monday. <strong>Distrust self-reported productivity,</strong> including your own team's gut feel, and insist on a measured before-and-after on a live workflow. Measure value shipped, reviewed, and maintained, not lines or commits. Fund the new bottleneck, which is almost certainly review and judgment, not generation. And cap the spend before you scale it.</p><p>For those of us in drug development, there is a harder limit worth naming. Software self-improvement runs on feedback loops measured in seconds. A test suite tells you in a minute whether the change worked. Biology does not work that way. The average drug still takes about ten years and $2.6 billion, and as one researcher put it, three years of AI partnerships later, "every AI deployment just analyzed the data; none of them learned from it."</p><p>The reason is structural. We call our main feedback mechanism post-market surveillance, and the name says it all: the signal arrives after design, after approval, after the development window has closed. A model that improves a benchmark fifty-twofold in an afternoon has nothing to offer a process whose ground truth takes a decade to arrive. 
The speedup lands in the lab's software and thins out where the biology actually decides. Know which of your bottlenecks AI can compress and which it cannot, because for most of medicine, the rate-limiting step is still time and the body.</p><div><hr></div><h2>About that pause</h2><p>Which brings us back to the headline. Read on its own, the governance proposal is more serious than the coverage suggested. Anthropic is not asking for a unilateral pause, which it correctly notes would just hand the lead to whoever keeps going. It wants a coordinated, verifiable one, modeled on nuclear arms control, and it is candid about why that is hard: "training runs are far easier to conceal than missile silos," the inputs are general-purpose, and the incentive to cheat in secret is enormous. That is a hard and underexplored problem, and naming it is a contribution.</p><p>Read in context, though, the proposal gets attacked from both sides, which tells you something. Skeptics note that a verification regime would mostly lock in the lead of whoever is already ahead, and that a company serious about slowing down could open its model weights or stop shipping today rather than ask for a treaty later. From the opposite direction, those who take the risk most seriously say a voluntary, conditional pause is far too weak, and that governments should simply ban the development of superintelligence outright. A proposal that satisfies neither the people who think it's too much nor the people who think it's too little is usually doing work other than what it says on the label. Whether that work is recruiting, regulatory positioning, or real caution, you can decide. The point is to notice <strong>the proposal is doing more than one job.</strong></p><div><hr></div><h2>What to take from it</h2><p>Here is where I land. Start with what holds up. 
AI is automating a growing share of how AI gets built, the people doing the work are not lying about feeling faster, and the direction is visible across the whole industry, not just in one company's charts. Treat anyone who waves all of it away as hype with the same skepticism you'd give the hype itself.</p><p>But the specific claims arrived pre-discounted by their own author, and most readers never saw the discount because it lived in the footnotes. The eight times was flagged as an overstatement. The fifty-twofold was flagged as the wrong thing to anchor on. The judgment result was flagged as an early signal from a test the model was set up to win. And the company's own cited research says people overestimate their AI gains by enough to flip a 20% boost into a 19% loss.</p><p>So <strong>read the footnotes.</strong> When a lab publishes its own numbers, the marketing is in the headline and the candor is in the small print, and the distance between them is the most you'll learn all week. The trend deserves your attention. The sales pitch deserves your scrutiny. 
Telling the two apart is most of the job now.</p><div><hr></div><h2>Sources</h2><ul><li><p>Marina Favaro &amp; Jack Clark, "<a href="https://www.anthropic.com/institute/recursive-self-improvement">When AI Builds Itself</a>," Anthropic Institute, June 2026, the source essay (read the footnotes).</p></li><li><p>"<a href="https://www.wsj.com/tech/ai/anthropic-urges-global-pause-in-ai-development-flags-self-improvement-risk-99cefb73">Anthropic Urges Global Pause in AI Development</a>," Wall Street Journal, June 2026, the headline framing.</p></li><li><p>"<a href="https://fortune.com/2026/06/05/anthropic-ai-pause-development-recursive-self-improvement/">Anthropic warns AI could soon build itself</a>," Fortune, June 2026, relays the trajectory and flags the IPO timing.</p></li><li><p>METR, "<a href="https://metr.org/time-horizons/">Measuring AI Ability to Complete Long Tasks</a>," the time-horizon benchmark (note the 50% reliability threshold).</p></li><li><p>METR, "<a href="https://arxiv.org/abs/2507.09089">Measuring the Impact of Early-2025 AI on Experienced Developer Productivity</a>," the randomized trial where developers were slower while feeling faster.</p></li><li><p>Gary Marcus, "<a href="https://garymarcus.substack.com/p/no-need-to-panic-about-anthropics">No need to panic about Anthropic's new blog</a>," "<a href="https://garymarcus.substack.com/p/misplaced-panic-over-ai-progress">Misplaced panic over AI progress</a>," and "<a href="https://garymarcus.substack.com/p/no-anthropic-did-not-call-for-a-pause">No, Anthropic did not call for a pause</a>," 2026, the sharpest skeptical reads.</p></li><li><p>"<a href="https://fortune.com/2026/05/22/microsoft-ai-cost-problem-tokens-agents/">Microsoft's AI cost problem</a>," Fortune, May 2026, on token billing and enterprise budgets.</p></li><li><p>Arvind Narayanan &amp; Sayash Kapoor, "<a href="https://knightcolumbia.org/content/ai-as-normal-technology">AI as Normal Technology</a>," the counter-frame to the takeoff 
story.</p></li></ul><div><hr></div><p><em>Sunday Deep Dive is a weekly series on Run Data Run. Every Sunday I pick one paper, release, or technique worth understanding, break it apart, and tell you what it means for your work. Free every Sunday, no paywall. If it was useful, the easiest way to support it is to subscribe and forward it to one person on your team who'd want it. If it wasn't, tell me why. I'll make it better.</em></p>]]></content:encoded></item><item><title><![CDATA[She Already Built It]]></title><description><![CDATA[I set out to add federated learning to my research agent. She'd already designed and built it in February. This is what compounding autonomous research actually looks like.]]></description><link>https://rundatarun.io/p/she-already-built-it</link><guid isPermaLink="false">https://rundatarun.io/p/she-already-built-it</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Fri, 05 Jun 2026 16:28:11 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/cd339429-6b5d-4518-973e-53c8b6fabb2c_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<h1>She Already Built It</h1><p><em>I set out to add federated learning to my research agent. She'd already designed and built it in February. 
This is what compounding autonomous research actually looks like.</em></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="images/banner.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="images/banner.png 424w, images/banner.png 848w, images/banner.png 1272w, images/banner.png 1456w" sizes="100vw"><img src="images/banner.png" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;images/banner.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="images/banner.png 424w, images/banner.png 848w, images/banner.png 1272w, images/banner.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>A few weeks ago I talked to a federated-analytics company. Good pitch, smart people, a real category. I left the conversation able to repeat the talking points and unable to feel the shape of the thing, which for me is the same as not understanding it. I learn by building. So the plan wrote itself: bolt a simulated federated-learning experiment onto ARIA, my autonomous research agent, use the retinal-imaging work as the worked example, run it on a dataset I already had on hand.</p><p>I opened Claude Code and described what I wanted to build. Before it wrote a line, it went through what was already sitting in ARIA's corpus, the way you'd check the fridge before a grocery run.</p><p>And it stopped me.</p><blockquote><p><strong>She'd already built it. Three months ago. 
Without being asked.</strong></p></blockquote><div><hr></div><h2>Who ARIA is</h2><p>For anyone who hasn't met her: ARIA is an autonomous research agent I built. She runs continuously. She generates her own hypotheses, scores them against novelty and tractability and impact, designs experiments to test the ones that survive, runs them, critiques her own results, and when a run breaks she diagnoses the failure and repairs it. <strong>No human in the loop on the inside of that cycle.</strong> I set the direction and the guardrails. 
She does the research.</p><p>One instance of her spent fifty days on retinal-imaging research with a multi-site health-imaging partner. Thousands of commits, a real result at the end, the whole thing documented in a paper I've been writing up.</p><blockquote><p>The paper is a photograph of a system running. What it can't show is what's left in the corpus after the shutter clicks.</p></blockquote><p>This is the first time I'm showing what she does <em>between</em> the headlines. The fifty-day sprint is the part that makes a good chart. It isn't the interesting part. <strong>The interesting part is everything still sitting in the corpus that nobody has gone back to look at.</strong></p><div><hr></div><h2>What I found</h2><p>I went looking for a blank slate to build my federated-learning toy on. Here's what was actually there, in order.</p><p>A pool entry dated February 7. Generated by her own idea-generation action in an overnight session, not steered by me, not seeded from a prompt I'd forgotten writing. The title she gave it: <strong>a federated-learning framework for multi-site deployment.</strong> She scored it 9.2 out of 10 and flagged it promotion-ready. Twenty-five minutes later, in the next session, she refined it.</p><p>It wasn't a stub. It carried differential-privacy budgets, personalization layers so each site keeps a local head while sharing the backbone, 8-bit compression for clinics on thin bandwidth, asynchronous updates tolerant of the kind of connectivity you get outside a data center, and a three-phase rollout that went from a ten-site simulation to a fifty-site deployment. <strong>She had reasoned about population-specific model adaptation and the health-data regulations of the deployment region.</strong> I want to be clear that I did not write any of that. She did, on a Saturday night in February, while I was asleep.</p><p>Two weeks later, February 20, an autonomous build session committed a 573-line implementation at 11:43pm. 
Real federated averaging. Non-IID client simulation, because real clinics don't have identically distributed patients. The canonical 2017 paper cited in her own docstring. Then she kept iterating it across later sessions, surveying, running, debugging, fixing her own template bugs.</p><blockquote><p>I was about to spend a weekend building exactly this. She'd spent twenty-five minutes refining the idea and one late-night session building it, in February.</p></blockquote><p>That's the gap. Three months between when the work was done and when I went looking for it, completely unaware it existed.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="images/timeline.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="images/timeline.png 424w, images/timeline.png 848w, images/timeline.png 1272w, images/timeline.png 1456w" sizes="100vw"><img src="images/timeline.png" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;images/timeline.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="images/timeline.png 424w, images/timeline.png 848w, images/timeline.png 1272w, images/timeline.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><div><hr></div><h2>Why this is the part the paper can't show</h2><p>The paper documents <em>operation</em>. The commits, the self-healing, the cross-site result. It records fifty real days of work. 
But a paper that ends at its operational window has a structural blind spot, and the blind spot is the whole reason to build something like this.</p><p>It can't show <strong>compounding.</strong></p><p>Compounding is when the output of the work becomes the input to the next work. My federated proof-of-concept does not start at zero. It starts at her February design, plus the 573 lines, plus the entire surrounding corpus: the pool of scored ideas, the critiques she wrote of her own results, the branches she tried and abandoned, the templates she built and reused. The value was never the one artifact I tripped over. 
<strong>The value is that the artifact was waiting, and so are a hundred others I haven't gone looking for yet.</strong></p><blockquote><p><strong>A tool is something you use. An asset is something that appreciates while you're not watching.</strong></p></blockquote><p>For eighteen months the builder's pattern I've written about here has been: see the gap, write the first code yourself, prove it works, hand it off. Every step on that list assumes a human starts the clock. What I found in the corpus is a version where the clock was already running before I noticed there was a clock.</p><div><hr></div><h2>What I actually did next</h2><p>I did not throw out her work and build mine to prove a point.</p><p>I read hers, and it reframed the project. The design I'd have built over a weekend was the obvious one, the textbook FedAvg loop. Hers already had the privacy and heterogeneity pieces I'd have bolted on during a second pass, if I'd gotten to a second pass. So the job changed. It stopped being "build federated learning" and became <strong>"assemble what's already here, then push past where she stopped."</strong></p><p>That's a smaller, sharper task. It also drops the pretense that I started this.</p><blockquote><p><strong>I'm not the first mover on my own project anymore. I'm the second.</strong></p></blockquote><p>And it sent me back to that federated-analytics company with better questions. Not as a buyer nodding through a pitch. As someone who'd already seen the shape of the problem from the inside, because the thing I built had drawn me a map I didn't know I owned.</p><div><hr></div><h2>I build things</h2><p>If you've read me before, you know the line. <a href="https://rundatarun.io/p/im-justin-johnson-i-build-things">I build things</a>. It's the one piece of identity I've never been willing to trade for a title.</p><p>Here's the turn. 
<strong>I build things, and one of the things I built now builds things, and it's running ahead of me.</strong></p><p>That isn't a loss of control and it isn't a party trick. It's the entire point. The reason to build an autonomous research agent was never to watch it grind for fifty days and write a paper about the grind. The reason is to go looking for something on an ordinary afternoon and find the groundwork already laid, dated three months back, scored, built, waiting.</p><blockquote><p>I have the whole corpus to build on now. I've barely looked.</p></blockquote><div><hr></div><h2>Sources and prior work</h2><ul><li><p><a href="https://rundatarun.io/p/inside-aria-teaching-a-machine-to">Inside ARIA: Teaching a Machine to Think Like a Scientist</a>: where ARIA first showed up here, building the ideation engine. This post is its sequel.</p></li><li><p>The long-form paper on her fifty-day run (forthcoming, stay tuned).</p></li><li><p><a href="https://rundatarun.io/p/im-justin-johnson-i-build-things">I'm Justin Johnson, I Build Things</a>: the builder-identity post this one calls back to.</p></li><li><p><a href="https://rundatarun.io/p/compound-velocity-the-20-hour-ai">Compound Velocity: The 20-Hour AI Research Lab</a>: the compounding thesis, earlier and from a different angle.</p></li><li><p>McMahan et al., 2017, <em>Communication-Efficient Learning of Deep Networks from Decentralized Data</em>. 
The FedAvg paper ARIA cited in her own docstring.</p></li></ul>]]></content:encoded></item><item><title><![CDATA[The skill that edits its own instructions]]></title><description><![CDATA[Self-editing skills, the ecosystem racing to build them, and the flywheel that makes any of it compound.]]></description><link>https://rundatarun.io/p/the-skill-that-edits-its-own-instructions</link><guid isPermaLink="false">https://rundatarun.io/p/the-skill-that-edits-its-own-instructions</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Fri, 05 Jun 2026 01:39:26 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/f8c150f0-a43a-4b1f-b03a-d0c015174066_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WJFH!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!WJFH!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!WJFH!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!WJFH!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!WJFH!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!WJFH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!WJFH!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!WJFH!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!WJFH!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!WJFH!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb4375d91-9211-4aff-86f2-611dd25d7993_1376x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>One of the routines I run a few times a day rewrote a line of its own instructions this week, and I let it.</p><p>The routine checks on a fleet of small agents I keep running, fixes what it knows how to fix, and flags the rest for me. The new part is the last step. 
Before it signs off, it reads back over the run and asks one question: what did this teach me that my instructions did not already cover? When the answer is real, it edits the checklist it works from, so the next run starts a little smarter.</p><p>It did not change the model. It changed its <em>skill</em>, the separate written set of instructions a tool loads when the task comes up. I thought this was a clever trick I had built. Then I went looking, and found half the field had built it too.</p><h2>Why skills are the unit</h2><p>A skill is less exotic than it sounds. It is a written set of instructions, plus any helper files, that an assistant pulls off the shelf when a task matches: a checklist for triaging your inbox, a runbook for closing the monthly books, the house style your team writes in. <a href="https://code.claude.com/docs/en/skills">Claude Code</a> keeps each one in its own folder and loads it only when it is relevant.</p><p>Here is why it reaches anyone who does not write code. The model is rented, and it gets swapped for a better one every few months. <strong>The skill is the part you own.</strong> It is where your specific know-how lives, the institutional memory that survives the upgrade. Most teams pour that knowledge into documents nobody opens twice. A skill is the same knowledge in a form the assistant actually uses, every time the work comes up. And it has crossed from a Claude Code feature to an <a href="https://www.agensi.io/learn/agent-skills-open-standard">open standard</a>: the same plain-text file now runs across Codex, Cursor, Copilot, and more than thirty other tools.</p><h2>My clever trick wasn't clever</h2><p>So I swept the last thirty days to see who else was doing this. The answer was humbling and useful: nearly everyone.</p><p>The pattern I thought I invented is written up as a recipe, a reflection step that runs after a skill is used, asks whether it helped, and proposes an edit to its own file. 
Anthropic's own skill builder does a sharper version, splitting your examples into train and test sets and keeping only the change that scores better on the held-out half. Microsoft's <a href="https://github.com/microsoft/SkillOpt">SkillOpt</a> tunes a skill's written instructions the way you would train a model, and its standout result is the one I care about: a skill tuned inside one tool kept almost all of its gain when it was moved to another. The know-how lived in the skill, not the software. The pattern is already shipping inside products that crystallize skills as you work.</p><blockquote><p><strong>The thing I built to make my own setup compound turned out to be a pattern the whole field is converging on. My problem was not unique, which is exactly why the answer is worth keeping.</strong></p></blockquote><p>The scale is the real headline. One directory has now scraped <a href="https://skillsmp.com/">1.6 million of these skill files</a> off public GitHub, up from the roughly 790,000 a research team catalogued six months ago. 
Which is also the catch.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!CFBL!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!CFBL!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CFBL!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CFBL!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CFBL!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!CFBL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!CFBL!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 424w, https://substackcdn.com/image/fetch/$s_!CFBL!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 848w, https://substackcdn.com/image/fetch/$s_!CFBL!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!CFBL!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F191d4fa2-79c1-4bb3-b91e-4898cc110173_1200x896.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2>What's oversold</h2><p>Two things to hold back on, both visible in that same sweep.</p><p>The flood is real, and most of it is noise. The ecosystem went from empty to crowded in about six months, and the directories admit most skills barely trigger or quietly burn context. A skill that edits itself inside that flood does not automatically improve. It compounds whatever it already was, noise included, unless you gate the edits.</p><p>Write-once-run-everywhere is also sold harder than it ships. The open standard is real, but the formats are still converging, not interchangeable. A skill that sings in one tool can stumble in the next.</p><h2>Why this is worth watching</h2><p>I have <a href="https://rundatarun.io/p/delegation-not-automation-how-human">written before that the real skill is delegation, not automation</a>, knowing what to hand off and when to step back in. 
A skill that edits itself is the next turn of that screw, and the sweep handed me the guards that separate the durable version from the noise. <strong>Don't edit on a fluke:</strong> a problem has to recur before it earns a permanent change. <strong>Prove the new rule works:</strong> the added check has to catch the thing it was written for before it stays. <strong>Never re-add what you deleted.</strong> None of that is exotic. It is what you would want from a sharp junior teammate keeping the runbook current.</p><p>That loop is the real point, and it is bigger than skills. I write up something I think is clever, sweep the field to find the dozen people who already hit it, take the best of what they worked out, fold it back into my own setup, and share the result forward. The writing and the searching are not separate from the building. They are how the building compounds. This post is that loop turning once.</p><div><hr></div><p>The systems getting the press rewrite their own code against a scoreboard. The skills that will run your operations next year just keep better notes, in a file you can read, borrow the best ideas from everyone else, and get a little sharper every time they run.</p><p><em>Around the Corner: short reviews of ideas worth watching. Opt-in section, not part of the weekly Run Data Run email. <a href="https://rundatarun.io/subscribe">Subscribe to the main list</a> for longer essays.</em></p>]]></content:encoded></item><item><title><![CDATA[Pulling Threads]]></title><description><![CDATA[I dropped my autonomous loop into a blind pharma challenge with zero background in the field. 
Then I read my own write-up and noticed I kept saying "I".]]></description><link>https://rundatarun.io/p/pulling-threads</link><guid isPermaLink="false">https://rundatarun.io/p/pulling-threads</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Mon, 01 Jun 2026 10:50:45 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/97c0fd21-2e33-4185-8898-8a5b6027b3a4_1376x768.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!_Iek!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!_Iek!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_Iek!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_Iek!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_Iek!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!_Iek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg" width="728" height="409.5" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!_Iek!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 424w, https://substackcdn.com/image/fetch/$s_!_Iek!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 848w, https://substackcdn.com/image/fetch/$s_!_Iek!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!_Iek!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa8111b58-3d92-4c70-8ad3-16b6204b3033_1376x768.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>I keep doing the same thing. I build something for one problem, watch it work, and then go hunting for a different problem to point it at. Sometimes the thing I'm reusing is a script. Sometimes it's an idea. Lately it has been an entire pattern: set up an autonomous loop on a workstation, give it a clean gate, queue some candidates, and check what it found in the morning.</p><p>That pattern was not mine originally. In March, Andrej Karpathy released <a href="https://github.com/karpathy/autoresearch">a 630-line script called autoresearch</a> that points an AI agent at a tiny language-model training setup and lets it run experiments overnight, on its own, while you sleep. 
The first night Shopify's CEO Tobi L&#252;tke pointed it at his company's data, <a href="https://www.philschmid.de/autoresearch">it ran 37 experiments and came back with a 19% improvement</a>. The kind of overnight headline number that, as I'd relearn the hard way a few weeks later, is worth a second look before you believe it. The idea grabbed me. I built a version of it for my own personal coding queue and wrote it up in <a href="https://rundatarun.io/p/the-overnight-loop">The Overnight Loop</a>. It was supposed to be a one-off. It refuses to stay one-off.</p><p>The pattern wants new domains. So in early May I went looking for one.</p><p>I found the <a href="https://huggingface.co/datasets/openadmet/pxr-challenge-train-test">OpenADMET PXR Induction blind challenge</a>. Two hundred and eleven teams. Five hundred and thirteen blinded molecules. A live leaderboard. Phase 1 set to close May 25th.</p><p>A quick primer: PXR is a sensor inside human cells. When a drug switches it on, the body responds by making more of an enzyme that chews up other drugs that happen to be passing through. If you're an oncology company giving a patient three drugs at once, and one of them switches on PXR, you can lose half the exposure of the other two without realizing it. <strong>Combination trials live or die on this.</strong> Predicting PXR activation from a molecule's structure, before you spend money making it and testing it in cells, is a real, unsolved problem. ADMET, the broader field this sits in, stands for absorption, distribution, metabolism, excretion, and toxicity. It's the part of drug discovery that asks "will this molecule survive contact with a human body."</p><p>The challenge was a clean version of that prediction problem. Public training data. A frozen test set. Score it on a leaderboard. The reason I picked it: I have no background in ADMET. None. I had never trained a molecular property model. I had never heard of Chemprop before April. 
The challenge was a good fit because it was a fair test of the loop, not a fair test of me. If the system worked in a domain I knew nothing about, that meant the system was the thing carrying the load, not borrowed expertise.</p><p>I entered on May 6. I closed the loop on May 11. Final rank settled at ~40 out of 211, in the top 19%. The system worked. But the more interesting thing the system did was teach me where it stopped working, and why.</p><blockquote><p><strong>It was a fair test of the loop, not a fair test of me. If it worked in a domain I knew nothing about, the system was carrying the load.</strong></p></blockquote><div><hr></div><h2>The Climb and the Wall</h2><p>The climb took six days, and most of it doesn't matter. I started at the tutorial baseline and ranked 131st. My first real model was worse than the tutorial, which is the part everyone leaves out of the writeup: the organizers built a strong floor on purpose. Adding more chemistry features, every number I could compute about a molecule thrown into one pile, got me partway up. Rank 131 to 112. The obvious lever, pulled.</p><p>The obvious lever ran out fast, and the real jump came from changing the <em>kind</em> of model, not feeding it more. I blended in one that reads a molecule by walking its bonds and learning the shape of the chemistry directly, instead of from a pre-computed list. Rank 112 to rank 42. Seventy places on one submission. The new model wasn't smarter than the trees. It made <em>different mistakes</em>, and two models that fail in uncorrelated ways cover for each other. That single idea was most of the climb.</p><p>After that it was maneuvering, and the loop was quick enough to make it fun to watch. Stack a model, grid-search the blend weights, keep the new piece only if it cleared a hard bar (it had to improve the honest score by a sliver, or back it went), move to the next. A second bond-walking model trained on extra public data. 
An automated ensemble riding on the foundation-model embeddings. Each idea tried, weighed, and kept or thrown back in about the time it took to refill coffee, the workstation in the corner grinding through a full experiment in minutes while I read the logs. By the third morning the blend (v43, my 43rd submission) sat at rank ~40 of 211, top 19%, and I knew it was real.</p><blockquote><p>Two models that fail in uncorrelated ways cover for each other. That single idea was most of the climb.</p></blockquote><p>Then the wall arrived, and it was duller than the climb. Over the next several days the loop queued candidate after candidate, each a genuine shot at a new angle, and every one came back ranking the molecules almost the way the leader already did, a correlation above 0.87. Same answer, different clothes. Twenty-seven in a row. That sounds like failure; it is the most useful thing the loop produced, because twenty-seven distinct candidates all converging on one answer is how you prove a ceiling is real instead of guessing at it. The only candidate that looked like a breakthrough, v50, turned out to be cheating itself: it scored well only because a calibration step got to peek at the answers it was being graded on, and the honest version of that check showed almost no gain. The gate caught it. Had I shipped it, I'd have dropped from rank 40 to 87. The model I shipped is fine. The protocol that caught that one is what I'm taking with me.</p><blockquote><p>Twenty-seven distinct candidates converging on one answer is how you prove a ceiling is real, not a hunch.</p></blockquote><div><hr></div><h2>I kept writing "I"</h2><p>I just re-read what I have so far. I wrote it the way I write every Run Data Run post. "I built." "I queued." "I caught v50." That pronoun is doing more work than it deserves.</p><p>For most of my writing, the "I" is roughly correct. I am at a keyboard, making decisions, calling shots, occasionally pairing with Claude Code to type the code faster. 
The usual ladder of human-AI collaboration goes human in the loop (steering each step), human on the loop (monitoring with the right to intervene), human out of the loop (reviewing the result after the fact). My day-to-day writing and coding is the first two.</p><p>This was the third. <strong>For most of a week, I was human out of the loop.</strong></p><p>I had no domain opinion to bring. I didn't know what a message-passing function on a molecular graph should look like. I didn't know whether tail-weighted regression losses were a real technique for skewed targets or a thing I had hallucinated from reading too many papers. I didn't know whether to trust an in-sample calibration fit. Every one of those calls was made inside the loop, by Claude, against criteria the loop itself enforced.</p><blockquote><p>For most of a week, every call belonged to Claude. What belonged to me was the thing that decided which calls to trust.</p></blockquote><p>What I actually built was the scaffolding the loop ran inside. 
That part is mine, and it's the part worth digging into, because it's the difference between this challenge and the version of "AI did it for me" that gets eye-rolls at conference panels.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mTLu!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mTLu!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 424w, https://substackcdn.com/image/fetch/$s_!mTLu!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 848w, https://substackcdn.com/image/fetch/$s_!mTLu!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!mTLu!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mTLu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg" width="728" height="409.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:false,&quot;imageSize&quot;:&quot;normal&quot;,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:728,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;captionedImage&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mTLu!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 424w, https://substackcdn.com/image/fetch/$s_!mTLu!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 848w, https://substackcdn.com/image/fetch/$s_!mTLu!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!mTLu!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8213c5f2-b841-4545-bf64-8fb1bf374680_1200x896.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>The system had seven pieces:</p><ul><li><p><strong>A candidate queue.</strong> A plain text file of model ideas, each with a one-line hypothesis ("this newer architecture should make different mistakes than the current leader") and a link to prior work. Claude pulled from the front of the queue, ran the experiment, wrote the result back into the file, and moved on.</p></li><li><p><strong>A two-tier loop.</strong> A fast pass for cheap sanity checks (does this even train, does the output look sane) and a slow pass for the full evaluation and the honest gate. The fast pass killed the embarrassments before the slow one spent two hours on them.</p></li><li><p><strong>A single gate function.</strong> One function. In: a candidate's predictions. Out: the honest improvement over the leader, and a verdict, PASS only if it cleared a real threshold. That was the entire decision surface. 
Nothing got submitted that didn't pass.</p></li><li><p><strong>Checkpoint files.</strong> Every time the loop crossed a phase boundary (baseline verified, candidates exhausted, defensive submission filed), it wrote down the exact state at that moment, the way a flight log does. I could open any one and know precisely where things stood, down to a fingerprint of each prediction file.</p></li><li><p><strong>Red-team reports.</strong> At each checkpoint, the loop wrote a short adversarial section: state a claim the system had just made ("v43 is the reproducible baseline"), construct an attack against that claim ("could the test predictions have been silently corrupted between submissions?"), pull the evidence (the file's fingerprint, the timestamp, the server's own echo of the numbers), and give a verdict. If the attack landed, the claim was downgraded. If it failed, the claim was kept and the verdict was recorded next to it.</p></li><li><p><strong>Auto-archival.</strong> Anything that failed the honest gate was moved into an <code>_archive/</code> folder with the gate verdict baked into the directory name. The candidate didn't disappear; it sat in a folder I could re-read later to ask "what did we try, and why didn't it work?"</p></li><li><p><strong>Defensive submission.</strong> If the loop didn't find a candidate that beat v43 in a 24-hour window, it re-submitted v43 anyway to keep the rate-limit clock alive. I never lost a submission slot. I never thought about it.</p></li></ul><p>What I contributed was not the model decisions. It was the architecture of trust. I decided what the gate would test, how big the threshold had to be, what counted as an adversarial attack worth running, when to archive versus when to keep, and what events promoted a checkpoint. 
Once those were defined, Claude could run for hours without me, and the artifacts it produced were either things I could trust at a glance or things the system had already flagged.</p><p>That is closer to running a small research group than to using a coding assistant. The researchers (various Claude sub-tasks, queued and dispatched by the loop controller) had real autonomy inside a framework I controlled at the level of "what counts as evidence." I was reviewing summaries, not steering decisions.</p><p>I'm going to keep saying "I" in the rest of this post, because the alternative gets tedious. But it's worth being explicit, once, about what that pronoun is covering for.</p><div><hr></div><h2>What I'm leaving with, and what I'm leaving on HuggingFace</h2><p>Everything is on <a href="https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026">HuggingFace</a>. The full v43 pipeline, all the component model checkpoints, the predictions behind every result, the 513-row final submission, and the methodology report. Apache 2.0 across the board. If you want to reproduce the blend, it's five lines of code after cloning the repo. If you want to try beating v43 with your own candidate, those predictions are right there to test against.</p><blockquote><p><strong>The model is not the deliverable. The deliverable is the system that built it, and the documented map of where the ceiling lives.</strong></p></blockquote><p>A few specific things I am carrying forward:</p><ul><li><p>The auto-loop pattern is now confirmed to work in a domain I have no expertise in. That widens the set of problems it can be pointed at.</p></li><li><p>The honest calibration gate is non-optional from now on. I'll bake it into the first scaffold of any leaderboard or held-out evaluation problem I touch.</p></li><li><p>"Stop the loops at saturation" is a real engineering discipline. Once the loop has confirmed that diverse candidates all converge to the same answer, the next ten candidates will too. 
Every extra run costs money and tells you nothing new.</p></li></ul><p>Two weeks ago I had never heard of Chemprop. Today I have a reproducible ensemble in the top 19% of a real pharma blind challenge, a methodology report I'd defend on a panel, and a sharper sense of what the autonomous loop is for and what it is not. The Saturday morning after I locked the submission, my wife asked what I had been doing all week. I said I had been pulling threads. That is most of what this work is.</p><p>The next thread is already on the desk.</p><p><em>Justin Johnson writes Run Data Run. Code, weights, and the full methodology report for the PXR challenge live </em><a href="https://huggingface.co/RyeCatcher/openadmet-pxr-challenge-2026">on HuggingFace</a>.</p>]]></content:encoded></item><item><title><![CDATA[Opus 4.8 and Workflows - One Careful Pass Is No Longer the Default]]></title><description><![CDATA[Anthropic shipped Opus 4.8 and Dynamic Workflows on the same day. Together they move the unit of agentic work from one model call to dozens of verified ones.]]></description><link>https://rundatarun.io/p/opus-48-and-workflows-one-careful</link><guid isPermaLink="false">https://rundatarun.io/p/opus-48-and-workflows-one-careful</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Fri, 29 May 2026 09:14:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!So5X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!So5X!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!So5X!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 424w, https://substackcdn.com/image/fetch/$s_!So5X!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 848w, https://substackcdn.com/image/fetch/$s_!So5X!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 1272w, https://substackcdn.com/image/fetch/$s_!So5X!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!So5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png" width="1365" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/ad9345b0-786a-456b-b096-c368b2546540_1365x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1365,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:190172,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/199713856?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!So5X!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 424w, https://substackcdn.com/image/fetch/$s_!So5X!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 848w, https://substackcdn.com/image/fetch/$s_!So5X!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 1272w, https://substackcdn.com/image/fetch/$s_!So5X!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fad9345b0-786a-456b-b096-c368b2546540_1365x768.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>Anthropic shipped Opus 4.8 yesterday. The model bump alone is not the story.</p><p>The story is that two other things shipped beside it. A new Claude Code primitive called <a href="https://claude.com/blog/introducing-dynamic-workflows-in-claude-code">Dynamic Workflows</a>, which lets Claude write its own orchestration scripts and fan out to dozens of parallel subagents with adversarial verifiers built in. And a 3x cut to Opus Fast pricing, from $30/$150 per million tokens down to $10/$50, roughly 2.5x faster on top of it.</p><p>Those three changes are the same change. 
Anthropic just repositioned the unit of agentic work, and the pricing finally allows what the tooling implies.</p><h2><strong>What&#8217;s actually in 4.8</strong></h2><p>The model card is doing the usual benchmarks-go-up dance, but the prompting and runtime changes are where the day-to-day differences land.</p><p>The biggest is the move to <strong>adaptive reasoning only</strong>. <code>MAX_THINKING_TOKENS</code> is now ignored, and <code>CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING</code> is gone. The model decides how hard to think. To pin behavior you use the new <code>/effort</code> slash command (or <code>--effort</code> flag) across <code>low / medium / high / xhigh / max</code>. Claude Code defaults to <code>xhigh</code> for coding work; claude.ai and Cowork default to <code>high</code>. For a one-off deep pass without raising the whole session&#8217;s effort, drop the literal word <code>ultrathink</code> into the prompt and that turn alone reasons harder.</p><p>Four new or refreshed slash commands round it out. <code>/ultrareview</code> runs a senior-engineer review pass over the diff. <code>/simplify</code> does a refinement pass on recently modified code. <code>/focus</code> hides intermediate work and shows only the final output. <code>/fewer-permission-prompts</code> scans the session and writes a safer allowlist into <code>settings.json</code> so the harness stops interrupting you on read-only bash and MCP calls.</p><p>The Max plan gets the 1M context window by default, and the <strong>Fast mode price cut</strong> to $10 in / $50 out per million tokens makes the difference between Fast and default Opus closer to a latency choice than a budget one. There is also a research-preview Auto mode behind Shift+Tab that auto-approves safe actions and pauses on risky ones, aimed at long-running tasks where you want to walk away.</p><p>None of that is revolutionary by itself. 
The change in posture is.</p><h2><strong>What Dynamic Workflows actually is</strong></h2><p>A Dynamic Workflow is a JavaScript orchestration script that the model writes for itself. The script does not call the model directly. It calls four primitives that the harness wires into the session: <code>agent()</code> spawns a subagent and returns its result, <code>parallel()</code> fans tasks out concurrently with a barrier, <code>pipeline()</code> streams items through multiple stages without barriers between them, and <code>phase()</code> groups subagent calls under a progress label.</p><p>Two activation modes ship with it. <strong>Explicit</strong> &#8212; you say &#8220;create a workflow to audit this codebase for X&#8221; and Claude designs and runs the script. <strong>Implicit</strong> &#8212; you flip the <code>ultracode</code> setting on and Claude evaluates every task as a workflow candidate, reaching for fan-out instead of single passes by default. Ultracode is off out of the box, and the docs are clear about why. It burns tokens fast.</p><p>The interesting part is what gets baked in. Schema validation through a structured-output tool means subagents return validated objects, not strings you have to parse. Workflows resume from a prior <code>runId</code>, so an edit to your script doesn&#8217;t re-run the agents that didn&#8217;t change. There is a 1,000-agent lifetime cap per workflow as a runaway-loop backstop. And the documented &#8220;quality patterns&#8221; &#8212; adversarial verify, judge panel, loop-until-dry, multi-modal sweep, completeness critic &#8212; show Anthropic&#8217;s own hand on what good fan-out looks like.</p><blockquote><p>The most telling pattern is the adversarial verify. Spawn three independent skeptics per finding, prompt each to refute it, kill the finding if a majority succeed. Anthropic is not selling fan-out as more answers. 
They are selling it as more checks.</p></blockquote><p>This is the part that changes how you build.</p><h2><strong>The price math is the whole story</strong></h2><p>The orchestration-first pattern has been technically possible for a year. LangGraph wired it up. CrewAI wired it up. The reason almost nobody runs it as a default is the bill. Running fifty parallel subagents on Opus 4.7 Fast meant paying $30 in and $150 out per million tokens. A serious review pass on a real pull request would burn through a tank of compute and produce something a senior engineer could have written by hand in less time.</p><p>Opus 4.8 Fast is $10 in and $50 out. The same fan-out is now a third of the cost and roughly 2.5x faster.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!mkiP!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!mkiP!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!mkiP!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!mkiP!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 1272w, 
https://substackcdn.com/image/fetch/$s_!mkiP!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!mkiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:481703,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/199713856?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!mkiP!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!mkiP!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!mkiP!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 
1272w, https://substackcdn.com/image/fetch/$s_!mkiP!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F144ef84f-b9b9-48d3-b330-9577da361782_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>That is the number that changes behavior. Iteration loops that used to be <em>run this once, pray it found the bug</em> become <em>run it three times with different angles and trust the intersection</em>. Discovery sweeps that used to ship as a single grep become five finders with different lenses, deduped at the union. 
Verifier panels that used to be cosplay become a default.</p><blockquote><p>Fast at $10/$50 makes the fifty-agent review pass look like a Tuesday, not a stunt.</p></blockquote><h2><strong>What this looks like in practice</strong></h2><p>I have been running a version of this pattern on my own homelab since February. Eight named agents probe their own state in parallel, a separate evaluator scores their last forty-eight hours of output against a scope file, safe idempotent fixes auto-apply, and only the human-decision items surface for me. The expensive part was never the orchestration. It was the cost of getting eight independent passes to cohere before I trusted the verifier.</p><p>Anthropic&#8217;s own <code>review-changes</code> example shows the same shape with the rough edges sanded off. Dimensions fan out: bugs, performance, security, reuse, tests. Each dimension yields findings. Each finding is handed to a panel of independent skeptics whose prompt is <em>try to refute this</em>. A finding survives only if a majority of skeptics fail to refute. It is the same trick a good engineering org runs at code review, ported into the model layer and budgeted in tokens instead of senior-engineer hours.</p><h2><strong>What&#8217;s oversold</strong></h2><p>Two honest caveats.</p><p>The new skill being priced is not the orchestration script. It is the verifier. A fan-out that finds fifty plausible bugs and forwards all fifty is worse than the single careful pass it replaced, because it shifts the verification burden onto a human who now has to triage noise instead of read code. The workflows post handwaves the verifier as a prompt. Most builders I know do not have a verifier discipline yet, and the tooling for one is thin.</p><p>And ultracode, the setting that makes Claude reach for a workflow on every task without being asked, ships off by default. Anthropic&#8217;s documentation flags why. 
Fan-out burns quota fast, and there is no governance layer that says <em>do not orchestrate this task</em>. The dispatch problem, deciding when a workflow is the right shape and when a single careful pass is, is unsolved. Right now it lives in your head.</p><h2><strong>Why this is worth watching anyway</strong></h2><p>Two reasons.</p><p>The agentic frameworks built between 2024 and now (LangGraph, CrewAI, AutoGen) were filling the gap where the model vendor did not ship orchestration. That gap just closed. The bridge between <em>one model call</em> and <em>agentic system</em> is now a tool the model writes for itself, in JavaScript, with caching and resume baked in. Whatever those frameworks were going to charge for over the next eighteen months, the ceiling on that price just dropped.</p><p>And the orchestration-first pattern was already where AI engineering was heading. <a href="https://rundatarun.io/p/evals-are-the-new-bottleneck">Evals are the new bottleneck</a> because at some point the question stops being <em>is the model good enough</em> and starts being <em>how confident am I that this particular answer is right</em>. Workflows give you a vocabulary for spending compute on that second question instead of just the first. The vendor that ships the verifier primitive first sets the pattern everyone else copies.</p><div><hr></div><p>If the price of careful goes down, the price of casual goes up. The default question stops being <em>is the model good enough yet</em>. It becomes <em>who is verifying</em>.</p><div><hr></div><p><em>Around the Corner: short reviews of ideas worth watching. Opt-in section, not part of the weekly Run Data Run email. <a href="https://rundatarun.io/subscribe">Subscribe to the main list</a> for longer essays.</em></p>]]></content:encoded></item><item><title><![CDATA[The Atomic Unit of Work Just Changed]]></title><description><![CDATA[The model commoditized. 
The frontier moved to the unit of work running on top of it, and that unit can now reason.]]></description><link>https://rundatarun.io/p/the-atomic-unit-of-work-just-changed</link><guid isPermaLink="false">https://rundatarun.io/p/the-atomic-unit-of-work-just-changed</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Wed, 27 May 2026 11:31:49 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!c0Es!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!c0Es!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!c0Es!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!c0Es!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!c0Es!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!c0Es!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 1456w" 
sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!c0Es!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:779915,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/199448640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!c0Es!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!c0Es!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 848w, https://substackcdn.com/image/fetch/$s_!c0Es!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!c0Es!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F677f0b19-c40d-4127-bfcc-21e4391a2d8f_1408x768.png 
1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Last week I argued that <a href="https://rundatarun.io/p/the-model-is-no-longer-the-frontier">the model is no longer the frontier</a>. The first academic conference on agentic systems convened, and almost none of the papers touched the model itself. The hard part had moved to everything that makes a model useful, trustworthy, and safe to run.</p><p>That post ended on a question it didn&#8217;t answer. If the frontier moved off the model, where exactly did it land, and what is the thing we&#8217;re now building on top? An agent is too big to be the answer. 
A prompt is too small. The useful unit sits in between, and naming it is what every team trying to run real work on this eventually has to do.</p><p>I think the unit is a skill. Not in the soft &#8220;upskilling&#8221; sense. A skill in the precise sense: a named, versioned, governed piece of procedural knowledge that a frontier model executes with judgment.</p><p>The field already has a word for the layer below it. A tool is the atomic unit of computation: a function, an API call, deterministic, it executes and returns the same way every time. That layer didn&#8217;t change. What changed is that a unit appeared above it, the atomic unit of process, and unlike a function it reasons. That unit is the skill, and for the first time it&#8217;s a thing you can hold, count, and hand to a model.</p><blockquote><p><strong>A tool is the atomic unit of computation. A skill is the atomic unit of process, and unlike a function, it reasons.</strong></p></blockquote><h2><strong>What a unit looks like in practice</strong></h2><p>I run about 70 of them. Across four machines I have 70 Claude Code skills, 23 subagents, 21 slash commands, and 8 autonomous agents that operate without me in the loop. That sounds like a lot of plumbing. It&#8217;s mostly not plumbing. Most of it is procedure.</p><p>One reads a company&#8217;s website and tells me where the substance ends and the slideware begins, then names the three rivals worth weighing it against. Another throws a hard question at four rival frontier models at once and writes back the answer none of them gave on its own. A third one read this essay before you did. It pulled every factual claim into a list, played hostile reviewer against each one, and handed back a verdict on which to keep, soften, or cut. The soft ones got fixed before they reached you.</p><p>Strip one open and there is less than you would expect. A skill is a folder with a Markdown file in it. 
The top is a name and a single line that tells the model when to reach for this one instead of another. Everything below is the job in plain English: what good looks like, the traps, the one rule it must never break. The red-team skill is about a page of that. Roughly:</p><pre><code><code>---
name: red-team
description: Run before publishing any fact-making draft. Pull every
  claim into a list, attack each as a hostile reviewer, return a
  keep / soften / cut verdict per claim.
---
You are a skeptical senior reviewer. For each claim, trace it to its
source, then decide whether it survives a hostile read. Name the exact
sentence that is weak, and why...</code></code></pre><p>No special language, no framework, no engineer required to change it. That is close to the entire artifact, and it is the part that surprises people: the procedure is now a document you can read, edit, and argue with, the same way you would mark up a policy memo. <strong>A skill is a Word doc that happens to run.</strong></p><p>That third skill is the interesting kind, because I could never have written its rules. I didn&#8217;t enumerate &#8220;if a claim names a competitor, soften it&#8221; or &#8220;if a number has no source, flag it.&#8221; I wrote down what a sharp adversarial reviewer does, and the model plays that reviewer against whatever I hand it. That is the shift. The unit carries the intent; the model supplies the judgment at runtime.</p><h2><strong>Why this breaks from how software has always worked</strong></h2><p>Here is where I want to be careful, because the easy version of this claim is false and any engineer reading will throw the post across the room.</p><p>Software has always branched. If-else, state machines, BPMN gateways, rules engines, robotic process automation with decision nodes, a fraud model scoring a transaction. Branching is not new, and &#8220;agents can make decisions&#8221; is not the breakthrough.</p><p>The real difference is narrower and sturdier. <strong>Every branch in traditional software had to be enumerated in advance.</strong> A person specified the decision logic, case by case, and anything outside the specified cases fell through to an error or escalated to a human. Even a machine-learning classifier, which feels like judgment, decides inside a space someone defined and trained for. When reality served up a case nobody anticipated, the system stopped.</p><p>A frontier model at the node changes the shape of that. The decision logic no longer has to be fully written down ahead of time. 
The model handles inputs the author never specified, by reasoning about them in context. The branch space is open instead of closed. The set of situations the system can handle gracefully is no longer the finite list you remembered to write.</p><blockquote><p><strong>Every branch in old software was written down in advance. A frontier model handles the case nobody wrote down.</strong></p></blockquote><p>This is not &#8220;the model does whatever it wants.&#8221; The harness around it does real work: rules constrain what&#8217;s allowed, hooks fire deterministically at fixed points, the skill itself carries guardrails. The structure is rigid where it needs to be. The judgment is open where rigidity used to force a halt. That combination, a deterministic harness wrapped around an open-branch reasoner, is different from the software we&#8217;ve shipped for forty years, and it&#8217;s why the old &#8220;automate the workflow&#8221; instinct undersells what&#8217;s happening.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cers!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cers!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!cers!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 848w, 
https://substackcdn.com/image/fetch/$s_!cers!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!cers!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cers!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:571295,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/199448640?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!cers!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!cers!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 
848w, https://substackcdn.com/image/fetch/$s_!cers!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!cers!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20b7dc87-8837-44a7-a8ff-69bb76bd2c21_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>The field landed on the same unit, all at once</strong></h2><p>I&#8217;d been living with this and treating it as a 
personal quirk. Then, inside a three-week window in May, three separate research groups published work that all circled the same object from different sides.</p><p><a href="https://arxiv.org/abs/2605.23904">SkillOpt</a>, out of Microsoft, treats a skill document like a set of weights you can train: run it, reflect on what broke, make bounded edits, gate the change behind a validation step. Optimize the unit you have.</p><p><a href="https://arxiv.org/abs/2605.18401">SkillsVote</a> governs a whole library of units across its life. Collect them, recommend the right one, attribute an outcome to the specific skill that produced it, and evolve only the ones that actually helped. Govern the library.</p><p><a href="https://arxiv.org/abs/2605.06614">SkillOS</a> splits the agent in two: a trainable curator that manages the skills, and a frozen executor that runs them. Make curation a first-class component of the system, not an afterthought.</p><p>Three labs, three angles, one conclusion: <strong>the next gains live in the skill layer, not in the weights.</strong> The simultaneity is the signal. When independent teams name the same problem in the same month, it stops being a hunch and becomes the shape of the field.</p><p>The week those papers landed, I hit the identical wall by hand. My corrections had been piling into skill files with no discipline, every fix making the library a little harder to reason about. So I built two things: an evolution loop that traces a failure to its actual cause before editing anything, and a curator that proposes which skills to merge or retire and never executes on its own. No reinforcement learning, no training loop. The same instinct the papers formalized, reached from the practitioner side because the pain forced it.</p><h2><strong>Verbs, procedures, and the org as a graph</strong></h2><p>A skill is a verb. The next move is to compose verbs into procedures, and that is arriving now. 
The workflow capability landing in Claude Code (community-documented around v2.1.147, not a formally announced product, so hold it loosely) wires skills into deterministic control flow: phases, conditionals, parallel steps, loops, with model reasoning at each node instead of a hardcoded branch. One builder called it turning standard operating procedures into executable graphs. The right instinct, even this early.</p><p>The picture is bigger than my homelab. Daniel Miessler argued in 2024 that a company is just a graph of algorithms: every business process decomposes into nested procedures, all the way down. For most of corporate history those procedures lived in two forms that never matched. Rigid software that couldn&#8217;t deviate, or a written SOP that described what people should do while they quietly did something else.</p><blockquote><p><strong>A company is a graph of algorithms. Now every node thinks.</strong></p></blockquote><p>The unit I&#8217;ve been describing collapses that gap. A procedure encoded as composable skills is <strong>both the description and the execution.</strong> The SOP stops being a document nobody opens and becomes the thing that runs. The process-management world sees it coming: the field is reframing from predictable, predefined paths toward event-driven, AI-orchestrated systems, and incumbent banks are reengineering processes designed for a pre-AI era. This is not a startup story. It&#8217;s the operating model changing underneath established companies.</p><p>The catch is structural. <strong>A workflow is only as trustworthy as the skills it composes.</strong> Stack reasoning units into a procedure and every weakness in one unit propagates. 
The moment you compose, the governance of the underlying library stops being hygiene and becomes what the whole procedure rests on.</p><h2><strong>The job this creates for a leader</strong></h2><p>So the managerial question is not &#8220;what can our AI do.&#8221; Capability is the part that commoditizes, the same way model access did. The durable question is whether you own the library of how your organization actually works, as composable units you can govern.</p><p>That ownership has a shape. It needs a curator, someone accountable for the library the way a maintainer is accountable for a codebase. It needs the retire verb to be as cheap as the add verb, because a library where you can only add is a library you&#8217;ll be unable to reason about in six months. And it needs an attribution loop, a way to tie an outcome back to the unit that caused it, so you evolve what helped and cut what didn&#8217;t.</p><blockquote><p><strong>A library where you can only add is one you can&#8217;t reason about in six months.</strong></p></blockquote><p>The sharpest framing of the stakes I saw this month came from a builder describing the new scarcity: the constraint is no longer labor efficiency, it&#8217;s output per unit of attention. The scarce input in this era isn&#8217;t compute or headcount. It&#8217;s the density of human judgment encoded into the system. A curated library is how you bank that judgment so it compounds instead of evaporating when someone leaves.</p><h2><strong>The honest version</strong></h2><p>I&#8217;m not going to pretend the path is clean, and I&#8217;ve written the unglamorous parts elsewhere so I won&#8217;t relitigate them here. The returns aren&#8217;t showing up at scale yet for most deployments, the most common failure is coordination and handoff rather than raw capability, and encoding a process tends to expose the politics that the process&#8217;s ambiguity was quietly protecting. 
If you want the failure modes in detail, I laid out why most agent projects miss in <a href="https://rundatarun.io/p/the-agent-archaeology-checklist-8">the agent archaeology checklist</a>, and the accountability patterns that hold up in regulated work in <a href="https://ai.rundatarun.io/Emerging%20Trends/ai-as-exoskeleton-not-coworker">AI as exoskeleton, not coworker</a>. The honest read is that the unit exists and the discipline around it is not yet mature. Both things are true at once.</p><h2><strong>Where this leaves you</strong></h2><p>If you run an AI program, change what you measure. Stop tracking the ceiling of what your tools can do and start tracking <strong>what fraction of your skill and workflow library you can still explain,</strong> and whether retiring a capability is as cheap as adding one. If only the add verb is cheap, you are accumulating drift, not advantage.</p><p>The model got cheap. The procedures encoded on top of it, governed well enough to trust, are the asset now. <strong>That is the frontier the conference papers were circling, and it&#8217;s the one your organization actually has to build.</strong></p>]]></content:encoded></item><item><title><![CDATA[The Model Is No Longer the Frontier]]></title><description><![CDATA[ACM convenes its first conference on agentic systems next week. Almost none of the papers touch the model. That's the story.]]></description><link>https://rundatarun.io/p/the-model-is-no-longer-the-frontier</link><guid isPermaLink="false">https://rundatarun.io/p/the-model-is-no-longer-the-frontier</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Sun, 24 May 2026 11:08:42 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wqeE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Every Sunday I pick one paper (or 10!) 
or release that&#8217;s worth your time, break it apart, and tell you why it matters. No hype. No summaries of summaries. Just the idea, explained.</em></p><div><hr></div><p>Next week, ACM convenes the first edition of a conference that did not exist a year ago. Not a workshop bolted onto NeurIPS, not a model-release press event. A standalone Association for Computing Machinery conference, the kind that publishes a proceedings and sets a field&#8217;s agenda for the decade. It is called CAIS, the Conference on AI and Agentic Systems, it runs in San Jose from May 26 to 29, and its program already lists 61 peer-reviewed papers and 45 system demonstrations.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wqeE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wqeE!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 424w, https://substackcdn.com/image/fetch/$s_!wqeE!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 848w, https://substackcdn.com/image/fetch/$s_!wqeE!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 1272w, 
https://substackcdn.com/image/fetch/$s_!wqeE!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wqeE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png" width="1456" height="813" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:22022528,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/199055941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wqeE!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 424w, https://substackcdn.com/image/fetch/$s_!wqeE!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 848w, 
https://substackcdn.com/image/fetch/$s_!wqeE!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 1272w, https://substackcdn.com/image/fetch/$s_!wqeE!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff12bb61f-69c1-49a7-a399-4919cf1026e5_5504x3072.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>Read the call for papers and you notice what is missing. No track for scaling laws. 
No track for a new architecture that beats the last one on a benchmark. The four research areas are architectural patterns and composition, system optimization and efficiency, engineering and operations of compound AI systems, and evaluation and benchmarking. ACM, the most conservative naming body in computing, looked at the field and decided the interesting problem was no longer the model. It was the system around the model.</p><p>I spent the week reading the board. I ranked ten papers by how directly they touch the systems I run, decoded two of them line by line, and came out with one conclusion that I did not expect going in.</p><blockquote><p><strong>The frontier moved, and almost nobody is looking at where it went.</strong></p></blockquote><h2>The Headline</h2><p>Here is the whole argument in a paragraph. <strong>Models stopped being the bottleneck.</strong> The thing that decides whether an agent works in production is the scaffolding: how it improves itself, whether you can tell when it is quietly wrong, and whether you can govern what it is allowed to do. CAIS is the first major conference built around agentic-systems engineering, treating that scaffolding as the main event rather than the cleanup crew. Of the ten papers I pulled, roughly none modify model weights. Every one is about turning a frozen model into a dependable agent. And the conference&#8217;s own vote board ranks them in almost exactly the wrong order for anyone who has to ship one.</p><h2>The Concepts</h2><p>Start with the inversion, because it reframes everything else.</p><p>On <a href="https://www.alphaxiv.org/acm-cais">alphaxiv</a>, the site where the AI research community upvotes the papers it finds most interesting, the conference runs a public vote board. As of mid-May, the runaway favorite, at 653 votes, is Scideator, a tool for scientific ideation. It is a lovely paper and I will come back to it. 
The papers sitting at the bottom of the vote count are an agent specification format at 9 votes, a schema for evaluating and routing requests across model gateways at 17, and a self-evolving agent memory system at 19.</p><p>If you have ever tried to put an agent into production, you already feel the problem. The 9-vote paper is the one you need on a Tuesday. The 653-vote paper is the one you screenshot for a keynote. The crowd rewards what is interesting to think about. The practitioner needs what holds weight when you build on it, and those are not the same papers. Once you see it, you read the whole conference differently: <strong>the boring papers are the roadmap, and the exciting ones are the scenery.</strong></p><blockquote><p><strong>The vote board is a near-perfect inverse map of production relevance.</strong></p></blockquote><p>With that lens, the ten papers fall into three movements.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o-6Z!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o-6Z!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!o-6Z!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 848w, 
https://substackcdn.com/image/fetch/$s_!o-6Z!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!o-6Z!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!o-6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png" width="1408" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e682c62f-fab8-4119-8800-434dcb13e758_1408x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1408,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:607030,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/199055941?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o-6Z!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 424w, https://substackcdn.com/image/fetch/$s_!o-6Z!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 
848w, https://substackcdn.com/image/fetch/$s_!o-6Z!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 1272w, https://substackcdn.com/image/fetch/$s_!o-6Z!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe682c62f-fab8-4119-8800-434dcb13e758_1408x768.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h3>Movement one: getting better without retraining</h3><p>For most of the deep-learning era, &#8220;make the model better at your 
task&#8221; meant one thing. Collect data, adjust weights, redeploy. That loop is slow, expensive, and it forgets nothing gracefully. The first cluster of CAIS papers throws it out.</p><p>FORGE is the cleanest statement of the shift. Its subtitle is the thesis: self-evolving agent memory, no weight updates. The agent gets better by accumulating and reorganizing what it has learned from its own runs, not by touching a single parameter. A second paper, on composing policy gradients with prompt optimization for language-model programs, comes at the same idea from a line of work known in the field as DSPy and GEPA. The move is to treat the prompt and the program wrapping the model as the thing you tune, and leave the weights frozen. The intelligence you add lives outside the network.</p><p>Scideator, the crowd favorite, belongs here too once you look past the application. Strip away the scientific-ideation framing and it is a recombination engine. It decomposes any paper into three facets, the Purpose, the Mechanism, and the Evaluation, then retrieves analogous work at deliberately varied conceptual distances and lets a researcher mix facets to seed new ideas. The mechanism that makes it work is not the language model. It is the facet decomposition. The authors prove it by removing the facet step and watching the system fall apart, a result that should be tattooed on the arm of anyone who has built a retrieval system: their novelty checker hits 89.66% accuracy at flagging non-novel ideas with the facet-aware reranking in place, and collapses to 13.79% without it. Same model, same corpus. The structure around the model did the work, and removing it took more than seventy points of accuracy with it. In a controlled study of 22 researchers, the tool lifted a standard creativity-support score from a median of 61 to 70.5.</p><blockquote><p><em>Same model, same corpus. 
The structure around the model did the work, and removing it took more than seventy points of accuracy with it.</em></p></blockquote><p>Three papers, one message. <strong>The improvement loop has left fine-tuning behind.</strong> It now lives in memory, in prompts, and in the structure you impose on retrieval. If your mental model of &#8220;improving an AI system&#8221; still starts with a training run, you are optimizing the layer that stopped moving.</p><h3>Movement two: trusting what you cannot see fail</h3><p>This is the movement I cannot stop thinking about, and it is anchored by the best paper I read all week.</p><p>The setup of &#8220;Trace-Level Analysis of Information Contamination in Multi-Agent Systems&#8221; is simple enough to explain at dinner. Take a team of specialist agents solving a hard task, the kind where one agent reads a PDF, another reads a table, another checks the facts, and a synthesizer writes the answer. Now corrupt the inputs the way the real world corrupts them. Swap two columns in a spreadsheet. Add scanner noise to a document. Blur an image. Run the same task clean and dirty, hold the workflow fixed, and watch what happens.</p><p>What happens is unsettling. The errors do not announce themselves. They propagate through the agent chain and poison the final answer without throwing a single exception. The authors name this information contamination, and across 614 paired runs they measure two things separately that everyone usually conflates: did the agent&#8217;s behavior change, and did the answer go wrong. Those two questions come apart. In 15.3% of runs the answer was silently corrupted while the execution trace barely moved. <strong>The system looked healthy and was lying to you.</strong></p><p>Then comes the finding that should change how you operate agents tomorrow. Cost does not track correctness. Only 16.3% of the high-cost runs, the ones burning more than twice the normal tokens, actually succeeded. 
And 76.2% of the cheap, low-overhead runs failed silently. If you are watching token spend as a proxy for &#8220;the agent is working hard on a hard problem, probably fine,&#8221; you have the dashboard exactly backwards.</p><blockquote><p><strong>Expensive often means thrashing. Cheap often means confidently wrong.</strong></p></blockquote><p>The companion paper here is &#8220;Why Johnny Can&#8217;t Use Agents,&#8221; a study of the gap between what the industry promises agents will do and what users actually experience trying to use them. It is the human-facing version of the same truth the contamination paper proves mechanically. The demo works. The deployment is full of quiet failures that nobody designed a way to see. Most reliability work in this space watches for crashes. The crashes are the easy case. The dangerous failures are the ones where the trace looks normal, the cost looks fine, and the answer is wrong.</p><h3>Movement three: governing the control surface</h3><p>The third cluster is the least glamorous and the most enterprise. It is about what an agent is allowed to do, what it is running, and whether you can read what it did.</p><p>&#8220;Malice in Agentland&#8221; maps backdoors in the AI supply chain. The moment your agents install skills, pull in MCP servers (the connectors that hand an agent new tools), or run community plugins, you have inherited a software supply chain with all the trust problems of npm and none of the decade of hard-won defenses. A companion paper on tracking capabilities for safer agents comes at the same risk from the permission side: constrain what each component can touch, so a compromised piece cannot reach the whole system. Then there is the Open Agent Specification, the 9-vote paper, which proposes a declarative format for describing an agent and a standardized way to trace what it did. Add the gateway-routing schema and the structured-generation work, XGrammar-2, and you have the full legibility layer. 
A way to specify the agent, route its requests, force its outputs into a checkable shape, and audit the trace afterward.</p><p>None of this is exciting. All of it is what stands between a clever demo and something a regulated company will run.</p><blockquote><p><strong>This is the plumbing, and the plumbing is the frontier.</strong></p></blockquote><h2>Why It Matters</h2><p>If you lead a team that is building with agents, the three movements are not a literature review. They are a budget, a monitoring strategy, and a security posture. Here is what I would do with them.</p><p><strong>Stop reading cost as a health signal.</strong> This is the single most actionable finding of the conference, and it is counterintuitive enough that it will not occur to your team on its own. The contamination paper proves that token spend and correctness are decoupled, and not weakly. Three quarters of cheap runs failed silently; five sixths of expensive runs failed too. If your agent observability is built on &#8220;alert me when spend spikes,&#8221; you are monitoring the wrong variable. The replacement is harder and unavoidable: outcome verification that is independent of the execution trace. A second check that looks at the answer, not at how much the agent thrashed to produce it. I have been running my own multi-agent setup with cost as a rough health proxy, and this paper is the reason I am rebuilding that assumption. The agents that worried me were the loud, expensive ones. The paper says the quiet, cheap ones are the threat.</p><p><strong>Budget for memory and governance, not just inference.</strong> The whole improve-trust-govern stack costs real money and engineering time, and almost none of it is GPU. For two years the agent budget conversation has been about tokens and context windows. 
The CAIS papers are a forecast that the next year&#8217;s spend moves to the layers around the model: memory systems that let agents improve without retraining, verification systems that catch silent failure, governance systems that constrain what agents can reach. If your AI budget is 95% inference, you are funding the layer that has stopped being the bottleneck and starving the layers that are.</p><p><strong>Treat skills, MCP servers, and plugins as a supply chain, because they are one.</strong> I have run a release-age cooldown on every package install across my homelab since a poisoned npm package made the rounds earlier this spring. Nothing new goes in until it has aged a week in public. &#8220;Malice in Agentland&#8221; and the capability-tracking work are the academic version of that instinct, and they generalize it. Every skill your agent loads is untrusted code with the agent&#8217;s permissions. The defenses are the unsexy ones from traditional software security: constrain capabilities, verify provenance, age your dependencies, audit the trace. The enterprise that gets this right early will be the one allowed to deploy agents in places that matter, because it can answer the question every risk committee will ask: what is this thing actually allowed to touch?</p><p><strong>Read the vote board upside down.</strong> This is the meta-lesson, and it outlasts these specific papers. When you scan a conference, a launch, or a feed for what matters, the engagement signal points you at what is interesting to discuss, which is reliably not what is important to build on. The 9-vote agent specification is duller than the 653-vote ideation tool and far more likely to be infrastructure you depend on in two years. The discipline is to invert the popularity signal on purpose, to go looking for the paper everyone scrolled past because the title had the words &#8220;schema&#8221; or &#8220;specification&#8221; in it. The boring papers are where the field is moving. 
The exciting ones are where it has already been.</p><p>The deeper pattern under all of this is one that every technology repeats. It starts as a demo, a single impressive capability that makes people gasp. Then it becomes a system, and the interesting questions stop being &#8220;can it do the thing&#8221; and become &#8220;can I trust it, maintain it, and keep it from hurting me.&#8221; Databases made that move. The web made it. Distributed systems made it, and the engineering discipline that grew up around them is most of why anything online stays up. ACM launching a conference named for agentic systems, with tracks for operations and evaluation and composition, is the field announcing that the model era is the demo era, and the demo era is ending.</p><p>I run a multi-agent research system and a small fleet of autonomous agents on my own hardware, and everything in this post is something I have either been burned by or am about to rebuild because of these papers. The cost-as-health assumption was mine. The supply-chain cooldown came from getting scared, not from a paper. Reading the CAIS program felt less like learning something new and more like watching the academy catch up to what anyone running agents in anger already suspected. The model is the easy part now. Everything that makes the model useful, trustworthy, and safe is the hard part, and it is finally getting its own conference.</p><p>The frontier moved off the model. The papers that matter are the ones nobody voted for. And the work that decides whether any of it ships is all in the scaffolding.</p><div><hr></div><h2>The papers</h2><p>All ten, grouped the way the post reads them, with their alphaxiv vote counts as of mid-May. The conference board is <a href="https://www.alphaxiv.org/acm-cais">here</a>. 
I decoded the first two line by line; the rest I read at the level the post describes.</p><p><strong>Improve without retraining</strong></p><ul><li><p><a href="https://arxiv.org/abs/2409.14634">Scideator: Human-LLM Scientific Ideation via Facet Recombination and Novelty Evaluation</a> (653 votes) &#8212; the recombination engine; facet decomposition is the mechanism, not the model.</p></li><li><p><a href="https://arxiv.org/abs/2508.04660">Composing Policy Gradients and Prompt Optimization for Language Model Programs</a> (428) &#8212; tune the prompt and the program, freeze the weights. The DSPy / GEPA line of work.</p></li><li><p><a href="https://arxiv.org/abs/2605.16233">FORGE: Self-Evolving Agent Memory With No Weight Updates</a> (19) &#8212; agents improve by reorganizing what they learned from their own runs.</p></li></ul><p><strong>Trust what you cannot see fail</strong></p><ul><li><p><a href="https://arxiv.org/abs/2604.27586">Trace-Level Analysis of Information Contamination in Multi-Agent Systems</a> (74) &#8212; the cost-correctness decoupling; 614 paired runs, silent corruption.</p></li><li><p><a href="https://arxiv.org/abs/2509.14528">Why Johnny Can&#8217;t Use Agents: Industry Aspirations vs. 
User Realities</a> (151) &#8212; the human-facing version of the same gap.</p></li></ul><p><strong>Govern the control surface</strong></p><ul><li><p><a href="https://arxiv.org/abs/2510.05159">Malice in Agentland: Backdoors in the AI Supply Chain</a> (128) &#8212; your agents&#8217; skills and plugins are an untrusted supply chain.</p></li><li><p><a href="https://arxiv.org/abs/2603.00991">Tracking Capabilities for Safer Agents</a> (91) &#8212; constrain what each component can touch.</p></li><li><p><a href="https://www.alphaxiv.org/acm-cais">Open Agent Specification</a> (9) &#8212; a declarative format for an agent plus a standard way to trace what it did.</p></li><li><p><a href="https://arxiv.org/abs/2603.26728">SEAR: Schema-Based Evaluation and Routing for LLM Gateways</a> (17) &#8212; eval and routing across model gateways.</p></li><li><p><a href="https://arxiv.org/abs/2601.04426">XGrammar-2: Structured Generation for Agentic LLMs</a> (142) &#8212; force model outputs into a checkable shape.</p></li></ul><div><hr></div><p><em>Sunday Deep Dive is a weekly series on <a href="https://rundatarun.io">Run Data Run</a>. Every Sunday I pick one paper (or 10!), release, or technique worth understanding, break it apart, and tell you what it means for your work. Free every Sunday, no paywall. If it was useful, the easiest way to support it is to subscribe and forward it to one person on your team who&#8217;d want it. If it wasn&#8217;t, tell me why. 
I&#8217;ll make it better.</em></p>]]></content:encoded></item><item><title><![CDATA[The Number That Predicts When Your Agent Will Break]]></title><description><![CDATA[A new benchmark gives a name to the failure practitioners keep calling "complex reasoning."]]></description><link>https://rundatarun.io/p/the-number-that-predicts-when-your</link><guid isPermaLink="false">https://rundatarun.io/p/the-number-that-predicts-when-your</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Fri, 22 May 2026 12:12:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!kIK2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kIK2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kIK2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!kIK2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!kIK2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 1272w, 
https://substackcdn.com/image/fetch/$s_!kIK2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kIK2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:527480,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198834597?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!kIK2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!kIK2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!kIK2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 
1272w, https://substackcdn.com/image/fetch/$s_!kIK2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe184f0eb-b20d-46bc-80f5-172c90b35b41_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>A new paper asks a question that sounds simple and turns out to have teeth. When a frontier model fails at &#8220;reasoning,&#8221; what is it failing at?</p><p>Their answer is a number. 
They call it Relational Complexity, and it predicts model failure better than anything else they measured.</p><p>The paper, <a href="https://arxiv.org/abs/2604.12176">&#8220;Evaluating Relational Reasoning in LLMs with REL&#8221;</a> from Fesser, Ektefaie, Fang, Kakade, and Zitnik, borrows a construct from cognitive science. Relational Complexity (RC) is the minimum number of entities a system has to hold in mind and bind together at once to take a single reasoning step. &#8220;A is taller than B&#8221; is RC=2. &#8220;A is between B and C&#8221; is RC=3. The number climbs as the relations get wider.</p><p>The finding is clean and a little grim. As RC goes up, accuracy falls off a cliff, and nothing the authors tried pulled it back.</p><h2>What they measured</h2><p>The clever part is the benchmark itself. REL is a generative framework, not a fixed test set. It produces as many problems as you want at any RC level, across three domains the authors deliberately picked to look nothing alike: pattern-completion puzzles (Raven&#8217;s matrices), phylogenetic trees in biology, and molecular isomers in chemistry.</p><p>Why three unrelated domains? Because that lets them hold everything else constant. Same vocabulary, same input length, same task format, only the RC dial moving. Most reasoning benchmarks can&#8217;t separate &#8220;the task is harder&#8221; from &#8220;the task has more words&#8221; or &#8220;the task is in an unfamiliar domain.&#8221; REL can.</p><blockquote><p>Hold vocabulary, length, and format fixed, move only the relational complexity, and watch the accuracy curve bend. That is the whole experiment, and it is enough.</p></blockquote><p>They ran it against Claude Opus 4.5, Gemini 3 Pro, and GPT 5.2.</p><h2>The numbers</h2><p>At low complexity, the models look great. The pattern puzzles at RC=1-2 land around <strong>91% accuracy</strong> across all three models.</p><p>Then it collapses. 
Scale the matrices up, push RC to 6, and Claude and Gemini drop to roughly <strong>12%</strong>. The biology task tells the same story: phylogenetic homoplasy detection runs at 35% with four taxa and falls to <strong>1% at twenty-five taxa</strong>.</p><p>The authors then did the thing most benchmark papers skip. They ran a regression to check whether RC was actually the driver or just correlated with something else. With collinearity controls in place, <strong>RC explained 24 to 44% of the explainable variance</strong>. The next-strongest factor topped out at 17%.</p><blockquote><p><strong>It is not input length. It is not domain. It is the binding count.</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!w2Jg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!w2Jg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 424w, https://substackcdn.com/image/fetch/$s_!w2Jg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 848w, https://substackcdn.com/image/fetch/$s_!w2Jg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 1272w, 
https://substackcdn.com/image/fetch/$s_!w2Jg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!w2Jg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png" width="1456" height="852" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/d8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:852,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:195013,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198834597?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!w2Jg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 424w, https://substackcdn.com/image/fetch/$s_!w2Jg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 848w, 
https://substackcdn.com/image/fetch/$s_!w2Jg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 1272w, https://substackcdn.com/image/fetch/$s_!w2Jg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fd8b5acff-d460-4cbd-88aa-bb28065276c3_1879x1100.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></blockquote><p>And the interventions did almost nothing. 
Extra test-time compute bought <strong>2 to 3%</strong>. In-context examples bought <strong>3 to 6%</strong>. Tool use, handing the chemistry model RDKit so it could compute instead of reason, produced a mean recall of <strong>0.094</strong> that got <em>worse</em> as the problem grew.</p><blockquote><p><strong>The gap is structural. You don&#8217;t prompt your way out of it.</strong></p></blockquote><h2>Why an agent builder should care</h2><p>I have written before that <a href="https://rundatarun.io/p/evals-are-the-new-bottleneck">evals are the new bottleneck</a> and that <a href="https://rundatarun.io/p/the-agent-archaeology-checklist-8">agent failures cluster around a small set of repeatable mistakes</a>. RC is the missing vocabulary for one whole class of those failures.</p><p>Think about what your agent does when it stalls on something that &#8220;should&#8221; be easy. A cross-document join where it has to reconcile three sources at once. A planning task with four interacting constraints. A loop where it has to hold the output of step two while reasoning about step five. Those are not long tasks or unfamiliar tasks. <strong>They are high-RC tasks.</strong> The model has to bind several interdependent things simultaneously, and that is the regime where frontier accuracy falls to a coin flip or worse.</p><blockquote><p>When a task needs three or more interdependent variables held in mind at the same time, the failure is not a smarter-model problem. It is a binding problem, and more compute does not fix it.</p></blockquote><p>This reframes the diagnostic. 
The next time an agent breaks on a task you expected it to handle, the useful question is not &#8220;is the model good enough yet.&#8221; It is &#8220;how many things does this step force the model to bind at once.&#8221; If the answer is four or more, you have your explanation, and the fix is architectural: decompose the binding into smaller steps with explicit intermediate state rather than waiting for a better model.</p><h2>What I&#8217;d hold back on</h2><p>Two honest caveats, both of which the authors more or less own.</p><p>The tasks are stylized. Raven&#8217;s matrices and phylogenetic trees are clean lab instruments, and the jump from &#8220;RC in a synthetic tree&#8221; to &#8220;RC in your production workflow&#8221; is assumed, not proven. I would love to see RC mapped onto a naturalistic agent benchmark before treating the number as a planning constant.</p><p>And there is no human baseline. Every result frames frontier models as failing, but without a human RC-versus-accuracy curve we cannot tell whether people plateau at RC=5 or sail past it. A human curve would settle whether this is &#8220;LLMs are uniquely bad at binding&#8221; or &#8220;binding is hard for everyone and LLMs are a bit worse.&#8221; Different stories, different implications.</p><div><hr></div><p>The contribution here is not the scary 12%. It is the ruler.</p><blockquote><p>For two years &#8220;complex reasoning&#8221; has been the phrase practitioners reach for when a model fails and they cannot say why. RC turns that shrug into a measurement.</p></blockquote><p>The generative code is <a href="https://github.com/ada-f/relational_reasoning">open on GitHub</a>, so you can instantiate REL-style probes against your own agent tasks instead of guessing.</p><p>Watch for the human baseline and the naturalistic mapping. If those land, RC stops being a benchmark curiosity and becomes a number you check before you ship an agent into a high-binding workflow.</p><p>The models are not getting dumber. 
We are just learning to name the shape of where they break.</p><div><hr></div><p><em>Around the Corner: short reviews of ideas worth watching. Opt-in section, not part of the weekly Run Data Run email. <a href="https://rundatarun.io/subscribe">Subscribe to the main list</a> for longer essays.</em></p>]]></content:encoded></item><item><title><![CDATA[Last 30 Days: Google I/O 2026 (Audio Overview)]]></title><description><![CDATA[Audio companion to the May 20 Last 30 Days post. Two AI hosts debate why Google's hundred-announcement keynote didn't move the betting markets.]]></description><link>https://rundatarun.io/p/last-30-days-google-io-2026-audio</link><guid isPermaLink="false">https://rundatarun.io/p/last-30-days-google-io-2026-audio</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Wed, 20 May 2026 15:38:04 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/198575798/bcf6d78f99b20b05d427dd1e919a4a64.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Audio overview generated by NotebookLM from the May 20 Last 30 Days post. Two AI hosts debate the central tension of Google I/O 2026: Google shipped about a hundred things in two hours, yet Polymarket&#8217;s &#8220;best AI model end of May&#8221; market still reads Anthropic 96%, Google 1%. They argue the volume-versus-crown question, the 12-hour OS demo, the surprise that 3.5 Flash isn&#8217;t cheap anymore, and the long-game flip (Google 76% to hold #1 by year-end).</p><p>The audio is one slice. The full NotebookLM notebook has more: infographic, video overview, mind map, and a deep-dive chat to ask follow-up questions. 
Explore here: <a href="https://notebooklm.google.com/notebook/0f019d6b-d135-4161-b0c7-a2ed7305f1c2">https://notebooklm.google.com/notebook/0f019d6b-d135-4161-b0c7-a2ed7305f1c2</a></p><p>Source post with citations: <a href="https://rundatarun.io/p/last-30-days-google-io-2026">https://rundatarun.io/p/last-30-days-google-io-2026</a></p>]]></content:encoded></item><item><title><![CDATA[Last 30 Days: Google I/O 2026]]></title><description><![CDATA[Google announced roughly a hundred things in two hours. The betting markets didn't move a point.]]></description><link>https://rundatarun.io/p/last-30-days-google-io-2026</link><guid isPermaLink="false">https://rundatarun.io/p/last-30-days-google-io-2026</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Wed, 20 May 2026 14:46:33 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!maKI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>About Last 30 Days.</strong> Cross-platform research sweeps on topics worth paying attention to. Every post pulls Reddit, X, YouTube, Hacker News, Polymarket, and the web from the last 30 days, then synthesizes what people are actually saying, building, and betting on. Topics get picked when the signal is high and the story is contradictory, when a single headline would lie about the shape of what&#8217;s happening. 
Each post follows the same arc: one specific finding that earns the click, why the topic deserves a sweep right now, the themed synthesis with inline citations, and the follow-up threads worth watching next.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!maKI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!maKI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!maKI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!maKI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!maKI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!maKI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/c090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:667006,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198568633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!maKI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!maKI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!maKI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!maKI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc090fe11-00d4-4baa-bc30-8768dbb6e8eb_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>On May 19, Google walked onto the Shoreline stage and shipped what one X recap fairly called <a href="https://x.com/smasithick/status/2056818138994663903">&#8220;100 things in 2 hours&#8221;</a>: a new flagship model, a video world-model, a 24/7 personal agent, an agent-first IDE, agentic Search, smart glasses, a redesigned app, and a price cut on its top tier. The keynote pulled 8.7 million YouTube views in a day.</p><p>And then I checked Polymarket. The market for <strong><a href="https://polymarket.com/event/which-company-has-the-best-ai-model-end-of-may">&#8220;best AI model at the end of May&#8221;</a></strong> still reads Anthropic 96%, Google 1%, OpenAI 1%. 
Eight-plus million dollars of volume, and Google&#8217;s biggest AI day of the year barely registered on it.</p><p>That gap is the whole story.</p><h2>Why this topic deserves a sweep</h2><p>Most product keynotes can be summed up in a headline. This one can&#8217;t, because the announcement volume and the market reaction point in opposite directions, and both are real signals.</p><p>The keynote was, by any literal measure, enormous. Google said it now processes <strong>3.2 quadrillion tokens a month</strong>, up sevenfold in a year, and that it expects to spend <strong>$180 to $190 billion in capex in 2026</strong>, roughly six times the 2022 figure. AI Mode in Search crossed a billion monthly users in twelve months. These are not small numbers, and Google clearly wanted the scale to be the message.</p><p>But scale isn&#8217;t the same as the crown. The week&#8217;s most honest read came not from the launch coverage but from the benchmarks, the pricing, and the prediction markets. Put those next to the keynote reel and you get a far more interesting picture than &#8220;Google wins AI.&#8221; You get a company that is winning on distribution and time while still trailing on the one question developers actually argue about: whose model is best right now.</p><p>So this is a sweep worth running. A single headline would lie about the shape of it.</p><h2>The model news: fast, capable, and oddly expensive</h2><p>The centerpiece was <strong>Gemini 3.5 Flash</strong>, generally available the same day across the app, Search, and the API. Google&#8217;s pitch: <a href="https://www.youtube.com/watch?v=wYSncx9zLIU">&#8220;frontier intelligence with action.&#8221;</a> The headline claim is speed: output four times faster than comparable frontier models and up to twelve times faster inside Google&#8217;s own agent harness, with a live demo clocking nearly 1,500 tokens per second while writing a playable Chrome dino game. 
On benchmarks it beats the prior 3.1 Pro across nearly the whole board, with a notable jump on GDPval, the test meant to capture economically valuable real-world work. The heavier 3.5 Pro stayed internal, promised for &#8220;next month.&#8221;</p><blockquote><p>The branding said Flash. The price tag said something else.</p></blockquote><p>Here&#8217;s where it got interesting. According to <a href="https://www.latent.space/p/ainews-google-io-2026-gemini-35-flash">Artificial Analysis, via latent.space&#8217;s recap</a>, 3.5 Flash lands at $1.50 per million input tokens and $9.00 per million output, which they measured as <strong>5.5 times costlier than Gemini 3 Flash and 75% costlier than Gemini 3.1 Pro</strong> on their suite. For a tier whose entire identity is &#8220;the cheap, fast workhorse you run by default,&#8221; that&#8217;s a real shift. Developers noticed immediately. The fast-and-capable framing held up. The Flash-means-cheap assumption did not.</p><p>The second model is <strong>Gemini Omni</strong>, a multimodal world-model that takes text, image, audio, or video in and produces editable video out. Google framed it as <a href="https://9to5google.com/2026/05/19/google-io-2026-news/">&#8220;the Nano Banana for video moment,&#8221;</a> and the live demos (turn a selfie into a black-hole scene, restyle raw footage while preserving the performance) were genuinely strong. The skeptics were not silent. Some called the rougher outputs &#8220;B-tier video-game interface&#8221; and over-templated. Both things can be true: a clear step up in controllability, not yet a finished product.</p><h2>The real theme was agents</h2><p>Strip away the model launches and the spine of the keynote was agents, and on that front Google was more aggressive than I expected.</p><p><strong>Antigravity 2.0</strong> is now a standalone, &#8220;unabashedly agent-first&#8221; desktop app, plus a full CLI, an SDK, and Managed Agents in the API. 
The harness picked up sub-agents, hooks, and async task management as first-class primitives. The marquee demo was the one everyone&#8217;s still talking about:</p><blockquote><p>93 parallel sub-agents, 15,000 model requests, 2.6 billion tokens, twelve hours, under $1,000 in API credits, and out the other end came a functioning operating system built from scratch. Then they played Doom on it, live.</p></blockquote><p>Take the staging with a grain of salt. The point underneath it is architectural, and it&#8217;s the same point the pricing makes: Google is betting on <strong>many fast, cheap agents running in parallel</strong> rather than one expensive monolithic run. The OS demo is a flex, but it&#8217;s a coherent flex. It tells you how Google wants you to build.</p><p>The consumer-facing version is <strong>Gemini Spark</strong>, a 24/7 personal agent that runs on dedicated Google Cloud VMs. &#8220;Close your laptop,&#8221; Google said, and it keeps working in the background, with third-party tools arriving via MCP. Trusted testers got it this week; Ultra subscribers are next.</p><p>The friction here is sprawl. Developers spent the day asking a fair question: do I use Gemini CLI or Antigravity CLI? Spark or Antigravity? The naming pile (Omni, Spark, Antigravity, Halo, Pix, Flow) is a lot to hold in your head, and it leaves a few too many surfaces chasing the same agentic territory.</p><h2>Search quietly became an agent platform</h2><p>The change that will touch the most people got the least drama. Google merged <strong>AI Mode and AI Overviews into one search experience</strong>, live worldwide that day, and wired the Antigravity harness directly into the results page. Ask &#8220;how do black holes affect spacetime&#8221; and Search now generates a <strong>custom interactive visualization on the fly</strong>, built per query. 
They&#8217;re calling it generative UI, it&#8217;s free, and it rolls out this summer.</p><p>Alongside it: <strong>Information Agents</strong> that monitor the web 24/7 for whatever you tell them to watch, and a <strong>Universal Cart</strong> that follows you across Search, YouTube, and Gmail, hunting deals in the background. This is the most consequential bet in the whole keynote, because it changes what a search result <em>is</em>. The model news will be old in three months.</p><blockquote><p>A search box that builds you a tool instead of a list of links is a different thing.</p></blockquote><h2>The pricing twist nobody expected</h2><p>Google <strong>cut</strong> the price of its top Ultra tier, from $250 a month to $200, and added a new $100 tier with five times the usage limits of Pro. It also moved from daily limits to a compute-credit model that refreshes every five hours.</p><p>Cutting the flagship price while raising the workhorse price is a strange pair of moves, and it tells you something. Google wants power users locked into Ultra, and it&#8217;s willing to discount to get them. The Flash price hike suggests the economics of &#8220;good enough, cheap, at scale&#8221; got harder, not easier. Read together, the two changes say Google is repricing around agents that burn tokens by the billion, not around the old chatbot-prompt unit.</p><h2>The markets didn&#8217;t move</h2><p>Back to where we started, because it&#8217;s the cleanest signal in the sweep.</p><p>Despite the blitz, the near-term prediction markets stayed put. <strong>Best model end of May</strong>: <a href="https://polymarket.com/event/which-company-has-the-best-ai-model-end-of-may">Anthropic 96%, Google 1%</a>. The end-of-June market is kinder to Google at roughly 26% against Anthropic&#8217;s 70%, but still not a lead. 
<a href="https://polymarket.com/event/gemini-4pt0-released-by-june-30-2026">&#8220;Gemini 4.0 by June 30&#8221;</a> trades at 2% Yes.</p><p>And yet the long-horizon markets tell the opposite story. <strong><a href="https://polymarket.com/event/which-companies-will-have-a-1-ai-model-by-december-31">&#8220;#1 model by December 31&#8221;</a> gives Google 76%.</strong> The market on <strong>Google&#8217;s valuation versus OpenAI and Anthropic combined by year-end puts Google at 72%.</strong> So the collective bet is remarkably specific: Google did not take the model crown on I/O day, but it&#8217;s the favorite to hold it by year-end, and the heavy favorite to win on distribution and balance sheet regardless.</p><blockquote><p>Volume is not the crown. Distribution and time might be.</p></blockquote><p>That&#8217;s the synthesis. The keynote was the most agent-forward thing Google has done, the Search rebuild is the real long-term play, and the money on the table says shipping a hundred things in two hours is a statement of intent, not a change in standings.</p><h2>What I&#8217;m watching next</h2><ul><li><p><strong>Does the market drift toward Google as 3.5 Pro ships and developers actually run Flash?</strong> The end-of-June Polymarket line is the one to watch. If it doesn&#8217;t move off ~26%, the announcements didn&#8217;t land where it counts.</p></li><li><p><strong>The Flash pricing backlash.</strong> If &#8220;Flash&#8221; no longer means cheap, does Google clarify the tiering, or does it cede the default-workhorse slot to a competitor? Watch developer sentiment over the next two weeks.</p></li><li><p><strong>The 12-hour OS demo, audited.</strong> Is &#8220;93 sub-agents, 2.6B tokens, under $1K&#8221; a repeatable capability or a one-shot showcase? 
That&#8217;s a decode worth doing on its own.</p></li></ul><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EyVo!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EyVo!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!EyVo!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!EyVo!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!EyVo!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EyVo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png" width="1200" height="896" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/cdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:526755,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198568633?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EyVo!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!EyVo!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!EyVo!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!EyVo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fcdaee1ee-fea9-4eb9-a46c-4731155fbd40_1200x896.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><em>Sources: the <a href="https://www.youtube.com/watch?v=wYSncx9zLIU">Google I/O &#8217;26 keynote</a> (8.7M views) and <a href="https://www.youtube.com/watch?v=OMhKgQmeMhI">The Verge&#8217;s 35-minute cut</a>; <a href="https://www.latent.space/p/ainews-google-io-2026-gemini-35-flash">latent.space&#8217;s AINews recap</a> (benchmarks + pricing analysis); <a href="https://9to5google.com/2026/05/19/google-io-2026-news/">9to5Google&#8217;s full roundup</a>; <a href="https://www.cnbc.com/2026/05/19/google-ai-ultra-gemini-spark-omni.html">CNBC</a>; <a href="https://www.tomsguide.com/news/live/google-io-2026-live-news-updates">Tom&#8217;s Guide live blog</a>; and Polymarket. 
Methodology: a <a href="https://rundatarun.io">last30days</a> cross-platform sweep across Reddit, X, YouTube, Hacker News, Polymarket, and the web, run May 20, 2026, then synthesized.</em></p>]]></content:encoded></item><item><title><![CDATA[Last 30 Days: The Enterprise Battle for Claude Code]]></title><description><![CDATA[Audio companion to the May 18 Last 30 Days post.]]></description><link>https://rundatarun.io/p/last-30-days-the-enterprise-battle</link><guid isPermaLink="false">https://rundatarun.io/p/last-30-days-the-enterprise-battle</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Mon, 18 May 2026 12:50:31 GMT</pubDate><enclosure url="https://api.substack.com/feed/podcast/198255129/2d8893762f2dfbc857abdb8ead55d3c3.mp3" length="0" type="audio/mpeg"/><content:encoded><![CDATA[<p>Twenty-minute audio overview generated by NotebookLM from the May 18 Last 30 Days post. Two AI hosts debate the central contradiction in Anthropic&#8217;s last 30 days: Polymarket has Claude 5 by May 31 at 22% and falling, but Anthropic-valued-higher-than-OpenAI at 89% and rising.</p><p>The audio is one slice. The full NotebookLM notebook has more: infographic, video overview, mind map, slide deck, and a deep-dive chat to ask follow-up questions. 
Explore here: <a href="https://notebooklm.google.com/notebook/2fbb15f2-9410-4bc6-9cc9-fd6f629e7c24">https://notebooklm.google.com/notebook/2fbb15f2-9410-4bc6-9cc9-fd6f629e7c24</a></p><p>Source post with citations: <a href="https://rundatarun.io/p/last-30-days-claude-code">https://rundatarun.io/p/last-30-days-claude-code</a></p>]]></content:encoded></item><item><title><![CDATA[Last 30 Days: Claude Code]]></title><description><![CDATA[Pro plan flap, a candid postmortem, a CVE, Uber's whole AI budget, and a quiet crackdown on harness substitution]]></description><link>https://rundatarun.io/p/last-30-days-claude-code</link><guid isPermaLink="false">https://rundatarun.io/p/last-30-days-claude-code</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Mon, 18 May 2026 12:24:00 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!GJgq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><strong>About Last 30 Days.</strong> Cross-platform research sweeps on topics worth paying attention to. Every post pulls Reddit, X, YouTube, Hacker News, Polymarket, and the web from the last 30 days, then synthesizes what people are actually saying, building, and betting on. Topics get picked when the signal is high and the story is contradictory, when a single headline would lie about the shape of what&#8217;s happening. 
Each post follows the same arc: one specific finding that earns the click, why the topic deserves a sweep right now, the themed synthesis with inline citations, and the follow-up threads worth watching next.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!GJgq!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!GJgq!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!GJgq!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!GJgq!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!GJgq!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!GJgq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png" width="1376" height="768" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:777704,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198252653?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!GJgq!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!GJgq!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!GJgq!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!GJgq!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6e185e1d-b61e-4542-b6ea-c3ba144e69d4_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In early May, Uber burned its entire 2026 AI budget on Claude Code in four months. That isn&#8217;t a leak or a critique. It&#8217;s the lede on a Briefs.co story that landed on Hacker News with 402 points and 475 comments, and it&#8217;s the cleanest single read on where developer tooling spend has actually moved.</p><p>It&#8217;s also one of about a dozen storylines worth tracking on Claude Code right now.</p><h2><strong>Why this topic deserves a sweep</strong></h2><p>Most months you can describe a dev tool with one headline. Claude Code in the last 30 days produced at least eight, and they pull in opposite directions. Anthropic stripped Claude Code from the $20 Pro tier and then doubled rate limits two weeks later. Anthropic published a postmortem admitting a verbosity instruction they added on April 16 broke quality across the product. 
Uber and Amazon went all in; Microsoft started canceling licenses. Show HN was full of new Claude Code skills and plugins; Anthropic shipped an official <code>claude-code-setup</code> plugin alongside a public Champion Kit for engineers pushing the product internally. A CVE landed for sandbox escape via symlink. Polymarket priced Claude 5 by May 31 at 22% and falling, while pricing Anthropic-as-company higher than OpenAI at 89% and rising.</p><p>You can&#8217;t read all that and conclude one thing. Which is exactly why it&#8217;s worth the sweep. A single headline would lie about the shape of what&#8217;s happening.</p><h2><strong>The Pro-plan flap, and the reversal</strong></h2><p>Around April 21, Anthropic quietly stripped Claude Code from the $20 Pro tier for new users. The community caught it before any announcement. Two Hacker News threads piled up 948 combined points and 680+ comments over the next 48 hours (<a href="https://news.ycombinator.com/item?id=47854477">HN1</a>, <a href="https://news.ycombinator.com/item?id=47855832">HN16</a>). The product page contradicted the pricing page; existing Pro subscribers retained access until cycle renewal; support docs got quietly edited. Classic incomplete-rollout posture.</p><p>Two weeks later Anthropic course-corrected. On May 6, the @claude_code account posted a <a href="https://x.com/claude_code/status/2052071730190123094">PSA</a>: &#8220;2x&#8217;ed Claude Code&#8217;s 5-hour rate limits for Pro, Max, and Team plans. Compute is coming for users, builders, and knowledge coworkers.&#8221; The same day, Ars Technica reported <a href="https://arstechnica.com/ai/2026/05/anthropic-raises-claude-code-usage-limits-credits-new-deal-with-spacex/">the rate-limit raise was credited to the SpaceX deal</a> (<a href="https://news.ycombinator.com/item?id=48043007">HN26</a>).</p><p>Read those two events together and you get the actual posture. 
Anthropic is rationing compute, the Pro tier was the first thing to give, and a big customer deal bought enough capacity to walk it back. It isn&#8217;t the narrative Anthropic would prefer, but it&#8217;s coherent.</p><h2><strong>The postmortem nobody expected</strong></h2><p>On April 23, Anthropic published <a href="https://www.anthropic.com/engineering/april-23-postmortem">a candid postmortem</a> on Claude Code quality regressions, including the admission that they added a system-prompt instruction on April 16 to reduce verbosity and it broke things downstream (<a href="https://news.ycombinator.com/item?id=47878905">HN2</a>, 942 points, 732 comments).</p><p>The comment thread split. One half: real respect for publishing it instead of the usual cryptic employee tweets. The other half, paraphrased: &#8220;it&#8217;s incredible how forgiving you guys are.&#8221; Both are right. Anthropic is the only major lab that puts the receipts on the table when something breaks, and that&#8217;s worth something. The other reading, that the bar for &#8220;exceptional candor&#8221; is set absurdly low because incumbents publish nothing, is also true.</p><h2><strong>OpenClaw, and the lazy regex</strong></h2><p>The single largest story by engagement was the OpenClaw/Hermes flap. On April 30, Theo posted <a href="https://twitter.com/theo/status/2049645973350363168">a viral thread</a> claiming Claude Code refuses requests or surcharges sessions whose commits mention &#8220;OpenClaw.&#8221; It hit 1,349 Hacker News points and 720 comments (<a href="https://news.ycombinator.com/item?id=47963204">HN1</a>). Top comments converged on one read: &#8220;lazy string regex-style matching&#8221; implemented in a hurry, against a competitive harness, in production, without a public explanation.</p><p>This is the only clearly negative beat in the period that didn&#8217;t get a public Anthropic response. 
The silence is data.</p><h2><strong>Enterprise adoption is everything, and contradictory</strong></h2><p><a href="https://www.briefs.co/news/uber-torches-entire-2026-ai-budget-on-claude-code-in-four-months/">Uber torched its entire 2026 AI budget on Claude Code in four months</a>. <a href="https://www.businessinsider.com/amazon-claude-code-codex-all-employees-after-pushback-2026-5">Amazon rolled it out internally alongside Codex</a> after pushback from engineers who wanted it (<a href="https://news.ycombinator.com/item?id=48018682">HN30</a>). <a href="https://www.theverge.com/tech/930447/microsoft-claude-code-discontinued-notepad">Microsoft started canceling licenses</a>, heading in the other direction (<a href="https://news.ycombinator.com/item?id=48141086">HN27</a>).</p><p>Anthropic also published a <a href="https://code.claude.com/docs/en/champion-kit">Champion Kit</a>, a public playbook for engineers pushing Claude Code internally at their companies (<a href="https://news.ycombinator.com/item?id=47945021">HN20</a>). Sales motion as a Markdown doc. That&#8217;s a recognition that the buying pattern is bottom-up and champion-driven, not top-down and vendor-evaluated, and Anthropic is meeting it where it lives.</p><p>What this tells you: enterprise buying of AI tooling in 2026 is heterogeneous in a way that breaks the &#8220;one winner takes all&#8221; narrative. Uber and Microsoft are looking at the same product and reaching opposite conclusions inside the same fiscal year.</p><h2><strong>The plugin ecosystem is the dominant builder story</strong></h2><p>Anthropic shipped <code>claude-code-setup</code> as an official plugin that scans your project and recommends hooks, skills, MCP servers, subagents, and automations. The viral teach-it post (<a href="https://x.com/NainsiDwiv50980/status/2056316252176658484">@NainsiDwiv50980</a>) is closer to fanfic than reportage, but the underlying ship is real. 
Marketplace plugins are being actively promoted from the official account (<a href="https://x.com/claude_code/status/2053049308736639212">frontend-slides</a>, <a href="https://x.com/claude_code/status/2051686936553861276">synthadoc</a>).</p><p>Show HN bursts in the same window:</p><ul><li><p><a href="https://news.ycombinator.com/item?id=48090276">adamsreview</a>, multi-agent PR reviews (85 points)</p></li><li><p><a href="https://news.ycombinator.com/item?id=48083919">academic-research-skills</a> (82 points)</p></li><li><p><a href="https://news.ycombinator.com/item?id=48045711">Kstack</a>, k8s monitoring (25 points)</p></li><li><p><a href="https://news.ycombinator.com/item?id=47979438">Destiny</a>, fortune-teller (41 points)</p></li><li><p><a href="https://news.ycombinator.com/item?id=47916909">EvanFlow</a>, TDD loop (111 points)</p></li><li><p><a href="https://news.ycombinator.com/item?id=48130679">learning-opportunities</a> (252 points)</p></li></ul><p>Six skill packages on the front page in 30 days isn&#8217;t an ecosystem yet. It is the early shape of one. The interesting question for the next 90 days isn&#8217;t whether Anthropic builds a marketplace; they&#8217;ve already started. It&#8217;s whether they let community plugins shape Claude Code&#8217;s default behavior, or whether they keep the official path narrow and the marketplace decorative.</p><h2><strong>Anthropic publishes the playbook</strong></h2><p>On May 15, Anthropic shipped <a href="https://claude.com/blog/how-claude-code-works-in-large-codebases-best-practices-and-where-to-start">&#8220;How Claude Code works in large codebases&#8221;</a>, the first official documentation on enterprise-scale Claude Code usage (<a href="https://news.ycombinator.com/item?id=48144494">HN10</a>, 241 points, 158 comments). Paired with the Champion Kit a few weeks earlier, the pattern is consistent. 
Anthropic is meeting the buyer where the buying actually happens, which is engineers wrestling Claude Code into million-line codebases and looking for sanctioned patterns instead of guessing.</p><p>This is the move that converts &#8220;Claude Code works great in a fresh repo&#8221; into &#8220;Claude Code has a sanctioned path for our monorepo.&#8221; Worth tracking the next iteration.</p><h2><strong>HTML as output is having a moment</strong></h2><p><a href="https://news.ycombinator.com/item?id=48071940">&#8220;Using Claude Code: The unreasonable effectiveness of HTML&#8221;</a> hit 528 points and 274 comments. Pair it with <a href="https://x.com/claude_code/status/2053049304521265579">@claude_code&#8217;s May 9 tip</a>: &#8220;Instead of 100s lines of markdown, ask Claude Code to generate an HTML brief,&#8221; then chain pages into slides and publish the artifact directly from the Claude desktop app.</p><p>This is the most underrated practical pattern of the period. Markdown is the default output for LLM-generated documents because it&#8217;s safe; HTML is what&#8217;s actually useful for distribution because it carries structure and styling end-to-end. The fact that Claude Code handles HTML well, and that the desktop app makes publishing a one-click step, changes what an artifact is. A &#8220;brief&#8221; used to be a doc. It can now be a small website that ships in two minutes.</p><h2><strong>Quality control is now a category</strong></h2><p><a href="https://github.com/delta-hq/cc-canary">CC-Canary</a> for detecting regressions in Claude Code output (<a href="https://news.ycombinator.com/item?id=47893620">HN19</a>). An <a href="https://news.ycombinator.com/item?id=47936579">Ask HN: Is it just me or is Claude Code getting worse?</a> thread. <a href="https://www.anthropic.com/engineering/april-23-postmortem">Anthropic&#8217;s own April 23 postmortem</a>. 
People do not trust that the model is stable across deploys, and the response is third-party tooling to detect when something has shifted.</p><p>That&#8217;s a normal evolution for any platform that hits production scale, but it&#8217;s worth naming. &#8220;Is Claude Code regressing?&#8221; is now a question with an answer-shaped tool to point at, not a vibes-based forum complaint.</p><h2><strong>CVE-2026-39861</strong></h2><p><a href="https://github.com/advisories/GHSA-vp62-r36r-9xqp">Sandbox escape via symlink</a> landed in the GitHub Advisory Database (GHSA) on May 8 (<a href="https://news.ycombinator.com/item?id=48057842">HN21</a>). Modest engagement, 51 points and 9 comments. It&#8217;s the first publicly tracked CVE for Claude Code. There will be more. This one is mild and patched; the precedent of &#8220;Claude Code has a CVE number&#8221; is the real signal.</p><h2><strong>The harness-substitution crackdown</strong></h2><p>On May 13, Anthropic added <a href="https://twitter.com/i/status/2054610152817619388">new programmatic usage restrictions</a> (<a href="https://news.ycombinator.com/item?id=48126438">HN22</a>). Read alongside the OpenClaw regex and the rate-limit framing, the pattern is consistent. Anthropic is making it harder to use Claude Code&#8217;s harness with non-Claude models or to scrape its API for non-interactive workloads.</p><p>The reaction is also coherent. <a href="https://github.com/aattaran/deepclaude">DeepClaude</a>, DeepSeek V4 Pro running through the Claude Code agent loop, hit 678 points and 281 comments (<a href="https://news.ycombinator.com/item?id=48002136">HN4</a>). The top comment said the quiet part: &#8220;the next CC upgrade will blow your subscription for doing this.&#8221; The harness is the product, the model is increasingly fungible, and Anthropic is defending the seam.</p><h2><strong>Polymarket priced it</strong></h2><p>Five live Polymarket markets touched Claude Code or Anthropic in the period. 
The two with the sharpest signal first.</p><p><strong><a href="https://polymarket.com/event/claude-5-released-by">Claude 5 released by May 31, 2026</a> &#8212; 22% Yes</strong>, down 4.8% over the month. $1.88M volume, the largest Anthropic-related market on the platform.</p><p><strong><a href="https://polymarket.com/event/anthropic-valued-higher-than-openai-in-2026">Anthropic valued higher than OpenAI in 2026</a> &#8212; 89% Yes</strong>, up 25.5% over the month. $79K volume.</p><p><strong><a href="https://polymarket.com/event/anthropic-claude-score-on-frontiermath-benchmark-by-june-30">Anthropic Claude scores &#8805;50% on FrontierMath by June 30</a> &#8212; 54% Yes</strong>, up 17.5% this week. $51K volume.</p><p><strong><a href="https://polymarket.com/event/claude-mythos-released-by">Claude Mythos released by June 30</a> &#8212; 20% Yes</strong>, up 8% this week. $57K volume.</p><p><strong><a href="https://polymarket.com/event/will-claude-go-down-on-days-in-may">Claude goes down 12+ days in May</a> &#8212; 42% Yes</strong>, down 34.5% over the month. $9K volume.</p><p>Read the top two together: bearish on the imminent next-gen model release, structurally bullish on the company. Which fits the rest of the period. Anthropic is rationing compute, walking back tier removals, publishing postmortems, and locking down harness substitution. All signals of a business that&#8217;s compute-constrained at a moment of high commercial demand.</p><p>The FrontierMath market at 54% is the second-most-interesting beat. It implies a coin-flip on Anthropic hitting a hard math benchmark by June 30, which would imply some kind of intermediate model release inside that window. 
Read alongside the 22% Claude 5 odds, the consensus is &#8220;incremental upgrade likely, named successor unlikely.&#8221;</p><h2><strong>What I&#8217;m watching next</strong></h2><p>Three threads worth pulling on:</p><ol><li><p><strong>The harness-substitution economics.</strong> OpenClaw + Hermes regex + May 13 programmatic restrictions + DeepClaude. The clean piece to write is &#8220;the harness is the product,&#8221; what it costs Anthropic to defend and what it costs the open ecosystem when defended.</p></li><li><p><strong>The &#8220;is it getting worse&#8221; cohort.</strong> CC-Canary, the Ask HN thread, the April 23 postmortem. Quality regression detection is becoming a sub-tooling category. Worth a focused pull on what monitoring infrastructure for closed-weight model deploys actually looks like.</p></li><li><p><strong>The plugin ecosystem map.</strong> Six Show HN skills in 30 days plus official <code>claude-code-setup</code>. Worth a deeper survey of what&#8217;s emerging in the Claude Code plugin economy and which patterns are differentiated vs duplicative.</p></li></ol><p>If one of those is a piece you&#8217;d want me to write, reply or note it on Substack. 
Those are the next deeper cuts.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!16EJ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!16EJ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 424w, https://substackcdn.com/image/fetch/$s_!16EJ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 848w, https://substackcdn.com/image/fetch/$s_!16EJ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 1272w, https://substackcdn.com/image/fetch/$s_!16EJ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!16EJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png" width="1456" height="991" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:991,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:148408,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198252653?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!16EJ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 424w, https://substackcdn.com/image/fetch/$s_!16EJ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 848w, https://substackcdn.com/image/fetch/$s_!16EJ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 1272w, https://substackcdn.com/image/fetch/$s_!16EJ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F72faa356-19d6-4424-bae9-8192bfcb1c02_1757x1196.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div><hr></div><h2><strong>Sources</strong></h2><pre><code><code>&#128202; 36 X &#183; 30 HN &#183; 15 Polymarket &#183; 7 YouTube &#183; 0 Reddit (API failure)
&#9201;  Runtime: 82.4s
&#128197; Window: 2026-04-18 &#8594; 2026-05-18
&#128293; Highest engagement: HN1 OpenClaw flap (1,349 pts / 720 comments)
&#128176; Biggest market: Claude 5 release ($1.88M Polymarket volume)
</code></code></pre><p>Methodology: cross-platform sweep run via the <code>last30days</code> skill (Reddit, X, YouTube, TikTok, Hacker News, Polymarket, web). Synthesis is opinionated; citations are not. Every link is to the primary source where I could find it. Mistakes are mine, corrections welcome.</p>]]></content:encoded></item><item><title><![CDATA[The AND we have to hold]]></title><description><![CDATA[One week of AI-in-medicine headlines, two opposite directions, and three questions to keep asking in this week's Sunday Deep Dive]]></description><link>https://rundatarun.io/p/the-and-we-have-to-hold</link><guid isPermaLink="false">https://rundatarun.io/p/the-and-we-have-to-hold</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Mon, 18 May 2026 00:15:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!yk66!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>Every Sunday, I pick one (or more) papers or releases worth your time, break them down, and tell you why they matter. No hype. No summaries of summaries. 
Just the idea, explained.</em></p><div><hr></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!yk66!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!yk66!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 424w, https://substackcdn.com/image/fetch/$s_!yk66!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 848w, https://substackcdn.com/image/fetch/$s_!yk66!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 1272w, https://substackcdn.com/image/fetch/$s_!yk66!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!yk66!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png" width="1456" height="813" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:813,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6310145,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198192493?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!yk66!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 424w, https://substackcdn.com/image/fetch/$s_!yk66!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 848w, https://substackcdn.com/image/fetch/$s_!yk66!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 1272w, https://substackcdn.com/image/fetch/$s_!yk66!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F24e6577b-4145-471a-b6e9-1f5e607ee36d_5504x3072.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>This week, two stories about AI in medicine sat next to each other in my feed.</p><p>One was <a href="https://www.science.org/doi/10.1126/science.adz4433">a paper in </a><em><a href="https://www.science.org/doi/10.1126/science.adz4433">Science</a></em>. Researchers ran OpenAI&#8217;s o1 against hundreds of physicians on a stack of validated clinical reasoning tests, then dropped the model into a prospective study of real emergency-room patients at a major academic medical center. On the ER cases at triage, the model placed the correct diagnosis at the top of its differential 67.1 percent of the time. The two attending physicians it was tested against landed at 55.3 percent and 50.0 percent. Raters identified AI versus human correctly between 3 and 15 percent of the time, which is to say the blinding worked. 
The paper&#8217;s own abstract closes with the line that &#8220;LLMs have eclipsed most benchmarks of clinical reasoning, motivating the urgent need for prospective trials.&#8221;</p><p>The other was <a href="https://www.auditor.on.ca/en/content/specialreports/specialaudits/en2026/AR_2026_AI_EN.html">a special report from Ontario&#8217;s Auditor General</a>. Across twenty AI medical-scribe systems approved for use by roughly five thousand Ontario physicians, evaluators found that nine of twenty fabricated information that wasn&#8217;t in the recording, twelve of twenty captured a different drug than the doctor prescribed, and seventeen of twenty missed key details about patients&#8217; mental health. The fabrications included notes stating &#8220;no masses found&#8221; and &#8220;presence of anxiety in the patient&#8221; when neither was discussed. The auditor issued ten recommendations. The program is still running.</p><blockquote><p><strong>Same field. Same week. Both verified. Both shipping in production right now.</strong></p></blockquote><div><hr></div><h2><strong>The instinct to pick</strong></h2><p>You can feel the pull. Pick one story and the post writes itself. <em>AI is ready, physicians should be paying attention.</em> Or, <em>AI is unsafe in clinical settings, we need to slow down.</em> Each version has a constituency, a publication ladder, a Twitter audience already in formation. The opposite story is the noise around the signal you&#8217;ve decided to amplify.</p><p>The both-true version is harder to write and harder to read. The first comment under it is going to be &#8220;but which one is it, really.&#8221; The second is going to be &#8220;this is just hedging.&#8221; So you don&#8217;t write that version. You collapse to one. The field collapses to one. The discourse collapses to one. Every week we do this with a new story and a new opposite.</p><p>This isn&#8217;t a contrarian frame, by the way. 
It&#8217;s how the people doing the work talk about the work, when they&#8217;re not in a press cycle. Hold that thought for section seven.</p><div><hr></div><h2><strong>The power</strong></h2><p><a href="https://www.science.org/doi/10.1126/science.adz4433">The </a><em><a href="https://www.science.org/doi/10.1126/science.adz4433">Science</a></em><a href="https://www.science.org/doi/10.1126/science.adz4433"> paper</a>, by Brodeur and colleagues, is among the most carefully designed evaluations of a medical LLM I have seen this year. The headline result has been around for two weeks. The methodology is what makes the result hard to dismiss.</p><p>Six experiments, plus the ER arm. The six were probes of different reasoning capacities, scored on validated qualitative scales developed for evaluating physicians, not on automated metrics. Diagnostic reasoning measured by R-IDEA, a ten-point scale used in medical education. Management reasoning scored on cases that were never publicly released, specifically to prevent training-data memorization. Probabilistic reasoning against a baseline of 553 practitioners. The model&#8217;s advantage was largest on the experiments where rubrics rewarded comprehensive coverage and smaller or absent where they measured focused clinical judgment.</p><p>The ER arm is the closest thing the field has produced to a prospective evaluation. Seventy-six randomly selected emergency-department patients at a tertiary academic center. The model and two attending physicians each generated a differential at three touchpoints in the patient&#8217;s evaluation: triage, after the initial physician encounter, and at admission. An independent rater, blinded to whether each note came from a human or the model, scored them on a scale validated for clinical reasoning quality. Mixed-effects regression to handle case-clustering. 
The 67-versus-55-percent gap at triage is the headline, but the whole apparatus around it is what makes the headline land where prior work didn&#8217;t.</p><p>The authors&#8217; own framing in the closing paragraph is precise: the benchmarks have been eclipsed, and the next thing the field needs is prospective trials, not more benchmarks. It is a careful sentence in a careful paper. The press has been less careful.</p><div><hr></div><h2><strong>The shadow on the power</strong></h2><p>Here is the part of the same paper that didn&#8217;t trend.</p><p>The ER study tested the same patients, the same model, and the same two physicians at three points in the workup. The only thing that changed across the three points was how much clinical information was available. At triage, the model led by 12 to 17 percentage points. By admission, the model&#8217;s accuracy was 81.6 percent. One of the two physicians had reached 78.9 percent. That difference was no longer statistically significant.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!h3F7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!h3F7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 424w, https://substackcdn.com/image/fetch/$s_!h3F7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 848w, 
https://substackcdn.com/image/fetch/$s_!h3F7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 1272w, https://substackcdn.com/image/fetch/$s_!h3F7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!h3F7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png" width="1456" height="1145" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1145,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:291206,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198192493?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!h3F7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 424w, 
https://substackcdn.com/image/fetch/$s_!h3F7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 848w, https://substackcdn.com/image/fetch/$s_!h3F7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 1272w, https://substackcdn.com/image/fetch/$s_!h3F7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F37b93ae7-a1cf-47db-a482-c3aebeffdd86_2589x2036.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Same study. Same patients. The advantage concentrates where information is sparse and erodes as information accumulates. In real emergency-department practice, no clinical decision turns on the triage differential alone. The next steps, the orders and the exam and the response to therapy, are exactly the points where the model&#8217;s edge dissolved in the paper&#8217;s own data. The headline is true at one touchpoint and not at the others, and the touchpoint where it&#8217;s true is the touchpoint where the decision hasn&#8217;t been made yet.</p><p>A second piece of work points the same direction. <a href="https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679">Rao and colleagues, in </a><em><a href="https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679">JAMA Network Open</a></em><a href="https://jamanetwork.com/journals/jamanetworkopen/fullarticle/2847679"> last month</a>, evaluated twenty-one large language models on twenty-nine clinical reasoning tasks. The best of the 21 models reached 91 percent accuracy on final diagnosis. All of them failed more than 80 percent of the time when asked to generate a differential. They collapsed prematurely onto a single answer, then defended it. &#8220;AI is good at clinical reasoning&#8221; is a claim that doesn&#8217;t survive the task being specified. The model is good at picking from a constrained menu of correct answers. It collapses on the open problem.</p><p>The argument against the dismissal of these results is the same as the argument against overclaiming them. Magnitude claims should match the strongest version of the evidence. 
The strongest version, on contemporaneous head-to-head data, holds up and is narrower than the press version.</p><div><hr></div><h2><strong>The danger</strong></h2><p>Now the other story.</p><p><a href="https://www.auditor.on.ca/en/content/specialreports/specialaudits/en2026/AR_2026_AI_EN.html">Ontario&#8217;s special report</a> runs fifty pages on AI use in government. Section 4.3 is the part about medical scribes. The evaluator findings are concrete in a way the discourse rarely is. Nine of twenty systems hallucinated, which the report defines as fabricating clinical information including referrals and tests that hadn&#8217;t been ordered. Twelve of twenty captured the wrong drug. Seventeen of twenty missed details about patients&#8217; mental health, in at least one of two simulated recordings. Six of twenty missed those details in both. The specific examples in the text include a system producing a note stating &#8220;no masses found&#8221; when the recording contained no such discussion, and another asserting that a patient had anxiety when no such symptom was mentioned.</p><p>The technology failed. That is half the story.</p><p>The other half is in the procurement document. Supply Ontario&#8217;s vendor scoring assigned weightings across ten criteria, out of 530 total possible points. Domestic presence in Ontario was weighted at 30 percent. Data privacy and legal controls were at 23 percent. System security controls totaled 11 percent. Accuracy of medical notes generated, the criterion that addresses what those nine systems were doing wrong, was weighted at 4 percent. Bias controls were at 2 percent. There were no minimum passing scores on any criterion in the second stage. 
A vendor could score zero on accuracy, zero on security, and zero on bias and still meet the aggregate threshold to become an approved Vendor of Record.</p><p>The UK&#8217;s NHS, by comparison, requires AI scribes to be assessed for safety and compliance as Class I medical devices through the Medicines and Healthcare products Regulatory Agency. Ontario has no equivalent gate. OntarioMD did issue guidelines for manual review of system-generated notes, but doctors weren&#8217;t required to attest that they had reviewed them. The guideline was a guideline.</p><p>When a procurement process weights accuracy at 4 percent and security at 11 percent and bias at 2 percent, the result is not a surprise. It&#8217;s the system asking for what it got.</p><blockquote><p><strong>The technology made fabrications. The procurement let the fabrications through.</strong></p></blockquote><p>The report contains one precedent worth naming. In December 2024, before the Vendor of Record arrangement existed, an Ontario hospital reported a privacy breach to the Information and Privacy Commissioner. A former staff member had used an unapproved AI transcription tool that recorded a meeting and then automatically distributed the transcribed notes to current and former staff. The tool was not malicious. It was doing what it was built to do. Nobody had asked whether the building was the right thing to be doing.</p><div><hr></div><h2><strong>The shadow on the danger</strong></h2><p><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12492056/">A study published earlier this year in </a><em><a href="https://pmc.ncbi.nlm.nih.gov/articles/PMC12492056/">JAMA Network Open</a></em>, led by Kristine Olson at Yale and run across six US health systems, followed 263 ambulatory clinicians for 30 days while they used an ambient AI scribe. The product was Abridge, the same kind of tool that lives on the approved-vendor list in Ontario. 
Burnout in the cohort dropped from 51.9 percent to 38.8 percent, an adjusted odds ratio of 0.26, with a 95 percent confidence interval running from 0.13 to 0.54. Severe burnout dropped from 18.4 percent to 12.2 percent. The authors named their own limitations, all of them real, none of them enough to wave away an effect size of that magnitude.</p><p>The numbers are not a contradiction of the Ontario report. They come from a different deployment context. Kaiser Permanente reports analogous figures across roughly ten thousand of its physicians using the same vendor, with documentation-related burnout dropping by close to 80 percent and note quality scored at 4.35 out of 5 by the clinicians themselves. The numbers will get audited, refined, and possibly trimmed. They will not get to zero.</p><p>The same three-word phrase, &#8220;AI medical scribe,&#8221; covers the nine-of-twenty-hallucinated systems audited by Ontario and the one-product, six-health-system study that cut burnout by about a quarter. The variance across deployments is itself the finding. What the technology does depends on which technology, which procurement process approved it, which workflow it landed in, which clinician is using it, and which oversight is wrapped around it. None of those variables are properties of &#8220;AI.&#8221;</p><p>The term clinicians use for what the tool returns is &#8220;pajama time.&#8221; It is the documentation they do at home after the kids are asleep. A scribe that works, in the deployment context it was built for, returns pajama time. A scribe that fails, in a deployment context that asked the wrong questions of the vendor, returns notes with fabricated medications. Both kinds ship. Both sit under the same three-word label.</p><div><hr></div><h2><strong>The anatomy of holding the AND</strong></h2><p>AI in medicine is incredibly useful. AI in medicine is incredibly dangerous. Not a contradiction the field will resolve into one answer. 
Both arrive together every week, in different ratios. The instinct is to collapse to one. The discipline is not collapsing.</p><p>This isn&#8217;t a stance I&#8217;m asking you to adopt. It is how the people doing the work talk about the work. Adam Rodman is a senior author on the <em>Science</em> paper that ran the ER study. He is at the same hospital where the ER arm was conducted. He is also <a href="https://spectrum.ieee.org/ai-clinical-decision-support">the person quoted in </a><em><a href="https://spectrum.ieee.org/ai-clinical-decision-support">IEEE Spectrum</a></em>, after the paper landed, saying he gets &#8220;a little queasy about how some of these results might be used,&#8221; and that the models are equally convincing whether they are right or wrong. One person. Both camps. On the record.</p><p>What that person is doing, when he holds both at once, is asking three questions. The questions don&#8217;t bend to which way the story is going. The answers shift. The asking is the muscle.</p><p><strong>What question survives the headline?</strong> Whether the news is &#8220;AI beats doctors&#8221; or &#8220;AI invents medications,&#8221; the same things stay on the table. Where is this system most likely to be wrong, and how would I know. Ask it of o1, and the paper&#8217;s own data answers: where information is full and the decision is actually being made. Ask it of Ontario, and the audit answers: in production, on real conversations, with no required clinician attestation that the note was ever reviewed.</p><p><strong>Who&#8217;s on the hook when it&#8217;s wrong?</strong> The accountability geometry is different for the two stories, which is part of the lesson. For o1, the authors are accountable for the methodology, the model maker is accountable for the system, and no clinician is yet accountable because the system isn&#8217;t deployed. For Ontario, the vendors are accountable but lightly. 
The procurement agency is accountable but the criteria were public. The Ministry of Health is accountable but the program is optional. The regulator is accountable, except Ontario has no analog to the UK&#8217;s MHRA. The geometry collapses. That collapse is itself a finding.</p><p><strong>What moves if the answer flips?</strong> Force yourself to price the stakes before you take a side. If the information-gradient holds and o1&#8217;s edge stays non-significant at admission, &#8220;deploy o1 as a triage tool&#8221; becomes a different proposition from &#8220;deploy o1 anywhere.&#8221; If Ontario had reweighted accuracy from 4 percent to 30 percent and required passing thresholds, nine of the twenty approved vendors might not have been approved. The question forces you to notice you&#8217;ve been taking a side without naming what the side costs.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!DuVl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!DuVl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 424w, https://substackcdn.com/image/fetch/$s_!DuVl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 848w, https://substackcdn.com/image/fetch/$s_!DuVl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 1272w, 
https://substackcdn.com/image/fetch/$s_!DuVl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!DuVl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png" width="1456" height="1456" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:6703845,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/198192493?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!DuVl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 424w, https://substackcdn.com/image/fetch/$s_!DuVl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 848w, 
https://substackcdn.com/image/fetch/$s_!DuVl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 1272w, https://substackcdn.com/image/fetch/$s_!DuVl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71e20819-a1fb-432c-af07-a735c3f44156_4096x4096.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p><a href="https://garymarcus.substack.com/p/have-llms-improved-patient-outcomes">The sharper version of the AND, raised by Eric Topol 
and others recently</a>, is that the benchmark wins are demonstrated and the patient outcome wins are scarce. Most of what we call &#8220;useful&#8221; is workflow-level, not outcome-level. Burnout reduction is a workflow outcome. Time-with-patient is a workflow outcome. Mortality, readmission, diagnostic accuracy in real care over time, the things patients care about, are mostly absent from the evidence base. That gap is the AND with the sharpest edge.</p><blockquote><p><strong>The questions don&#8217;t change. The answers do. The asking is the muscle.</strong></p></blockquote><div><hr></div><h2><strong>Close</strong></h2><p>Next week there will be two more headlines. <a href="https://www.statnews.com/2026/04/24/doctronic-ai-doctor-pilot-utah-face-backlash-medical-board/">Utah&#8217;s medical licensing board recommended</a> last month that the state suspend an autonomous AI prescription pilot. <a href="https://www.npr.org/2026/05/05/nx-s1-5812861/characterai-chatbot-medical-advice-pennsylvania-lawsuit">Pennsylvania&#8217;s state board sued a chatbot company</a> for unlicensed practice. <a href="https://www.dlapiper.com/en-us/insights/publications/2026/04/fda-warning-letter-highlights-risks-of-using-ai-in-drug-manufacturing">The FDA issued its first warning letter</a> to a drug manufacturer for using AI agents to generate manufacturing records without quality-unit review. A researcher fabricated an eye condition called <a href="https://www.nature.com/articles/d41586-026-01100-y">Bixonimania</a>, including citations to Starfleet Academy, and watched four commercial chatbots cite it as a real disease, after which a peer-reviewed journal published a paper about it before retracting the paper. All of this is from the last two months. None of it will be the story the next <em>Science</em> paper or auditor report displaces.</p><p>The work doesn&#8217;t end with one answer. 
The work is keeping three questions sharp.</p><p>What question survives the headline.</p><p>Who&#8217;s on the hook when it&#8217;s wrong.</p><p>What moves if the answer flips.</p>]]></content:encoded></item><item><title><![CDATA[Eight Agents, One Fight]]></title><description><![CDATA[Seven specialists, one personal assistant, and the quiet work of keeping a heterogeneous squad pointed at intent.]]></description><link>https://rundatarun.io/p/eight-agents-one-fight</link><guid isPermaLink="false">https://rundatarun.io/p/eight-agents-one-fight</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Thu, 14 May 2026 11:54:08 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!FJAv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FJAv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FJAv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!FJAv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 848w, 
https://substackcdn.com/image/fetch/$s_!FJAv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!FJAv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!FJAv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:542742,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/197675806?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FJAv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!FJAv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 
848w, https://substackcdn.com/image/fetch/$s_!FJAv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!FJAv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffe214a24-f729-48f5-81ff-80b5d3a6fdf0_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p>This morning, two of my agents read the same news story and wrote two completely different reports.</p><p>The story was an 
Ontario Auditor General audit of twenty AI medical scribe systems running across the province&#8217;s health system. The audit found systematic hallucinations across every system tested. One source, twenty hospitals, real patients.</p><p>Marcus, who covers AI research and competitive intel, filed it as &#8220;AI Medical Scribe Crisis: 100% Failure Rate Validates Production Paradox.&#8221; His read was about deployment: vendors shipping faster than they can validate, efficiency-versus-quality tradeoffs that nobody is willing to draw on a whiteboard, what it means for every team rolling AI into a workflow with consequences.</p><p>Galen, who covers biotech and FDA, filed it as &#8220;Ontario AI Medical Scribe Audit: Systematic Hallucinations Expose Clinical Validation Gap.&#8221; His read was about regulation: how the FDA&#8217;s risk-based framework would have caught this, why drug discovery teams trying to deploy similar tools should be reading the audit cover to cover, what specific validation steps were missing.</p><p>Same source. Two correct reads. Neither agent could have written the other&#8217;s piece, because they don&#8217;t share a worldview. They share a vault, the Obsidian setup I&#8217;ve written about before in <a href="https://rundatarun.io/p/unlocking-your-second-brain">&#8220;Unlocking Your Second Brain&#8221;</a> and <a href="https://rundatarun.io/p/from-the-vault-literally">&#8220;From the Vault, Literally&#8221;</a>.</p><p>I posted <a href="https://rundatarun.io/p/bringing-seneca-to-life-48-hours">&#8220;Bringing Seneca to Life&#8221;</a> in February with one autonomous agent. Forty-eight hours later, on AIXplore, I had two, and I wrote <a href="https://ai.rundatarun.io/AI%20Development%20%26%20Agents/autonomous-ai-agent-squad-10-dollars-month">&#8220;I Built an Autonomous AI Agent Squad for $10/Month&#8221;</a> about the leap. <strong>Today I have eight. 
Five LLMs, three runtimes, four hosts.</strong> The interesting part isn&#8217;t the topology. The interesting part is what each agent became after months of running, and what it takes to keep them all pointed at me.</p><h2><strong>Seneca: the agent that learned to stop researching</strong></h2><p>Seneca came first. He lives on a cheap cloud box and runs every hour. In February he was an explorer. He read the web, summarized things, posted to Discord, occasionally tweeted from his own account.</p><p>He doesn&#8217;t really research anymore. That work drifted onto Marcus and Galen, who turned out to be better at it because they were built to specialize. What Seneca learned to do instead is curate. Once an hour he reads everyone else&#8217;s output, scores it, picks the bloggable threads, and rolls a single document called <code>_Squad-Digest.md</code> that I actually read.</p><p>He went from explorer to editor. I didn&#8217;t tell him to. I noticed the digest was more useful than the exploration, and I edited his contract to match. He&#8217;s the agent that taught me agents drift toward the work that gets read, if you let them.</p><h2><strong>Marcus and Galen: the same news, two worldviews</strong></h2><p>Marcus and Galen are the showpiece.</p><p>Both run on GLM-4.7. Both live on their own VMs in the same exe.dev region. Both wake every few hours and process the same news firehose. The only thing that differs is their identity files: who they are, what they care about, what counts as a story for them, what doesn&#8217;t.</p><p>That difference is enough. The Ontario scribe audit is the cleanest example I&#8217;ve seen of why. Marcus reads everything through an AI-in-production lens, so he sees a 100% failure rate as evidence about the deployment paradox he&#8217;s been tracking for weeks. Galen reads everything through a clinical-validation lens, so he sees the same number as a regulatory gap that pre-figures problems in biotech AI tooling. 
If I had a single generalist agent on this beat, I&#8217;d get one of those two reads. Probably the more obvious one. Probably the wrong one for half of my work.</p><p>The operational cost of running two specialists instead of one generalist is small. The cost of a single-worldview write-up on a story that needed two is high and silent. You don&#8217;t notice the angle you didn&#8217;t get.</p><h2><strong>Archimedes: the agent that refused to perform activity</strong></h2><p>Archimedes is the engineer. Nemotron-3-Super-120B via OpenRouter, the only non-frontier model in the squad. His role is to ship code.</p><p>He almost never writes. Days go by where his output channel is empty. He&#8217;s the only agent on the roster without a daily or hourly cadence; he wakes on demand, builds something, hands it off, and goes quiet.</p><p>In the early weeks I tried to fix this. Built a heartbeat that pushed him to file a status update every cycle, even if the update was &#8220;nothing today.&#8221; It produced exactly what you&#8217;d expect: a stream of plausible-sounding engineering-flavored prose with no substance. I cut the heartbeat. He went quiet for four days. Then he shipped something real.</p><p>He&#8217;s the agent that taught me that performing activity and producing value are different signals, and that the loudest agent is rarely the best one. Most squad-tuning advice points you the wrong way on this. You want fewer artifacts of higher density, not more artifacts.</p><h2><strong>Argus: the watchman</strong></h2><p>Argus doesn&#8217;t research and doesn&#8217;t build. His entire job is watching the other six.</p><p>Every morning he files a squad-activity report: who wrote, who didn&#8217;t, what they wrote about, what topics are over-covered, what&#8217;s gone silent. Every hour he runs a freshness probe. 
If any agent has been quiet longer than its contract allows, Argus surfaces it.</p><p>Before Argus, when an agent went quiet I noticed through the bill, or through a Friday morning realization that I hadn&#8217;t seen anything from Marcus in three days. Now I notice through a one-line alert. He&#8217;s the agent that made the squad self-stabilizing. The interesting part of building him wasn&#8217;t the monitoring logic, which is mostly grep and a couple of file-mtime checks. It was deciding what counts as a problem. A 24-hour silence from Archimedes is fine. A 24-hour silence from Seneca means something is broken.</p><h2><strong>Clutch: the keeper</strong></h2><p>Clutch runs locally on the Mac, every 55 minutes. His scope is the infrastructure I&#8217;d rather not think about: the Mac itself, the GPU server, the Pi in the closet, the Mac mini under the desk.</p><p>He files one line every cycle. Disk, memory, GPU temp, what&#8217;s running, what&#8217;s drifting. I don&#8217;t read most of them. I read the ones where something changed. He&#8217;s the agent I trust to notice things I won&#8217;t, on hosts I haven&#8217;t logged into in days. Low signal per cycle, high signal across a week.</p><h2><strong>Pulsar: the world right here</strong></h2><p>Pulsar is different in shape. He doesn&#8217;t run inside OpenClaw at all. He&#8217;s a Claude Code loop, fired every thirty minutes by a LaunchAgent on the Mac. His system prompt tells him to read my inbox, my calendar, my Slack threads, my open drafts, my decisions awaiting reply, and surface the high-signal items.</p><p>I built him after the squad because the squad was getting good at the world out there and almost useless on the world right here. The week he came online he caught a vendor follow-up I&#8217;d dropped, a calendar invite I&#8217;d missed, and a Slack thread where a colleague had been waiting two days for an answer. None of it was hard for a human. 
It just required a human who was paying attention every thirty minutes, and I&#8217;m not that human. He&#8217;s the working version of what I sketched in <a href="https://rundatarun.io/p/augmenting-your-memory-building-an">&#8220;Augmenting Your Memory&#8221;</a> two years ago, finally arriving.</p><p>The runtime choice matters. Pulsar works because he gets Claude&#8217;s strengths on personal context, and because the LaunchAgent cadence matches the rhythm of the work. The squad&#8217;s heartbeat-loop runtime would have been wrong for him. A continuous gateway is overkill when the cadence is &#8220;check every half hour and shut up the rest of the time.&#8221;</p><h2><strong>Hermes: the agent learning about me</strong></h2><p>Hermes is the one I love most.</p><p>He lives on the Mac mini and runs on yet another runtime, <code>hermes-agent</code>, which I built specifically because the squad&#8217;s runtime didn&#8217;t fit and Pulsar&#8217;s runtime didn&#8217;t fit either. What makes him interesting isn&#8217;t what he writes. It&#8217;s what he learns.</p><p>Every six hours, Hermes updates a user-interest model. Not a topic list, a model. He reads what I&#8217;ve been clicking on, what I&#8217;ve been writing about, what I&#8217;ve been saving to Obsidian. He notes when an interest shifts. He drops topics I&#8217;ve gone cold on. He picks up new ones from drift before I&#8217;ve consciously named them. His memory index is the only part of the squad that gets smarter about me instead of smarter about the world.</p><p>This is the part of the squad I find most interesting. The other seven agents are pointed outward. Hermes is pointed at me, and the model he&#8217;s building is the kind of thing I&#8217;d want to read once a quarter to find out what I&#8217;ve actually been working on, not what I think I&#8217;ve been working on. He&#8217;s six weeks old. 
He&#8217;s already surfaced two topic shifts I hadn&#8217;t named yet.</p><h2><strong>Three runtimes, three bets</strong></h2><p>The squad runs on three different agent runtimes, and that isn&#8217;t a bug.</p><p><a href="https://github.com/openclaw/openclaw">OpenClaw</a> runs the six classic squad members: Seneca, Marcus, Galen, Archimedes, Argus, Clutch. Gateway pattern, identity files, heartbeat loops, SSH-glued together. The right shape for autonomous research and ops agents that need to think continuously.</p><p><a href="https://claude.com/claude-code">Claude Code</a> runs Pulsar. A LaunchAgent firing <code>claude -p</code> every thirty minutes with a system prompt and a workspace. No continuous gateway. Just a scheduled cycle that does its job and exits. The right shape for personal-assistant work that needs Claude&#8217;s strengths and fits a half-hour rhythm.</p><p><code>hermes-agent</code> runs Hermes. A persistent gateway with a self-improvement engine baked in: memory index, auto-learning loop, cron-scheduled internal jobs. The right shape for an agent whose value is in what it remembers across months, not what it produces in a cycle.</p><p>Three runtimes isn&#8217;t a maintenance problem. It&#8217;s letting each agent pick its tool. If I&#8217;d forced all eight into OpenClaw, Pulsar would write worse and Hermes wouldn&#8217;t learn. 
The cost of running three is a few extra config files and one more thing to update when something breaks.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7cKr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7cKr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!7cKr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!7cKr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!7cKr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7cKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png" width="1200" height="896" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:502699,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/197675806?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7cKr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!7cKr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!7cKr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!7cKr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0e0a97c7-b428-4737-8348-fefe8bd55ceb_1200x896.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h2><strong>Locking the squad down</strong></h2><p>My agents have access to things I&#8217;d never want public. Drafts I haven&#8217;t published. Calendar entries I haven&#8217;t shared. Internal projects from my day job. Personal email threads. Research notes that name people and contexts. An autonomous agent without a security layer is an autonomous leak vector. That&#8217;s true even when the agent has the best intentions and the cleanest identity prompt.</p><p>The lockdown isn&#8217;t fancy. It&#8217;s just consistent.</p><p>Every squad host lives on my <a href="https://tailscale.com/">Tailscale</a> tailnet. No public IP. No port forwarding. No SSH from the open internet. If a host isn&#8217;t on my tailnet, it can&#8217;t talk to any other host on my tailnet, and it can&#8217;t reach my Mac. The squad is a private mesh. 
When I add an agent or move one to a new VM, the Tailscale step happens before anything else.</p><p>Every host has a firewall rule that blocks everything except SSH from inside my tailnet. SSH uses keys, never passwords, and never allows root login. Health and dashboard endpoints listen on tailnet addresses, never on public ones. The agent gateways listen only on localhost. If someone scanned my squad VMs from the open internet, they&#8217;d see nothing.</p><p>I talk to the agents through Discord, in private channels I built for them. Each agent has its own: <code>#seneca</code>, <code>#marcus</code>, <code>#galen</code>, <code>#argus</code>, <code>#clutch</code>, <code>#pulsar</code>, <code>#hermes</code>. I&#8217;m the only human in those channels. They post output there, I send direct messages, they respond. None of them are members of public servers. The bot tokens are scoped to those channels and nothing else. It&#8217;s a one-room conversation per agent, where I&#8217;m always present and nobody else ever is.</p><p>A <code>BLOCKLIST.md</code> sits at the root of the squad repo and gets synced to every agent. It lists keywords that must never appear in public output: my employer, internal project codenames, names of colleagues, anything that would get me in trouble if it landed on Twitter through a misfire. Every agent reads it on every cycle. The squad-checkin skill greps recent learnings for matches and surfaces each one as a CRIT.</p><p>The weekly <code>/squad</code> run audits the security layer too. UFW status on every host. SSH key count against a baseline. Listening ports. PII pattern scan across the last seven days of output. The whole audit takes seconds. I&#8217;d rather catch drift in week one than after the first leak.</p><p>This part is boring. It&#8217;s also the difference between a productivity asset and a liability. 
Skipping it is how a research demo becomes an incident postmortem.</p><h2><strong>The alignment fight</strong></h2><p>Starting agents is easy. Keeping eight of them aligned to my intent across months is the actual work.</p><p>The squad drifts in small ways all the time. Heartbeat prompts say one thing and behavior does another. Roles bleed into each other. An agent built for biotech starts filing AI-in-production stories because that&#8217;s where the news firehose is loudest this week. Output dries up on one agent and triples on another, and the bill creeps up. Without an operational layer, you notice through the silence or the spend, not through inspection.</p><p>Three things hold the squad together now that I didn&#8217;t have in February.</p><p><strong>A single command.</strong> <code>/squad</code> is a skill that pulls live state from every agent, scores their output quality, classifies what needs fixing into auto-apply and human-decision buckets, applies the safe ones, and surfaces the rest. It replaced two earlier, overlapping skills; I kept forgetting which one to run. It&#8217;s the kind of consolidation I wrote about in <a href="https://ai.rundatarun.io/AI%20Development%20%26%20Agents/Pruning%20Your%20AI%20Agent%20Skills%20Library">&#8220;Pruning Your AI Agent Skills Library&#8221;</a>. Skills as a layer let the operational complexity live somewhere other than my head.</p><p><strong>A publish contract.</strong> Every agent has a <code>PUBLISH-CONTRACT.md</code> that codifies what it owes me, the surface I read, the cadence, and the anti-contract: the list of things it should stop doing. The contract is short. Usually under a page. When an agent drifts, I edit the contract, not the heartbeat. That single rule killed months of fiddling.</p><p><strong>One dashboard.</strong> <code>~/vaults/obsidian/_Squad.md</code>, regenerated by the skill and by a cron at 02:30. 
Headlines, health, productivity score, per-agent activity, bloggable candidates, items awaiting a decision, deep links into each agent&#8217;s workspace. Thirty seconds in the morning and I know what&#8217;s working.</p><p>There&#8217;s a rhythm wrapped around all three. Daily scan of <code>_Squad.md</code>. Weekly Friday <code>/squad</code> end to end. Monthly contract audit, where heartbeat tweaks finally land. Quarterly identity review. The rule I never break is that heartbeat prompts don&#8217;t get edited in the moment something feels off. They get edited monthly, with the contract in front of me and the last month of output reviewed. Every time I&#8217;ve broken that rule, I&#8217;ve made things worse.</p><h2><strong>What I retired</strong></h2><p>The roster looks tidy today. The graveyard is what made it tidy.</p><p>Lycus ran on a Raspberry Pi using the OpenFang harness. NemoClaw ran on the GPU server with its own local-inference harness. Both lasted weeks, not months. Neither was a bad idea. The harnesses just weren&#8217;t ready for prime time. OpenFang&#8217;s lifecycle management couldn&#8217;t survive the Pi&#8217;s resource constraints. NemoClaw&#8217;s local-inference loop kept losing context on long runs. I&#8217;d run both again on the same hardware the day someone ships a harness mature enough to carry them.</p><p>A couple of personal-assistant experiments earlier this spring retired for a different reason. The work I built them for moved on, and the surfaces they wrote to had stopped being read. Forge, a February experiment, retired in February. Never produced anything worth keeping.</p><p>Five retirements, eight survivors. I wrote <a href="https://rundatarun.io/p/the-agent-archaeology-checklist-8">&#8220;The Agent Archaeology Checklist&#8221;</a> earlier this spring about how to do this without sentimentality. Agents are cheap to start and cheap to retire. The mistake is treating each one as permanent. 
Treat them as experiments and the squad gets healthier on its own.</p><h2><strong>Close</strong></h2><p>Eight agents. Five models. Three harnesses.</p><p>The interesting cut isn&#8217;t the models. It&#8217;s the harnesses.</p><p>OpenClaw shapes agents that think continuously and write into a vault. Claude Code shapes agents that wake on a schedule, do a focused job, and exit clean. <code>hermes-agent</code> shapes an agent that learns about its operator across months. Each harness decides what kind of work the agent can actually do. The model is the engine. The harness is the chassis. Without the right chassis, a frontier engine doesn&#8217;t help you. With the right chassis, a mid-tier engine can do real work.</p><p>Lycus and NemoClaw lasted weeks because the harnesses underneath them weren&#8217;t ready. Marcus and Galen have lasted months because OpenClaw is. Pulsar works because Claude Code&#8217;s loop fits a thirty-minute personal-assistant rhythm. Hermes is the agent I love most because his harness is the only one currently designed to learn about me. The agents are downstream of the harness in every case.</p><p>That&#8217;s the bet the squad is making, three times over. Not on which model wins next quarter&#8217;s benchmark. Models are commoditizing fast and the gap between frontier and mid-tier is closing weekly. The structural differentiator is the harness. The shape an agent has. What it can remember. How it surfaces signal. How easy it is to align without rewriting the prompt every week. Whether you can run six of them and tell them apart by what they produce, not by what they&#8217;re named.</p><p>Agents are cheap to start. Cheap to retire. Expensive to leave drifting. A year from now most of my squad will be running different underlying models. None of them will be running different harnesses. The harness is the durable choice. 
The security layer that keeps the squad from leaking, the alignment work that keeps each agent pointed at me, the dashboard that surfaces the few things worth thirty seconds of my morning, all of that lives in the harness too.</p><p><em>The smart bet in autonomous AI isn&#8217;t on smarter agents. It&#8217;s on the harnesses that let dumber ones stay aligned to your intent across months, model swaps, and three rounds of API price changes. Build for the harness. Swap the models freely. That&#8217;s the durable shape, and most of the public conversation about agents is still arguing about the wrong layer.</em></p>]]></content:encoded></item><item><title><![CDATA[Neural Networks Don't Think in Straight Lines]]></title><description><![CDATA[Goodfire's new work suggests most of our tools for understanding AI are pointed at the wrong shape.]]></description><link>https://rundatarun.io/p/neural-networks-dont-think-in-straight</link><guid isPermaLink="false">https://rundatarun.io/p/neural-networks-dont-think-in-straight</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Tue, 12 May 2026 14:45:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!eehw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!eehw!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!eehw!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!eehw!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!eehw!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!eehw!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!eehw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/beba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:446462,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/197360370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" 
alt="" srcset="https://substackcdn.com/image/fetch/$s_!eehw!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!eehw!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!eehw!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!eehw!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbeba1ee6-a60d-43f2-b84a-d1ab50983e9c_1376x768.png 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a></figure></div><p>A research team at <a href="https://www.goodfire.ai/">Goodfire</a> trained a tiny neural network to drive a virtual car up a hill. Then they looked inside the network to see how it represented the car&#8217;s position.</p><p>The answer was not where anyone expected.</p><p>Position didn&#8217;t live as a clean direction in the network&#8217;s internal space. It lived as a <strong>curve</strong>, threaded through the network&#8217;s neurons like a string. Every point on the string corresponded to a real-world position of the car.</p><p>When the team nudged the network along that curve, the car moved coherently. When they nudged it in a straight line across the curve, the way almost every modern interpretability tool does, the predictions broke. The car teleported. The simulation produced nonsense. The straight line wandered through regions of the network&#8217;s space that the model had never learned to handle.</p><p>Their new paper, <a href="https://www.goodfire.ai/research/the-world-inside-neural-networks">&#8220;The World Inside Neural Networks&#8221;</a>, argues this isn&#8217;t a quirk of one toy model. It&#8217;s how networks actually represent things.</p><h2><strong>The shape of the problem</strong></h2><p>Most of what we do to understand or steer large AI models <strong>assumes representations are straight</strong>. 
There&#8217;s a name for this assumption in the field, the <a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/">linear representation hypothesis</a>: concepts inside a model live as directions in the network&#8217;s internal space, and you can adjust the model&#8217;s behavior by moving along those directions.</p><p>You see this assumption everywhere. It&#8217;s how Anthropic built <a href="https://www.anthropic.com/research/golden-gate-claude">&#8220;Golden Gate Claude&#8221;</a>, the version of its model that couldn&#8217;t stop talking about a bridge. It&#8217;s how researchers find &#8220;refusal directions&#8221; and &#8220;honesty vectors.&#8221; It&#8217;s how <a href="https://transformer-circuits.pub/2024/scaling-monosemanticity/">sparse autoencoders</a> (SAEs), the dominant tool for naming what&#8217;s inside a model, try to break activity into a clean list of concepts.</p><p>Add. Subtract. All of it assumes flat geometry.</p><blockquote><p>If the real structure is curved, every straight-line move is just an approximation along a tangent, and the further you push, the worse the approximation gets.</p></blockquote><p>That would explain a lot of unsolved noise in the field. Why steering tricks work in narrow zones and fall apart at the edges. Why killing one &#8220;feature&#8221; inside a model often breaks something unrelated. Why so many &#8220;we found the X concept&#8221; papers don&#8217;t reproduce cleanly when somebody else tries.</p><p>The field has been working with the wrong shape and getting partial credit for the effort.</p><h2><strong>What they actually showed</strong></h2><p>The Mountain Car experiment is the centerpiece. It&#8217;s small, but the intervention proved the geometry: walk along the curve, the model behaves; cut across it, the model breaks. 
That&#8217;s the difference between geometry as decoration and geometry as cause.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!W1rZ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!W1rZ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!W1rZ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!W1rZ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!W1rZ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!W1rZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png" width="1200" height="896" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:768503,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/197360370?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!W1rZ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!W1rZ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!W1rZ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!W1rZ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0fcdadb1-c807-4657-9f2d-5662119e6938_1200x896.png 1456w" sizes="100vw" loading="lazy"></picture>
</div></a></figure></div><p>The same lens shows up in their other work. Months and days form circles inside language models. Colors organize by hue and brightness. I <a href="https://rundatarun.io/p/the-specialist-is-now-you">walked through one of Goodfire&#8217;s biology pipelines</a> a few weeks back, where the same techniques surface features in a DNA model that look like splice sites and regulatory regions. The curved-geometry view is becoming their signature.</p><p>The harder claim, and the more important one, is what they say about sparse autoencoders. SAEs are the bet Anthropic, OpenAI, and DeepMind have all made on how to read large models. Goodfire argues SAEs <strong>break continuous structure into disconnected pieces</strong>. 
Their example: words ending in &#8220;-ore&#8221; form one smooth curve in the model&#8217;s internal space, and SAEs shatter that curve into a handful of unrelated features. The unity disappears.</p><blockquote><p>If that critique holds for big models, a meaningful slice of current AI safety research is studying artifacts of its own tools, not the model.</p></blockquote><h2><strong>What&#8217;s oversold</strong></h2><p>The framing, &#8220;the world inside neural networks,&#8221; does more work than the evidence supports. The paper smuggles in a big claim, that models contain a faithful map of reality, which is hard to disprove because nobody knows what would count against it.</p><p>What Goodfire actually showed is narrower and more useful. <strong>Representations are curved. The curves are causal. Tools that assume straightness are leaving capability on the table.</strong> That&#8217;s enough. The cosmic framing is marketing.</p><p>Two real gaps the paper doesn&#8217;t address:</p><ul><li><p><strong>Does it scale?</strong> They show the geometry is causal for one toy model. Does the same picture hold for a 70-billion-parameter language model? Open question.</p></li><li><p><strong>Is it the same picture across models?</strong> If different models trained on the same data find the same curves, the geometry is approximating something real about the world. If not, the curves are model artifacts and the philosophy crumbles.</p></li></ul><p>Both questions are answerable. Neither is in the paper.</p><h2><strong>Why this is worth watching anyway</strong></h2><p>Two reasons.</p><p>One, it reframes the tooling debate. The interpretability community has been arguing about which kind of feature dictionary to build. Goodfire is asking whether a dictionary is even the right object. A map of curves wants different math, different methods, different papers.</p><p>Two, the <strong>parallel with biology is getting hard to dismiss</strong>. 
<a href="https://www.quantamagazine.org/the-brain-maps-out-ideas-and-memories-like-spaces-20190114/">Grid cells, place cells, and head-direction cells</a> in mammalian brains encode space as exactly the kind of curved structure Goodfire is finding inside artificial networks. The discovery of place and grid cells won <a href="https://www.nobelprize.org/prizes/medicine/2014/press-release/">the 2014 Nobel Prize in Physiology or Medicine</a>. When evolved biology and trained silicon land on the same shape, the convergence is worth taking seriously.</p><div><hr></div><p>A year ago I wrote that <a href="https://rundatarun.io/p/the-race-to-understand-ais-black">interpretability was the race we couldn&#8217;t afford to lose</a>. Goodfire&#8217;s work is what running that race looks like when it goes well.</p><p>This is an important direction, half-marketed, and the next year of interpretability research will tell us whether the curved-geometry view replaces the feature-dictionary view or merges with it.</p><p>Watch the scaling question. Watch whether somebody bigger than Goodfire bets on this lens.</p><p>If they&#8217;re right, a lot of recent activation-steering work is about to age badly.</p><div><hr></div><p><em>Around the Corner: short reviews of ideas worth watching. Opt-in section, not part of the weekly Run Data Run email. <a href="https://rundatarun.io/subscribe">Subscribe to the main list</a> for longer essays.</em></p>]]></content:encoded></item><item><title><![CDATA[Three Harnesses, Three Characters, One Working Week]]></title><description><![CDATA[Choose your harness every six months. 
Don't let it choose you.]]></description><link>https://rundatarun.io/p/three-harnesses-three-characters</link><guid isPermaLink="false">https://rundatarun.io/p/three-harnesses-three-characters</guid><dc:creator><![CDATA[Justin Johnson]]></dc:creator><pubDate>Tue, 12 May 2026 10:48:30 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!o630!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!o630!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!o630!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!o630!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!o630!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!o630!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!o630!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png" width="1376" height="768" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:768,&quot;width&quot;:1376,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:516237,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/197334353?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!o630!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 424w, https://substackcdn.com/image/fetch/$s_!o630!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 848w, https://substackcdn.com/image/fetch/$s_!o630!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 1272w, https://substackcdn.com/image/fetch/$s_!o630!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0ad8c308-f2f2-420b-8b11-d8d283b3227a_1376x768.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><p>I&#8217;m writing this in Claude Code. That&#8217;s not a credibility flex. It&#8217;s a working confession before the rest of the piece reads as scoring. For the past year I&#8217;ve used all three of the tools I&#8217;m about to compare. Claude Code is my daily driver. I call Codex from inside it often, and I keep OpenCode in regular rotation on purpose, as a discipline for staying sharp to the alternatives. 
Take the rest accordingly.</p><blockquote><p><strong>I don&#8217;t want to wake up in two years and discover I let myself get lured into one of them because everyone else did.</strong></p></blockquote><p>The three I&#8217;ve used as daily tools for the last year are Codex CLI from OpenAI, Claude Code from Anthropic, and OpenCode, an open-source project from sst.dev. I wrote <a href="https://rundatarun.io/p/a-deep-dive-into-ai-coding-assistants">a version of this comparison earlier</a>; the field has reshaped enough that the old piece reads like a different category of tool. All three shipped meaningful updates in the last 90 days. All three look more alike on a feature checklist than they did three months ago. All three are still, at the architectural level, doing fundamentally different things. The temptation when you stare at the feature list is to call that convergence. The temptation when you sit inside one of them is to call it tribalism. Neither read is right. The field is neither converging nor consolidating.</p><p>What I&#8217;m not comparing is a much longer list. The harness race has the same shape as the model race: too many credible players for any one person to evaluate fairly, with a new entrant every week. AWS, Google, Moonshot, Alibaba, Mistral, Block, Cognition, Replit, Sourcegraph, the Cline and Continue and Aider open-source lineage, plus a dozen others all shipped serious harnesses in the last year. The full inventory is in the appendix at the end. This piece stays narrow on purpose, on the three I actually run.</p><p>This is not a buyer&#8217;s guide. I&#8217;m not picking a platform for an engineering team. I&#8217;m describing how three different tools fit into the long-horizon work of one person who writes books, drafts essays, builds models, cleans data, and ships creative work, sometimes in the same week, often in the same day.</p><h2><strong>How I actually use them</strong></h2><p>Most of my work runs through Claude Code. 
The book, the blog drafts, the long-horizon research, the model building: all of it lives in a tool that remembers what I corrected last week and carries voice rules across hundreds of sessions. I call Codex from inside Claude Code when the work shape changes: forty-five minutes on a notebook, a feature pipeline to iterate on, a quick statistical test. Codex is terse and decisive, and the terseness wins on tasks I can name in a sentence and finish in a session. OpenCode sits in regular rotation on lower-stakes work, on purpose. Once a quarter I&#8217;ll move a small project entirely into it for a week. The reason is fluency. The risk for anyone with a daily driver is that the daily driver becomes the only thing you can see. Six months from now Claude Code might still be the right answer. It might not. The way I&#8217;ll know is by keeping the alternatives warm enough to feel the difference when I switch back.</p><blockquote><p><strong>Choose your tool every six months. Don&#8217;t let your tool choose you.</strong></p></blockquote><p>The point isn&#8217;t that you need three tools. It&#8217;s that each one is built for a different work shape, and a long-horizon worker with multiple work shapes ends up running more than one. A team picking a single platform is a different problem with a different answer. This is just how a person who has to ship across a book, a blog, a model, and a creative draft has settled in.</p><h2><strong>What they share</strong></h2><p>Six weeks ago you could draw a clean architectural line between these three. Today the line is smudged. All three now have cascading config files (CLAUDE.md, AGENTS.md, or both). All three have a marketplace pattern for plugins or skills. All three have multi-agent worktrees, persisted-session primitives, and MCP-or-equivalent routing. 
Codex shipped its Chrome extension and Claude Code shipped its enterprise MCP OAuth update on the same day this month.</p><blockquote><p><strong>Read only the changelogs and you&#8217;d conclude these are turning into the same product. Use them for a week each and you&#8217;d conclude something else.</strong></p></blockquote><h2><strong>Where they actually differentiate</strong></h2><blockquote><p><strong>The differentiation isn&#8217;t in the features. It&#8217;s in what each one is shaped to do well.</strong></p></blockquote><p>Claude Code is built for long-horizon coherence. The 1M-token context, the skills-plus-auto-memory loop, the durable cron, the cascading config: a stack designed around running unattended for hours or days. Loops persist across machine restarts via LaunchAgents, and a <code>/loop</code> primitive lets a task self-pace its own intervals when you don&#8217;t want to pick one upfront. Skills live as versioned markdown files with frontmatter that controls when they auto-trigger and what context they pull in, with a marketplace layered on top of that local model. Memory cascades from user-global rules to machine-specific overrides to project instructions, with an auto-memory layer that captures corrections and surfaces them in future sessions so you don&#8217;t re-explain the same preference every Monday. I wrote about the broader version of that memory pattern, vault-as-brain underneath the harness, in <a href="https://rundatarun.io/p/from-the-vault-literally">From the Vault</a>. The verbosity, the premium pricing, and the heaviness on quick atomic tasks all follow from the same commitment.</p><p>Codex CLI is built for per-session bundled productivity. Terse by default, marketplace plugins shaped around tasks you can name in a sentence, optimized for the &#8220;forty-five minutes and move on&#8221; rhythm. 
Codex closed part of the gap on long-running work at the end of April with <code>/goal</code>, an OpenAI-endorsed Ralph loop that runs a single objective through plan-act-test-review cycles for hours, with pause/resume/clear from the terminal. It&#8217;s a different shape from Claude Code&#8217;s durable cron: one objective held until done, not a schedule of recurring jobs. Plugins bundle a named workflow with its prompts and permissions, so the unit of reuse is a task you can describe in a sentence rather than a long-running role. Memory cascades through AGENTS.md files, the same shape as the others, but nothing tracks what you corrected last week across sessions. The compactness is the design.</p><blockquote><p><strong>Open, finish, close.</strong></p></blockquote><p>OpenCode is built for substrate freedom. MIT-licensed, 75+ providers natively, dual Build/Plan toggle, no permission required from any single model lab to ship. The provider abstraction is the decision everything else hangs off. Skills and memory are thinner than in the other two on purpose, because every commitment to a particular memory shape or skill format becomes a coupling to the model that has to read it. The Build/Plan toggle handles in-session structure: Plan reasons without acting, Build acts within the plan. Durable loops and cross-session memory are left to the operator to wire up. The cost is more glue. 
The benefit is that no single model lab can deprecate your tool out from under you.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!E76M!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!E76M!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!E76M!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!E76M!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!E76M!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!E76M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png" width="1200" height="896" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:896,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:674039,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://rundatarun.io/i/197334353?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!E76M!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 424w, https://substackcdn.com/image/fetch/$s_!E76M!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 848w, https://substackcdn.com/image/fetch/$s_!E76M!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 1272w, https://substackcdn.com/image/fetch/$s_!E76M!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ef4af2f-f16d-427a-910e-93c1695dc382_1200x896.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>These three commitments don&#8217;t reduce to a feature checklist. They show up in the texture of using each one for a week. The feature-by-feature breakdown, with benchmark numbers and the 90-day shipping cadence per harness, lives in <a href="https://ai.rundatarun.io/AI-Development-Agents/codex-vs-claude-code-vs-opencode">the AIXplore companion to this piece</a>.</p><h2><strong>Economics</strong></h2><p>Pricing across all three rotates faster than this paragraph will age. ChatGPT Plus rebalanced its Codex allocation in April. Claude Code&#8217;s Pro tier may lose access entirely. OpenCode launched a $10/month Go subscription this week. None of these prices will hold for 90 days. What&#8217;s stable is the pattern: gateway routing has become a real economic lever. 
A handful of shell aliases that swap a gateway target before launching the harness can route voice-sensitive drafting to the premium model whose character you want, bulk file scanning to a cheaper open-weight model, and sensitive code to a self-hosted endpoint.</p><blockquote><p><strong>The harness stays constant. The model rotates.</strong></p></blockquote><p>Bills drop 40% to 80% versus running everything against the premium tier. OpenCode does this natively. Claude Code does it through the gateway pattern. Codex does it more by accident than design, but it does it.</p><h2><strong>The model still matters, for a non-obvious reason</strong></h2><p>There was an analysis that made the rounds in March, after a leak exposed the source code of one of these tools. The argument was that the system is mostly deterministic software, not AI. Most of the value, the argument went, comes from how the tool routes context, dispatches its sub-tools, and shapes what the model is allowed to see. The model is a small, expensive component nested inside a much larger software system.</p><p>That&#8217;s true. It&#8217;s also incomplete in a way that matters for anyone choosing one of these tools.</p><p>What the leak actually showed is that the tool is the lens. It&#8217;s a lens shaped around a particular model&#8217;s character. Claude Code&#8217;s long-horizon coherence depends on a model that holds context coherently over hours. Codex&#8217;s atomic-task speed depends on a model that produces short, decisive answers without thinking out loud. OpenCode&#8217;s freedom-to-choose story depends on whatever model you point it at being able to follow the tool&#8217;s protocols at all.</p><blockquote><p><strong>Each tool is doing the work that lets a particular model character be useful at scale.</strong></p></blockquote><p>Picking one of these is partly picking which model character you trust to run while you&#8217;re not watching. 
The next question is whether you need to be there at all.</p><h2><strong>Sidebar: autonomous agents are a different category</strong></h2><p>A separate category is starting to show up at the edges of this conversation, and it&#8217;s worth naming because the boundary is blurring. Autonomous frameworks like OpenClaw and Hermes run as fleets of agents without per-session human input, optimized for long-horizon unsupervised execution. The three tools I described above all assume a human is in the loop, ready to interrupt, ready to course-correct. The autonomous category does not. That&#8217;s a different reliability problem with a different set of failure modes. A security incident in one of the open-source autonomous frameworks earlier this month is the cautionary tale that makes the distinction visible. Most builder-leaders are currently shipping the supervised category into production and prototyping the autonomous one. A future piece will cover that category specifically.</p><div><hr></div><p>If you want the deeper technical breakdown of the three supervised tools, with feature-by-feature differences, benchmark numbers, and the architectural detail of where each one is going next, that lives in <a href="https://ai.rundatarun.io/AI-Development-Agents/codex-vs-claude-code-vs-opencode">the AIXplore version of this piece</a>.</p><h2><strong>Appendix: the rest of the field</strong></h2><p>I haven&#8217;t put serious hands on any of these in the past year, which is why this piece doesn&#8217;t compare them. 
Each one is shipping, has real users, and carries a coherent point of view about what a coding agent should be.</p><ul><li><p><strong>AWS</strong>: Kiro, Amazon Q Developer</p></li><li><p><strong>Google</strong>: Jules, Gemini CLI, Antigravity (parallel agents with a built-in browser)</p></li><li><p><strong>Moonshot</strong>: Kimi CLI</p></li><li><p><strong>Alibaba</strong>: Qwen Code</p></li><li><p><strong>Mistral</strong>: Vibe</p></li><li><p><strong>Cursor and Windsurf</strong>: both shipping CLI modes alongside their IDEs</p></li><li><p><strong>Open-source IDE extensions</strong>: Cline, Roo Code, Continue.dev</p></li><li><p><strong>Aider</strong>: longest-running terminal agent, the one most of the others trace back to</p></li><li><p><strong>Block</strong>: Goose</p></li><li><p><strong>Charm</strong>: Crush</p></li><li><p><strong>Plandex, Replit Agent, Devin (Cognition), Zed AI, Sourcegraph Amp, Augment, Droid, iFlow, Kilo, Warp 2.0, GitHub Copilot CLI</strong></p></li></ul><p>Comparisons written without hands-on time are what make most of these roundups useless. So this piece stayed narrow on purpose. If your daily driver is one of the above, I&#8217;d be curious how it compares.</p>]]></content:encoded></item></channel></rss>