Run Data Run: Around the Corner

The skill that edits its own instructions

Justin Johnson — Fri, 05 Jun 2026 01:39:26 GMT

One of the routines I run a few times a day rewrote a line of its own instructions this week, and I let it.

The routine checks on a fleet of small agents I keep running, fixes what it knows how to fix, and flags the rest for me. The new part is the last step. Before it signs off, it reads back over the run and asks one question: what did this teach me that my instructions did not already cover? When the answer is real, it edits the checklist it works from, so the next run starts a little smarter.

It did not change the model. It changed its skill, the separate written set of instructions a tool loads when the task comes up. I thought this was a clever trick I had built. Then I went looking, and found half the field had built it too.

Why skills are the unit

A skill is less exotic than it sounds. It is a written set of instructions, plus any helper files, that an assistant pulls off the shelf when a task matches: a checklist for triaging your inbox, a runbook for closing the monthly books, the house style your team writes in. Claude Code keeps each one in its own folder and loads it only when it is relevant.

Here is why it reaches anyone who does not write code. The model is rented, and it gets swapped for a better one every few months. The skill is the part you own. It is where your specific know-how lives, the institutional memory that survives the upgrade. Most teams pour that knowledge into documents nobody opens twice. A skill is the same knowledge in a form the assistant actually uses, every time the work comes up. And it has crossed from a Claude Code feature to an open standard: the same plain-text file now runs across Codex, Cursor, Copilot, and more than thirty other tools.

My clever trick wasn't clever

So I swept the last thirty days to see who else was doing this. The answer was humbling and useful: nearly everyone.

The pattern I thought I invented is written up as a recipe: a reflection step that runs after a skill is used, asks whether it helped, and proposes an edit to its own file. Anthropic's own skill builder does a sharper version, splitting your examples into train and test sets and keeping only the change that scores better on the held-out half. Microsoft's SkillOpt tunes a skill's written instructions the way you would train a model, and its standout result is the one I care about: a skill tuned inside one tool kept almost all of its gain when they moved it to another. The know-how lived in the skill, not the software. The pattern is already shipping inside products that crystallize skills as you work.
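A minimal sketch of the held-out gate, assuming a skill is plain text and that some runner can score it against example pairs. The names here (`split_examples`, `keep_if_better`, `toy_runner`) are hypothetical illustrations, not the API of any of the tools above, and the toy runner stands in for a real model call.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, str]            # (input, expected output); hypothetical shape
Runner = Callable[[str, str], str]   # (skill_text, input) -> output; stand-in for a model call

def split_examples(examples: List[Example], test_fraction: float = 0.5, seed: int = 0):
    """Shuffle deterministically, then split into (train, held_out)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - test_fraction))
    return shuffled[:cut], shuffled[cut:]

def score(run: Runner, skill_text: str, examples: List[Example]) -> float:
    """Fraction of examples where the skill produces the expected output."""
    if not examples:
        return 0.0
    return sum(run(skill_text, x) == y for x, y in examples) / len(examples)

def keep_if_better(run: Runner, current: str, candidate: str, held_out: List[Example]) -> str:
    """Accept the candidate edit only if it strictly beats the current skill on held-out data."""
    if score(run, candidate, held_out) > score(run, current, held_out):
        return candidate
    return current

def toy_runner(skill_text: str, text: str) -> str:
    """Toy stand-in: the 'skill' only matters if it mentions uppercase."""
    return text.upper() if "uppercase" in skill_text else text

examples = [("a", "A"), ("b", "B"), ("c", "C"), ("d", "D")]
train, held_out = split_examples(examples)
winner = keep_if_better(toy_runner, "v1: echo", "v2: uppercase the input", held_out)
```

The train half is where a candidate edit would be drafted. Only the held-out half decides whether it stays, which is what keeps a skill from fitting itself to one lucky run.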

The thing I built to make my own setup compound turned out to be a pattern the whole field is converging on. My problem was not unique, which is exactly why the answer is worth keeping.

The scale is the real headline. One directory has now scraped 1.6 million of these skill files off public GitHub, up from the roughly 790,000 a research team catalogued six months ago. Which is also the catch.

What's oversold

Two things to hold back on, both visible in that same sweep.

The flood is real, and most of it is noise. The ecosystem went from empty to crowded in about six months, and the directories admit most skills barely trigger or quietly burn context. A skill that edits itself inside that flood does not automatically improve. It compounds whatever it already was, noise included, unless you gate the edits.

Write-once-run-everywhere is also sold harder than it ships. The open standard is real, but the formats are still converging, not interchangeable. A skill that sings in one tool can stumble in the next.

Why this is worth watching

I have written before that the real skill is delegation, not automation, knowing what to hand off and when to step back in. A skill that edits itself is the next turn of that screw, and the sweep handed me the guards that separate the durable version from the noise. Don't edit on a fluke: a problem has to recur before it earns a permanent change. Prove the new rule works: the added check has to catch the thing it was written for before it stays. Never re-add what you deleted. None of that is exotic. It is what you would want from a sharp junior teammate keeping the runbook current.
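The three guards fit in a few lines. This is a sketch under stated assumptions: `SkillFile`, `propose_rule`, and the two-sighting threshold are hypothetical names and values chosen for illustration, and `catches_problem` stands in for whatever check proves a new rule works.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set

@dataclass
class SkillFile:
    rules: List[str] = field(default_factory=list)
    deleted: Set[str] = field(default_factory=set)           # rules removed on purpose
    sightings: Dict[str, int] = field(default_factory=dict)  # problem -> times seen

def propose_rule(
    skill: SkillFile,
    problem: str,
    rule: str,
    catches_problem: Callable[[str], bool],
    min_sightings: int = 2,  # assumed threshold for "recurring"
) -> str:
    """Gate a self-edit: recurrence first, then proof, and never re-add a deleted rule."""
    skill.sightings[problem] = skill.sightings.get(problem, 0) + 1
    if rule in skill.deleted:
        return "rejected: previously deleted"
    if rule in skill.rules:
        return "skipped: already present"
    if skill.sightings[problem] < min_sightings:
        return "deferred: not yet recurring"
    if not catches_problem(rule):
        return "rejected: rule did not catch the problem"
    skill.rules.append(rule)
    return "accepted"

skill = SkillFile(deleted={"restart everything nightly"})
first = propose_rule(skill, "stale lockfile", "clear lockfile before start", lambda r: True)
second = propose_rule(skill, "stale lockfile", "clear lockfile before start", lambda r: True)
```

Here `first` is deferred because the problem has only been seen once, and `second` is accepted because it recurred and the check passed.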

That loop is the real point, and it is bigger than skills. I write up something I think is clever, sweep the field to find the dozen people who already hit it, take the best of what they worked out, fold it back into my own setup, and share the result forward. The writing and the searching are not separate from the building. They are how the building compounds. This post is that loop turning once.

The systems getting the press rewrite their own code against a scoreboard. The skills that will run your operations next year just keep better notes, in a file you can read, borrow the best ideas from everyone else, and get a little sharper every time they run.

Around the Corner: short reviews of ideas worth watching. Opt-in section, not part of the weekly Run Data Run email. [Subscribe to the main list](https://rundatarun.io/subscribe) for longer essays.

Opus 4.8 and Workflows - One Careful Pass Is No Longer the Default

Justin Johnson — Fri, 29 May 2026 09:14:52 GMT

Anthropic shipped Opus 4.8 yesterday. The model bump alone is not the story.

The story is that two other things shipped beside it. A new Claude Code primitive called Dynamic Workflows, which lets Claude write its own orchestration scripts and fan out to dozens of parallel subagents with adversarial verifiers built in. And a 3x cut to Opus Fast pricing, from $30/$150 per million tokens down to $10/$50, roughly 2.5x faster on top of it.

Those three changes are the same change. Anthropic just repositioned the unit of agentic work, and the pricing finally allows what the tooling implies.

What’s actually in 4.8

The model card is doing the usual benchmarks-go-up dance, but the prompting and runtime changes are where the day-to-day differences land.

The biggest is the move to adaptive reasoning only. MAX_THINKING_TOKENS is now ignored, and CLAUDE_CODE_DISABLE_ADAPTIVE_THINKING is gone. The model decides how hard to think. To pin behavior you use the new /effort slash command (or --effort flag) across low / medium / high / xhigh / max. Claude Code defaults to xhigh for coding work; claude.ai and Cowork default to high. For a one-off deep pass without raising the whole session’s effort, drop the literal word ultrathink into the prompt and that turn alone reasons harder.

Four new or refreshed slash commands round it out. /ultrareview runs a senior-engineer review pass over the diff. /simplify does a refinement pass on recently modified code. /focus hides intermediate work and shows only the final output. /fewer-permission-prompts scans the session and writes a safer allowlist into settings.json so the harness stops interrupting you on read-only bash and MCP calls.

The Max plan gets the 1M context window by default, and the Fast mode price cut to $10 in / $50 out per million tokens makes the difference between Fast and default Opus closer to a latency choice than a budget one. There is also a research-preview Auto mode behind Shift+Tab that auto-approves safe actions and pauses on risky ones, aimed at long-running tasks where you want to walk away.

None of that is revolutionary by itself. The change in posture is.

What Dynamic Workflows actually is

A Dynamic Workflow is a JavaScript orchestration script that the model writes for itself. The script does not call the model directly. It calls four primitives that the harness wires into the session: agent() spawns a subagent and returns its result, parallel() fans tasks out concurrently with a barrier, pipeline() streams items through multiple stages without barriers between them, and phase() groups subagent calls under a progress label.

Two activation modes ship with it. Explicit — you say “create a workflow to audit this codebase for X” and Claude designs and runs the script. Implicit — you flip the ultracode setting on and Claude evaluates every task as a workflow candidate, reaching for fan-out instead of single passes by default. Ultracode is off out of the box, and the docs are clear about why. It burns tokens fast.

The interesting part is what gets baked in. Schema validation through a structured-output tool means subagents return validated objects, not strings you have to parse. Workflows resume from a prior runId, so an edit to your script doesn’t re-run the agents that didn’t change. There is a 1,000-agent lifetime cap per workflow as a runaway-loop backstop. And the documented “quality patterns” — adversarial verify, judge panel, loop-until-dry, multi-modal sweep, completeness critic — show Anthropic’s own hand on what good fan-out looks like.

The most telling pattern is the adversarial verify. Spawn three independent skeptics per finding, prompt each to refute it, kill the finding if a majority succeed. Anthropic is not selling fan-out as more answers. They are selling it as more checks.
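The majority rule is simple enough to sketch. The skeptics below are hypothetical stub functions standing in for independent subagents prompted to refute a finding. Only the voting logic is the point.

```python
from typing import Callable, List

Skeptic = Callable[[str], bool]  # returns True if this skeptic refuted the finding

def survives(finding: str, skeptics: List[Skeptic]) -> bool:
    """A finding is killed only when a strict majority of skeptics refute it."""
    refutations = sum(1 for skeptic in skeptics if skeptic(finding))
    return refutations * 2 <= len(skeptics)

def adversarial_verify(findings: List[str], skeptics: List[Skeptic]) -> List[str]:
    """Keep only the findings that withstand the panel."""
    return [f for f in findings if survives(f, skeptics)]

# Toy skeptics with made-up heuristics; real ones would be separate model calls.
def skeptic_a(finding: str) -> bool:
    return "unverified" in finding

def skeptic_b(finding: str) -> bool:
    return len(finding.split()) < 2

def skeptic_c(finding: str) -> bool:
    return finding.startswith("unverified") or finding == "typo"

findings = ["null deref in parser", "unverified race in cache", "typo"]
kept = adversarial_verify(findings, [skeptic_a, skeptic_b, skeptic_c])
```

With three skeptics, two refutations kill a finding. Here only the parser finding draws no refutations, so it is the only one forwarded.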

This is the part that changes how you build.

The price math is the whole story

The orchestration-first pattern has been technically possible for a year. LangGraph wired it up. CrewAI wired it up. The reason almost nobody runs it as a default is the bill. Fifty parallel subagents on Opus 4.7 Fast ran at $30 in and $150 out per million tokens. A serious review pass on a real pull request would burn through a tank of compute and produce something a senior engineer could have written by hand in less time.

Opus 4.8 Fast is $10 in and $50 out. The same fan-out is now a third of the cost and 2.5x faster.
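A quick worked version of that arithmetic. The per-million prices are the ones quoted above. The token counts per agent are made-up round numbers for illustration, not measurements.

```python
def fanout_cost(
    agents: int,
    input_tokens_each: int,
    output_tokens_each: int,
    usd_per_m_in: float,
    usd_per_m_out: float,
) -> float:
    """Total dollars for a fan-out where every agent uses the same token budget."""
    total_in = agents * input_tokens_each
    total_out = agents * output_tokens_each
    return (total_in * usd_per_m_in + total_out * usd_per_m_out) / 1_000_000

# Hypothetical workload: 50 agents, 40k tokens in and 4k tokens out per agent.
old_cost = fanout_cost(50, 40_000, 4_000, 30, 150)  # $30 / $150 pricing
new_cost = fanout_cost(50, 40_000, 4_000, 10, 50)   # $10 / $50 pricing
ratio = old_cost / new_cost
```

Under those assumed token counts the pass costs $90 at the old price and $30 at the new one. Because both rates fell by the same factor, the 3x ratio holds for any input/output mix.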

That is the number that changes behavior. Iteration loops that used to be “run this once, pray it found the bug” become “run it three times with different angles and trust the intersection.” Discovery sweeps that used to ship as a single grep become five finders with different lenses, deduped at the union. Verifier panels that used to be cosplay become a default.
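The intersection and union moves map directly onto set operations. A small sketch, with invented finding labels standing in for whatever each run reports:

```python
from typing import List, Set

def trusted(runs: List[Set[str]]) -> Set[str]:
    """Findings that every run reported: the intersection."""
    return set.intersection(*runs) if runs else set()

def discovered(runs: List[Set[str]]) -> Set[str]:
    """Everything any finder reported, deduplicated: the union."""
    return set().union(*runs)

# Three passes over the same bug hunt, each from a different angle (made-up labels).
runs = [
    {"off-by-one in pager", "stale cache key"},
    {"off-by-one in pager", "unclosed file handle"},
    {"off-by-one in pager", "stale cache key"},
]
high_confidence = trusted(runs)
all_candidates = discovered(runs)
```

The intersection is what you act on without a second look. The union is the candidate list you hand to a verifier.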

Fast at $10/$50 makes the fifty-agent review pass look like a Tuesday, not a stunt.

What this looks like in practice

I have been running a version of this pattern on my own homelab since February. Eight named agents probe their own state in parallel, a separate evaluator scores their last forty-eight hours of output against a scope file, safe idempotent fixes auto-apply, and only the human-decision items surface for me. The expensive part was never the orchestration. It was the cost of getting eight independent passes to cohere before I trusted the verifier.

Anthropic’s own review-changes example shows the same shape with the rough edges sanded off. Dimensions fan out: bugs, performance, security, reuse, tests. Each dimension yields findings. Each finding is handed to a panel of independent skeptics whose prompt is “try to refute this.” A finding survives only if a majority of skeptics fail to refute it. It is the same trick a good engineering org runs at code review, ported into the model layer and budgeted in tokens instead of senior-engineer hours.

What’s oversold

Two honest caveats.

The new skill being priced is not the orchestration script. It is the verifier. A fan-out that finds fifty plausible bugs and forwards all fifty is worse than the single careful pass it replaced, because it shifts the verification burden onto a human who now has to triage noise instead of read code. The workflows post handwaves the verifier as a prompt. Most builders I know do not have a verifier discipline yet, and the tooling for one is thin.

And ultracode, the setting that makes Claude reach for a workflow on every task without being asked, ships off by default. Anthropic’s documentation flags why. Fan-out burns quota fast, and there is no governance layer that says “do not orchestrate this task.” The dispatch problem, deciding when a workflow is the right shape and when a single careful pass is, is unsolved. Right now it lives in your head.

Why this is worth watching anyway

Two reasons.

The agentic frameworks built between 2024 and now (LangGraph, CrewAI, Autogen) were filling the gap where the model vendor did not ship orchestration. That gap just closed. The bridge between “one model call” and “agentic system” is now a tool the model writes for itself, in JavaScript, with caching and resume baked in. Whatever those frameworks were going to charge for over the next eighteen months, the ceiling on that price just dropped.

And the orchestration-first pattern was already where AI engineering was heading. Evals are the new bottleneck because at some point the question stops being “is the model good enough” and starts being “how confident am I that this particular answer is right.” Workflows give you a vocabulary for spending compute on that second question instead of just the first. The vendor that ships the verifier primitive first sets the pattern everyone else copies.

If the price of careful goes down, the price of casual goes up. The default question stops being “is the model good enough yet.” It becomes “who is verifying.”


The Number That Predicts When Your Agent Will Break

Justin Johnson — Fri, 22 May 2026 12:12:22 GMT

A new paper asks a question that sounds simple and turns out to have teeth. When a frontier model fails at “reasoning,” what is it failing at?

Their answer is a number. They call it Relational Complexity, and it predicts model failure better than anything else they measured.

The paper, “Evaluating Relational Reasoning in LLMs with REL” from Fesser, Ektefaie, Fang, Kakade, and Zitnik, borrows a construct from cognitive science. Relational Complexity (RC) is the minimum number of entities a system has to hold in mind and bind together at once to take a single reasoning step. “A is taller than B” is RC=2. “A is between B and C” is RC=3. The number climbs as the relations get wider.

The finding is clean and a little grim. As RC goes up, accuracy falls off a cliff, and nothing the authors tried pulled it back.

What they measured

The clever part is the benchmark itself. REL is a generative framework, not a fixed test set. It produces as many problems as you want at any RC level, across three domains the authors deliberately picked to look nothing alike: pattern-completion puzzles (Raven’s matrices), phylogenetic trees in biology, and molecular isomers in chemistry.

Why three unrelated domains? Because that lets them hold everything else constant. Same vocabulary, same input length, same task format, only the RC dial moving. Most reasoning benchmarks can’t separate “the task is harder” from “the task has more words” or “the task is in an unfamiliar domain.” REL can.

Hold vocabulary, length, and format fixed, move only the relational complexity, and watch the accuracy curve bend. That is the whole experiment, and it is enough.

They ran it against Claude Opus 4.5, Gemini 3 Pro, and GPT 5.2.

The numbers

At low complexity, the models look great. The pattern puzzles at RC=1-2 land around 91% accuracy across all three models.

Then it collapses. Scale the matrices up, push RC to 6, and Claude and Gemini drop to roughly 12%. The biology task tells the same story: phylogenetic homoplasy detection runs at 35% with four taxa and falls to 1% at twenty-five taxa.

The authors then did the thing most benchmark papers skip. They ran a regression to check whether RC was actually the driver or just correlated with something else. With collinearity controls in place, RC explained 24 to 44% of the explainable variance. The next-strongest factor topped out at 17%.

It is not input length. It is not domain. It is the binding count.

And the interventions did almost nothing. Extra test-time compute bought 2 to 3%. In-context examples bought 3 to 6%. Tool use, handing the chemistry model RDKit so it could compute instead of reason, produced a mean recall of 0.094 that got worse as the problem grew.

The gap is structural. You don’t prompt your way out of it.

Why an agent builder should care

I have written before that evals are the new bottleneck and that agent failures cluster around a small set of repeatable mistakes. RC is the missing vocabulary for one whole class of those failures.

Think about what your agent does when it stalls on something that “should” be easy. A cross-document join where it has to reconcile three sources at once. A planning task with four interacting constraints. A loop where it has to hold the output of step two while reasoning about step five. Those are not long tasks or unfamiliar tasks. They are high-RC tasks. The model has to bind several interdependent things simultaneously, and that is the regime where frontier accuracy falls to a coin flip or worse.

When a task needs three or more interdependent variables held in mind at the same time, the failure is not a smarter-model problem. It is a binding problem, and more compute does not fix it.

This reframes the diagnostic. The next time an agent breaks on a task you expected it to handle, the useful question is not “is the model good enough yet.” It is “how many things does this step force the model to bind at once.” If the answer is four or more, you have your explanation, and the fix is architectural: decompose the binding into smaller steps with explicit intermediate state instead of waiting for a better model.
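One way to picture that decomposition, using the three-source join from earlier: fold the sources in one at a time, so each step compares only the running state with one new source. This is an illustrative sketch, not anything from the paper. `merge_pair` is a plain function standing in for a single model call, and the record fields are invented.

```python
from typing import Dict, List

Record = Dict[str, str]

def merge_pair(state: Record, source: Record) -> Record:
    """One small step: reconcile a single new source against the running state."""
    merged = dict(state)
    for key, value in source.items():
        if key in merged and merged[key] != value:
            # Keep the earlier value and write the disagreement down explicitly.
            merged["conflict:" + key] = merged[key] + " | " + value
        else:
            merged[key] = value
    return merged

def reconcile(sources: List[Record]) -> Record:
    """Fold sources in one at a time so no step handles all of them at once."""
    state: Record = {}
    for source in sources:
        state = merge_pair(state, source)  # explicit intermediate state
    return state

crm = {"owner": "Ana", "region": "EU"}
billing = {"owner": "Ana", "plan": "pro"}
support = {"region": "US", "tier": "gold"}
result = reconcile([crm, billing, support])
```

The intermediate `state` is what carries the work forward. Each step only has to relate two things, and conflicts land in the record instead of living in the model's head.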

What I’d hold back on

Two honest caveats, both the authors more or less own.

The tasks are stylized. Raven’s matrices and phylogenetic trees are clean lab instruments, and the jump from “RC in a synthetic tree” to “RC in your production workflow” is assumed, not proven. I would love to see RC mapped onto a naturalistic agent benchmark before treating the number as a planning constant.

And there is no human baseline. Every result frames frontier models as failing, but without a human RC-versus-accuracy curve we cannot tell whether people plateau at RC=5 or sail past it. That would settle whether this is “LLMs are uniquely bad at binding” or “binding is hard for everyone and LLMs are a bit worse.” Different stories, different implications.

The contribution here is not the scary 12%. It is the ruler.

For two years “complex reasoning” has been the phrase practitioners reach for when a model fails and they cannot say why. RC turns that shrug into a measurement.

The generative code is open on GitHub, so you can instantiate REL-style probes against your own agent tasks instead of guessing.

Watch for the human baseline and the naturalistic mapping. If those land, RC stops being a benchmark curiosity and becomes a number you check before you ship an agent into a high-binding workflow.

The models are not getting dumber. We are just learning to name the shape of where they break.


Neural Networks Don't Think in Straight Lines

Justin Johnson — Tue, 12 May 2026 14:45:36 GMT

A research team at Goodfire trained a tiny neural network to drive a virtual car up a hill. Then they looked inside the network to see how it represented the car’s position.

The answer was not where anyone expected.

Position didn’t live as a clean direction in the network’s internal space. It lived as a curve, threaded through the network’s neurons like a string. Every point on the string corresponded to a real-world position of the car.

When the team nudged the network along that curve, the car moved coherently. When they nudged it in a straight line across the curve, the way almost every modern interpretability tool does, the predictions broke. The car teleported. The simulation produced nonsense. The straight line wandered through regions of the network’s space that the model had never learned to handle.

Their new paper, “The World Inside Neural Networks”, argues this isn’t a quirk of one toy model. It’s how networks actually represent things.

The shape of the problem

Most of what we do to understand or steer large AI models assumes representations are straight. There’s a name for this assumption in the field, the linear representation hypothesis: concepts inside a model live as directions in the network’s internal space, and you can adjust the model’s behavior by moving along those directions.

You see this assumption everywhere. It’s how Anthropic built “Golden Gate Claude”, the version of its model that couldn’t stop talking about a bridge. It’s how researchers find “refusal directions” and “honesty vectors.” It’s how sparse autoencoders (SAEs), the dominant tool for naming what’s inside a model, try to break activity into a clean list of concepts.

Add. Subtract. All of it assumes flat geometry.

If the real structure is curved, every straight-line move is just an approximation along a tangent, and the further you push, the worse the approximation gets.
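A toy version of that tangent argument, using a unit circle as the stand-in for a curved representation. This is plain geometry, not the paper's method. It measures how far a straight-line step along the tangent lands from the curve as the step grows.

```python
import math
from typing import Tuple

Point = Tuple[float, float]

def on_circle(theta: float) -> Point:
    """A point on the unit circle, our stand-in for a curved representation."""
    return (math.cos(theta), math.sin(theta))

def arc_step(theta: float, delta: float) -> Point:
    """Move along the curve itself."""
    return on_circle(theta + delta)

def tangent_step(theta: float, delta: float) -> Point:
    """Move the same distance in a straight line along the tangent at theta."""
    x, y = on_circle(theta)
    return (x - delta * math.sin(theta), y + delta * math.cos(theta))

def off_curve_distance(point: Point) -> float:
    """How far a point sits from the unit circle."""
    return abs(math.hypot(point[0], point[1]) - 1.0)

steps = [0.1, 0.5, 1.0, 2.0]
tangent_errors = [off_curve_distance(tangent_step(0.0, d)) for d in steps]
arc_errors = [off_curve_distance(arc_step(0.0, d)) for d in steps]
```

The arc steps stay on the curve at every size. The tangent steps start out nearly indistinguishable and drift further off the circle as the step grows, which is the "works in narrow zones, falls apart at the edges" behavior in miniature.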

That would explain a lot of unsolved noise in the field. Why steering tricks work in narrow zones and fall apart at the edges. Why killing one “feature” inside a model often breaks something unrelated. Why so many “we found the X concept” papers don’t reproduce cleanly when somebody else tries.

The field has been working with the wrong shape and getting partial credit for the effort.

What they actually showed

The Mountain Car experiment is the centerpiece. It’s small, but the intervention proved the geometry: walk along the curve, the model behaves; cut across it, the model breaks. That’s the difference between geometry as decoration and geometry as cause.

The same lens shows up in their other work. Months and days form circles inside language models. Colors organize by hue and brightness. I walked through one of Goodfire’s biology pipelines a few weeks back, where the same techniques surface features in a DNA model that look like splice sites and regulatory regions. The curved-geometry view is becoming their signature.

The harder claim, and the more important one, is what they say about sparse autoencoders. SAEs are the bet Anthropic, OpenAI, and DeepMind have all made on how to read large models. Goodfire argues SAEs break continuous structure into disconnected pieces. Their example: words ending in “-ore” form one smooth curve in the model’s internal space, and SAEs shatter that curve into a handful of unrelated features. The unity disappears.

If that critique holds for big models, a meaningful slice of current AI safety research is studying artifacts of its own tools, not the model.

What’s oversold

The framing, “the world inside neural networks,” does more work than the evidence supports. The paper smuggles in a big claim, that models contain a faithful map of reality, which is hard to disprove because nobody knows what would count against it.

What Goodfire actually showed is narrower and more useful. Representations are curved. The curves are causal. Tools that assume straightness are leaving capability on the table. That’s enough. The cosmic framing is marketing.

Two real gaps the paper doesn’t address:

Does it scale? They show the geometry is causal for one toy model. Does the same picture hold for a 70-billion-parameter language model? Open question.
Is it the same picture across models? If different models trained on the same data find the same curves, the geometry is approximating something real about the world. If not, the curves are model artifacts and the philosophy crumbles.

Both questions are answerable. Neither is in the paper.

Why this is worth watching anyway

Two reasons.

One, it reframes the tooling debate. The interpretability community has been arguing about which kind of feature dictionary to build. Goodfire is asking whether a dictionary is even the right object. A map of curves wants different math, different methods, different papers.

Two, the parallel with biology is getting hard to dismiss. Grid cells, place cells, and head-direction cells in mammalian brains encode space as exactly the kind of curved structure Goodfire is finding inside artificial networks. The discovery of place cells and grid cells won the 2014 Nobel Prize in Physiology or Medicine. When evolved biology and trained silicon land on the same shape, the convergence is worth taking seriously.

A year ago I wrote that interpretability was the race we couldn’t afford to lose. Goodfire’s work is what running that race looks like when it goes well.

This is an important direction, half-marketed, and the next year of interpretability research will tell us whether the curved-geometry view replaces the feature-dictionary view or merges with it.

Watch the scaling question. Watch whether somebody bigger than Goodfire bets on this lens.

If they’re right, a lot of recent activation-steering work is about to age badly.
