Run Data Run

Sunday Deep Dive: Reckoning Is Not Judgment

Justin Johnson — Sun, 03 May 2026 10:40:16 GMT

Every Sunday I pick one paper or release that’s worth your time, break it apart, and tell you why it matters. No hype. No summaries of summaries. Just the idea, explained.

The Headline

On April 29, Anthropic published BioMysteryBench, a 99-question bioinformatics evaluation built with domain experts. Claude Opus 4.6 matched expert baselines on the routine work. Their unreleased “super model,” which Anthropic refers to as Mythos Preview, occasionally solved problems an expert panel could not. Three weeks earlier, on April 9, Surag Nair and the Genentech computational biology team had published a parallel benchmark, CompBioBench, with the same instinct.

This is the first benchmark cluster I’ve seen that grades bioinformatics the way bioinformatics is actually done. The methodology is the story. The numbers are interesting. The honest reading of what the numbers mean is more interesting still.

The Problem Bioinformatics Poses to Benchmarks

Bioinformatics is a brutal benchmark target.

Most AI evaluations want a single right answer with a clean grading rubric. Bioinformatics workflows almost never look like that. A scRNA-seq analysis can run through Seurat, Scanpy, or a custom pipeline and produce three slightly different cluster assignments, all of them defensible. A variant-calling pipeline can use BWA-MEM, Bowtie2, or minimap2, with GATK, DeepVariant, or Strelka2 downstream. The right answer is not “the cluster” or “the variant.” It’s the experimental finding the data points to.

Bioinformatics work is method-plural and answer-singular. Most benchmarks invert that and reward method conformity instead of biological truth.

Existing science benchmarks dodge this by asking textbook questions. GPQA tests graduate-level multiple choice. HumanEval-bio tests function completion. Both measure recall. Neither measures the messy thing a working bioinformatician actually does on Tuesday morning, which is take a CSV nobody documented, a PI who wants an answer by Friday, and a public dataset that may or may not be the one cited in the methods section.

That’s the gap BMB and CompBioBench are trying to close. It’s a hard problem and they are genuinely trying to move the field.

What BioMysteryBench Did Right

Three design choices set BMB apart from the GPQA-style benchmarks AI labs usually run.

Method-agnostic evaluation. The model gets unrestricted tool access. It can hit NCBI, Ensembl, GEO, download whatever it needs. Anthropic does not score the path the model took. They score whether the answer it landed on matches the ground truth. That single decision carries the rest. You cannot grade a bioinformatician on which aligner they chose, only on whether the answer came out right.

Experimental ground truth, not researcher claims. This is the move that matters most. The right answer for each question is anchored to a verifiable experimental finding, not the conclusion the original paper drew. If a paper claimed gene X drove phenotype Y, but the underlying knockout data showed gene Z, the ground truth is Z. That sidesteps the worst failure mode of literature-based benchmarks, which is grading the model on whether it can recover the human’s interpretation rather than the biology.

Superhuman question generation. Twenty-three of the 99 questions were intentionally beyond the human expert panel. Most benchmarks cap at human ceiling because that’s where the labelers stop. BMB broke that ceiling by using validation notebooks that confirm a signal is in the data without requiring a human to solve the problem first.

The benchmark spans WGS, scRNA-seq, ChIP-seq, metagenomics, proteomics, and metabolomics. Real assays, real noise, real ambiguity. Genentech’s CompBioBench, which actually shipped first on April 9, runs a parallel methodology over a different 100-task spread and arrives at directionally similar numbers. Anthropic explicitly acknowledged it in the BMB write-up. Two industry teams, working in parallel, converged on the same shape of benchmark in the same window. That convergence matters more than either result on its own. The field is doing the work to keep itself honest.

New Doesn’t Mean Better

Here is the line that should give any AI-for-science leader pause.

On April 28, Surag Nair posted an update on CompBioBench that most people scrolled past. Anthropic’s newer Opus 4.7 slightly underperforms Opus 4.6 on CompBioBench. Not a dramatic regression, but a real one, on a domain-specific eval that didn’t exist a month earlier.

This is exactly why benchmarks matter, and exactly why we need more of them. A model card tells you the trajectory is up and to the right on the aggregate evals. A domain-specific benchmark tells you that on this particular slice of biology, the newer model is slightly worse. Both can be true, because frontier models are trained on overlapping but distinct objectives, and per-domain capability moves unevenly across releases.

New doesn’t mean better. If you are running a computational biology team and you upgrade the API call in your pipeline the day a new model ships, you may be regressing on the work that matters most to you and you would not know it without a benchmark like this one.

This is not a knock on Anthropic. They published the data that lets you see the regression. It’s a knock on the assumption that bigger numbers in the model card translate to better answers in your domain. The cure is exactly what BMB and CompBioBench are doing: domain-specific evaluation that moves at the speed of model releases.
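One way to act on this is a gate in the upgrade path: score both model versions on your own domain eval and hold the swap if the candidate comes out lower. A minimal sketch follows. The function name, the tolerance, and every score in it are placeholders for illustration, not published numbers.

```python
def find_regressions(current: dict[str, float], candidate: dict[str, float],
                     tolerance: float = 0.0) -> dict[str, float]:
    """Return {eval_name: score_drop} for every eval where the candidate is worse.

    Both inputs map an eval name to a score in [0, 1]. Only evals present
    in both dictionaries are compared.
    """
    drops = {}
    for name, old_score in current.items():
        if name not in candidate:
            continue
        drop = old_score - candidate[name]
        if drop > tolerance:
            drops[name] = drop
    return drops


# Placeholder scores for illustration only.
current_model = {"aggregate_suite": 0.80, "in_house_compbio_eval": 0.70}
candidate_model = {"aggregate_suite": 0.85, "in_house_compbio_eval": 0.66}

regressions = find_regressions(current_model, candidate_model, tolerance=0.01)
if regressions:
    print("Hold the upgrade:", sorted(regressions))
```

The aggregate number goes up while the domain number goes down, which is the situation described above: both facts are true at once, and only the second one blocks the upgrade.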

The Numbers

Three numbers carry the rest of the story. They are worth pausing on individually because they measure different things.

86% is Opus 4.6’s accuracy on the 76 human-solvable questions, scored at four out of five attempts. That tier is decision-grade. Cell-type identification, gene-knockout detection, pathway inference on standard data. A generally available frontier model handles it with the consistency of a competent postdoc.

94% is Mythos Preview’s reliability on the same routine tier. The next-generation model improves on the routine work, which is what you would expect.

30% is Mythos Preview’s accuracy on the 23 problems the expert panel could not solve. That number is the one that ran on AI Twitter, and it is interesting. It is the first time I have seen a credible claim that a frontier model solved real scientific problems beyond what a panel of working scientists could solve.

The number Anthropic put quietly in the consistency section is the one that changes the deployment calculus.

44% is the rate at which Mythos Preview’s wins on the human-difficult tier replicated across multiple attempts.

When the model solved a hard problem, it reproduced that solve less than half the time. The wins count. They are also brittle. Roughly six out of ten times the model arrived at the right answer once and could not reliably get there again. Anthropic’s own framing for this is that the model “stumbles onto” these answers. That is a careful word. It admits the model is occasionally getting the right answer for reasons even Anthropic cannot fully reconstruct.

Brittle wins are not the same as no wins. They are also not capability you can deploy in the path of a research decision.

Routine bioinformatics tasks are decision-grade. Hard problems are tantalizing and brittle. The replication rate is the deployment-relevant number.
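The "four out of five attempts" rule behind the 86% figure can be read as a reliability threshold on final answers. The sketch below is one reading of that rule, not Anthropic's grader: the function names, the string-match comparison, and the sample answers are all hypothetical.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so formatting differences do not count as misses."""
    return " ".join(answer.lower().split())


def solved_reliably(attempts: list[str], ground_truth: str, required: int = 4) -> bool:
    """Return True when at least `required` attempts match the ground truth.

    Only final answers are compared. The path each attempt took is never inspected,
    which is the method-agnostic part.
    """
    target = normalize(ground_truth)
    hits = sum(normalize(a) == target for a in attempts)
    return hits >= required


def tier_accuracy(per_question: list[bool]) -> float:
    """Fraction of questions in a tier that cleared the reliability bar."""
    return sum(per_question) / len(per_question)


# Hypothetical attempts at one question: four agree with the ground truth, one does not.
attempts = ["TP53 knockout", "tp53  knockout", "TP53 knockout", "KRAS knockout", "TP53 knockout"]
print(solved_reliably(attempts, "TP53 knockout"))
```

Under a rule like this, a model that lands the right answer once in five tries scores zero on that question, which is why a single lucky solve and a replicable solve end up in different columns.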

Reckoning Is Not Judgment

Melanie Mitchell wrote the cleanest philosophical anchor for what BMB is and isn’t measuring. In a February essay she drew a distinction between reckoning and judgment.

Reckoning is calculative prowess. Judgment is a form of dispassionate deliberative thought, grounded in ethical commitment and responsible action.

Reckoning is the thing AI systems excel at. Pattern matching, retrieval, inference over large corpora. Judgment is knowing which question to ask. Knowing which answer to trust. Knowing when to stop.

BMB is a reckoning benchmark. It measures whether a model can take a CSV, hit the right databases, run the right inference, and land on the experimental finding the data supports. That’s calculative retrieval at scale, and frontier models are now genuinely good at it. The 86% number says so.

What BMB does not measure is judgment. It does not test whether the model recognizes the question is malformed. It does not test whether the model knows the public dataset has six samples mislabeled, which it does, because everyone who works with public datasets knows that one. It does not test whether the model understands why the experiment was designed this way and whether the original design can answer the question being asked.

The bimodality in the BMB numbers maps onto Mitchell’s distinction almost perfectly. The 86% tier is reckoning. The 44% replication on hard problems is what happens when reckoning runs out and judgment is what’s needed. The model occasionally lucks into a judgment-shaped answer through reckoning machinery, and roughly six times out of ten it cannot find that answer again because the machinery never had judgment in it to begin with.

This is the frame that will outlive the benchmark. Whatever the next model scores on the next eval, the question is the same. How much of this is reckoning, and how much is judgment, and which one does the work actually require?

After Melanie Mitchell, Feb 2026. Bioinformatics work happens on both sides of the line. Benchmarks only grade the left.

This is where the free preview ends. Below the fold: what changes in your lab next week, what to watch over the next 90 days, and the bigger pattern this release fits into.

Two Gaps, Not One

Justin Johnson — Thu, 30 Apr 2026 11:03:02 GMT

Nilay Patel ran a piece on Decoder this week titled:

“The people do not yearn for automation.”

It’s good. He names a way of seeing the world he calls “software brain”: viewing everything as databases and loops you can run with code. He argues AI has turbocharged that mindset, and that the rest of the country is reacting to it the way you’d expect.

With a hard no.

Then, partway through, he reaches for an example of what software-brained people actually do with their days. He says they pay thousands of dollars a month to set up swarms of OpenClaw agents.

That’s me. I run OpenClaw.

So before anything else: yes. I see opportunities for automation. I write thousands of lines of code. I sit at a laptop and tell agents what to build, and a lot of the time they build it. Patel describes the type accurately.

The type is also relevant for a different reason than the one he’s writing about, and that difference is the entire point of this post.

I’m going to argue Patel is right about the mood, right about the cultural backlash, and pointing at the wrong gap for any executive trying to figure out what to do this quarter.

What Patel gets right

Software brain is a real thing and the cultural rejection of it is a real thing.

The smart-home anecdote is the one I keep coming back to. Apple, Google, and Amazon have spent more than a decade and many billions of dollars trying to make ordinary people care about home automation. Most ordinary people still don’t. They will buy a smart bulb and forget about it. They do not want to instrument their lives.

The polling tells the same story. AI’s favorability is below ICE in some polls. Gen Z’s hopefulness about AI dropped from a bad number last year to a worse one this year. Anger is up. The political violence around data centers is real and ugly and should embarrass anyone in this industry who thinks better marketing fixes it. Patel’s flattening line is the one that does the work:

“That’s why people hate AI. It flattens them.”

He’s also right that the tech industry’s “we just need to tell our story better” answer is delusional. People are using these tools every day. ChatGPT has nine hundred million weekly users. They know what it feels like.

You cannot advertise people out of their own experience.

The piece engages most generously when Patel paraphrases Ezra Klein on Silicon Valley AI types racing to make themselves “legible to the AI.” Feeding the model their files, calendar, email, messages, building persistent memory of their preferences. Patel calls that a doomed ask of regular people, and he’s right. Regular people will not flatten themselves into a database to please an LLM.

That is not the audience the book I’m writing is for.

The other gap

There is a second gap running underneath the cultural one, and Patel’s piece makes it harder to see, not easier.

Andrej Karpathy named it on April 9 of this year in a tweet that got roughly twenty thousand likes by the end of the week. Two groups, he said, speaking past each other about AI. Not skeptics versus believers. Not chatbots versus AGI.

People who have built something with agentic systems on one side. People who have read about them, used the free tier, or watched a demo on the other.

The reply thread surfaced a third group hiding inside the second. Doodlestein called them “people magnifying power with custom tooling, skills, workflows, swarms.” Another reply landed harder: most people in Karpathy’s second group are leaving eighty percent of the capability on the table without knowing it.

Patel and Karpathy are both right. They are also describing different gaps.

Patel’s gap: software brain versus everyone else. Karpathy’s gap: built with agents versus hasn’t.

These are not the same line.

Why they look like the same gap

Both have excited tech people on one side. Both have a population on the other side that finds AI underwhelming or hostile. The merge is easy. The merge is also the trap, because Patel’s piece makes the merge feel responsible.

Here is the trap as a sequence:

Read the Decoder piece.
Conclude AI is mostly hype because regular people don’t like it.
Skip the personal-build step yourself.
Approve the twenty-million-dollar platform contract someone else recommended.
Six months later you are in the majority McKinsey finds using AI without capturing meaningful value from it, looking for someone to blame.

The cultural gap and the build gap can both be real. One of them can still be the one that decides which side of the next five years your company lands on.

The cultural gap is a decade. The build gap is six months.

Side by side

The cultural gap is real and probably deepening. The build gap is also real and decides whether your org ships agentic systems to production or stalls.

You can be right about the first and wrong about the second at the same time.

Most leaders currently are.

The executive failure mode

“Regular people don’t like AI, therefore AI is overhyped, therefore I don’t have to build.”

That is the most expensive sentence in enterprise AI right now.

McKinsey’s 2025 State of AI says eighty-eight percent of organizations report using AI and thirty-nine percent report capturing meaningful value from it. Gartner says thirty to forty percent of agentic-AI proofs-of-concept get cancelled. An OutSystems report this April, on a survey of nineteen hundred IT leaders, found ninety-four percent worried about agent sprawl across fragmented enterprise systems.

None of those numbers are stories about AI being overhyped. They are stories about leaders trying to deploy what they’ve never personally operated.

The build gap shows up as the failure rate.

Even Gary Marcus, the canonical LLM skeptic, conceded in an April Substack post that Claude Code is the single biggest advance in AI since the LLM, and that it is, quote, not a pure LLM.

Hostile witness. The model is no longer the question. The thing built around the model is.

What the build gap looks like from inside

One paragraph, not a tour.

Karpathy runs an autoresearch loop while he sleeps and reads the output in the morning. Reddit r/ClaudeAI has a twelve-thousand-upvote thread of operators trading folk-culture optimizations they call “caveman tokens.” The word harness hardened into a noun in mainstream developer discourse this quarter. Doodlestein’s third group is real and it is bigger every month: people running custom tooling, skills, agents, swarms.

Yes, those are software-brain people. Patel is right about that. They are also the people whose calibration on what AI can do is correct, because they hold the data the rest of the debate is being conducted without.

If you are a senior leader, your job is not to become one of them. Your job is to know what one of them sees, well enough to direct the work and call bullshit when someone tries to sell you a slide deck.

What Patel is missing about leaders

Patel is writing about consumers and citizens. Both are downstream of policy, mood, and the social contract. Fair game for the cultural critique.

The reader I am writing for is not a consumer or a citizen in this context. They are an executive who has to allocate a budget on Tuesday.

They cannot wait for the cultural gap to close, because the cultural gap is going to be ugly for a decade.

They also cannot reason their way past the build gap, because the build gap is reality-shaped, not narrative-shaped. The failure rate does not care how anyone feels about software brain. It cares whether anyone senior in the org has personally driven a harness and shipped something with it.

This is the move Patel doesn’t make and shouldn’t be expected to make.

He is diagnosing the mood. The book diagnoses the move.

The book in two sentences

Builder-Leader: The AI Exoskeleton That Crosses the Gap is for executives who want to be on the right side of the build gap by Q3.

It does not require you to become an engineer, abandon your polish, or volunteer to be flattened into a database. It requires you to direct a harness, every day for six months, until you can tell what good looks like from the inside.

This is my pretty quiet, nonchalant attempt at an announcement. Louder ones are forthcoming.

Preorder Builder-Leader on the book site.

Close

Patel is going to keep being right about the mood. AI will probably get less popular before it gets more.

The cultural gap is a decade.

The build gap is six months.

It is not subject to a vote. It does not move on a news cycle. It is the difference between leaders who can read an AI strategy proposal and tell whether it is real, and leaders who can’t.

By 2027 there will not be many of the second kind left in senior roles at companies that survive the decade.

Patel diagnosed the mood. The book diagnoses the move.

You can be right about both at once. You’d better be.

Sunday Deep Dive: The Specialists Are Coming for the Generalists

Justin Johnson — Sun, 26 Apr 2026 12:17:19 GMT

Every Sunday, I pick one paper or release that’s genuinely worth your time, break it apart, and tell you why it matters. No hype. No summaries of summaries. Just the idea, explained.

The Headline

The loud story this quarter is that an open-weights Chinese model beat Claude Opus on a real coding benchmark. Every tech newsletter ran it.

The quiet story is that the specialists are coming for the generalists, and they’re small enough to run on your laptop.

Hamel Husain, who has trained more applied LLM engineers than anyone I know, put the case in one sentence:

“Open models aren’t always better, but the more narrow your task, the more open models will shine because you can fine tune that model and really differentiate them.”

This week’s deep dive is about the quiet story.

A Quick DeepSeek Refresher

Set the table briefly, because the rest of the post depends on it.

January 2025. DeepSeek-R1 ships. Reasoning model from a Chinese lab matching OpenAI’s o1, open weights, training cost an order of magnitude lower than the closed labs had implied was possible. A week later NVIDIA’s stock dropped, the market panicked, and the financial story made the front page.

The financial story was the wrong story.

The real story was the permission slip. Every other open-weights lab, Qwen, GLM, MiniMax, Mistral, the Llama group, took the gloves off. By April 2026, that permission slip is showing up everywhere. And the strangest thing about it is that the most interesting consequence isn’t at the frontier.

The Loud Story (Quick Flyby)

The benchmark headline is real. Z.ai’s GLM-5.1 beats GPT-5.4 and Claude Opus 4.6 on SWE-Bench Pro, the most-cited real-world coding benchmark in the field.

That is, by any reasonable measure, “open caught up to closed at the frontier on coding.” It’s a categorical change from where we were a year ago.

And it isn’t a one-off. The release cadence over the two weeks I spent finishing this post:

April 7 — GLM-5.1 lands the SWE-Bench Pro number above.
April 12 — MiniMax M2.7 drops open weights on HuggingFace. 229B MoE, 56% on SWE-Bench Pro.
April 21 — Kimi K2.6 ships GA with 12-hour autonomous coding sessions and 300-agent swarms.
April 22 — Qwen 3.6-27B, a dense 27B model, beats the previous-generation 397B MoE on coding benchmarks.
April 24 — DeepSeek V4 preview drops in two sizes with explicit Claude Code integration. I’ve been running V4 in my own coding agent as a Claude swap-in on real tasks. Results hold up.

Five frontier-grade open releases in under three weeks. If you only read the GLM headline, you missed the cadence.

The interesting story isn’t even at the top of the leaderboard. It’s one tier down.

Meet the Small Models

Two anchors. Pay attention to the second number.

Gemma 4 26B (Google, April 2026). 26 billion parameters total, but only 3.8 billion active per query. Runs on a 16GB consumer GPU, an Apple Silicon Mac with 32GB of RAM, or natively on an iPhone offline (the smaller E2B variant). It got six separate top-300 Hacker News threads in two weeks. One Reddit operator wrote: “Gemma 4 just casually destroyed every model on our leaderboard except Opus 4.6 and GPT-5.2. 31B params, $0.20/run.”

Qwen 3.6 35B-A3B (Alibaba, April 2026). 35 billion parameters total, 3 billion active per query. Apache 2.0 license. The release HN thread, with 1,263 points, was titled: “Qwen3.6-35B-A3B on my laptop drew a better pelican than Claude Opus 4.7.”

The pelican test is a Simon Willison thing. He gives every new model the same prompt: “draw a pelican riding a bicycle, in SVG.” It’s been a useful informal benchmark for the gap between hype and capability. A 35B-total, 3B-active model running on someone’s laptop drawing a better pelican than the trillion-dollar closed-API offering is the kind of moment people remember.

Neither of these models is trying to be the frontier. They’re not chasing GLM-5.1 on SWE-Bench Pro. They’re the new floor.

Why They’re Small (the Architecture)

The number that matters in both descriptions above is “active parameters.” Total params is the file size on disk. Active params is the cost per query. They’re radically different now, and the reason is an architecture called Mixture of Experts.

Here’s the analogy.

Imagine a hospital with 100 specialists on staff. A surgeon. An anesthesiologist. A cardiologist. A radiologist. Ninety-six others. Most patients only ever see three of them on a given day. The surgeon for the operation, the anesthesiologist to put them under, the radiologist to read the scan. The other 97 don’t show up.

The hospital is “100 doctors big.” But the cost-per-patient is “3 doctors big.”

That’s MoE. Mixture of Experts. The model has a lot of total parameters. But on any given prompt, the network routes through only a small fraction of them. You get the breadth that comes from training a larger network without paying the per-token cost of running it.
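A toy sketch of top-k routing makes the hospital arithmetic concrete. This is not the router of Gemma, Qwen, or any specific model (real models route per token and per layer, with learned scores). The expert count, the scores, and the helper names are illustrative assumptions. The parameter totals in the last lines are the ones quoted above.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and renormalize their weights to sum to 1.

    Only these experts run for the input; the rest stay idle.
    """
    probs = softmax(router_scores)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    return {i: probs[i] / mass for i in chosen}


def active_fraction(active_params_b: float, total_params_b: float) -> float:
    """Share of the stored weights that compute on a given input."""
    return active_params_b / total_params_b


# 100 "specialists", 3 consulted per "patient" (made-up router scores).
scores = [0.0] * 100
scores[7], scores[42], scores[63] = 2.0, 3.0, 1.5
weights = route_top_k(scores, k=3)
print(sorted(weights))  # indices of the experts that ran

# Figures quoted in this post: 3.8B active of 26B, and 3B active of 35B.
print(f"{active_fraction(3.8, 26):.1%}")  # 14.6%
print(f"{active_fraction(3.0, 35):.1%}")  # 8.6%
```

The disk footprint tracks the denominator. The per-query compute tracks the numerator.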

Quick glossary

MoE (Mixture of Experts). Model architecture where only a subset of parameters activate per input. Not new (research goes back to the 1990s), but only recently practical at frontier scale.
Active parameters. The ones that actually compute on a given prompt. The cost number that matters.
Total parameters. All the weights stored on disk. The file-size number.
Quantization. Rounding model weights to lower-precision numbers (e.g. 16-bit to 4-bit) to fit them in less memory. TurboQuant, the Google paper from March 2026, was the breakthrough on this for inference-time KV caches. (I covered it in a previous Sunday Deep Dive.)
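As a rough illustration of the quantization entry, here is plain symmetric round-to-nearest 4-bit quantization of a short weight list, plus the memory arithmetic. This is the textbook scheme, not TurboQuant. The function names and the sample weights are made up for the example.

```python
def quantize_int4(weights: list[float]) -> tuple[list[int], float]:
    """Map floats to integers in [-7, 7] using one shared scale (symmetric quantization)."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 7 if max_abs > 0 else 1.0
    codes = [max(-7, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes: list[int], scale: float) -> list[float]:
    """Recover approximate floats from the integer codes."""
    return [c * scale for c in codes]


def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Memory needed for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


weights = [0.875, -0.5, 0.3, 0.0, -0.875]  # made-up values
codes, scale = quantize_int4(weights)
restored = dequantize(codes, scale)
worst_error = max(abs(a - b) for a, b in zip(weights, restored))

print(codes)                                # [7, -4, 2, 0, -7]
print(worst_error <= scale / 2)             # rounding error is bounded by half a step
print(weight_memory_gb(26e9, 16), weight_memory_gb(26e9, 4))  # 52.0 13.0
```

Going from 16 bits to 4 bits cuts the weight footprint by a factor of four. For a 26-billion-parameter model that is 52 GB down to 13 GB before any runtime overhead, which would be consistent with the 16GB GPU figure above, though the post does not say what precision that claim assumes.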

This is the structural answer to the obvious question: how does a 3B-active model match Claude Opus on real tasks?

It doesn’t. It matches Claude Opus on the slice the model was tuned for. That’s the bridge to the next section.

What They’re Actually Good At

The honest version of the small-model story is task-specific.

Coding (in-domain). Qwen 3.6 35B-A3B beats much larger models on SWE-Bench Pro. The pelican test is the cocktail-party version; the SWE-Bench number is the engineer-meeting version.

On-device and offline. Gemma 4 26B running natively on an iPhone with no API calls and no monthly fees is the privacy story. For applications where data cannot leave the device (regulated industries, personal assistants, anything HIPAA-adjacent), this is no longer aspirational.

Long-context retrieval. TurboQuant compression makes 128K context windows manageable on consumer GPUs. Reading whole codebases, whole legal briefs, whole patient charts on a workstation is now possible without renting cloud time.

Multimodal vision-language. Qwen 3.6 matches Claude Sonnet 4.5 on vision-language tasks despite being roughly one-tenth the size.

What they are not good at, because this is where the post earns trust:

Long-horizon agentic reliability. A 50-step coding task where the model has to maintain context, recover from errors, and not silently give up. Closed frontier models still lead. The same Ahmad Osman thread that broke the GLM-5.1 SWE-Bench Pro number also flagged that GLM-5.1’s 1,700-step autonomous run claim requires verification loops you have to build yourself. Not plug-and-play.
Voice-sensitive long-form writing. The essay you’re reading was drafted by Claude Opus, not Qwen 3.6. The taste-and-rhythm gap on long-form prose is real and won’t close fast.
Adversarial robustness. When inputs are hostile (prompt injection, weird user behavior, adversarial test data), closed labs have invested more in the failure modes.

But notice what’s on each list. The “not good at” list is exactly the cases where you’d route to a generalist anyway. For everything else, the specialization argument starts to look obvious.

The Real Story: Domain Specialists

This is where the post stops being about chatbot LLMs and gets to the real architecture of the next year.

Evo 2 is not a chatbot. It doesn’t talk. It understands DNA, specifically 8,000-letter genomic windows, the building blocks of every cancer mutation in the public ClinVar database. Open weights. The 7B variant runs on a single workstation GPU.

Earlier this week I showed what happens when you actually point Evo 2 at a real clinical problem. Six cancer genes, 4,471 variants, one workstation in a closet, one weekend, and the model beats AlphaMissense, the specialist tool clinicians actually use, on coding variants. And it extends into noncoding territory where AlphaMissense produces no score at all.

The point isn’t that genomics is special. The point is that this pattern, open weights plus workstation hardware plus a domain the model was specifically trained for, is repeating across every field. MedGemma for biomedical text. DeepSeek-Coder for code. AlphaFold 3 for protein structure. The specialists are showing up faster than the generalists can absorb their territory.

Which raises the operator question: how do you actually use them?

The Specialist Is Now You

Justin Johnson — Tue, 21 Apr 2026 11:02:07 GMT

I reproduced Goodfire’s mechanistic variant-effect pipeline on cancer genes over a weekend. One box. Open weights. The payoff shows up whether or not the clinic is ready for it.

A BRCA1 variant lands in front of a clinician. The lab report says “variant of uncertain significance.” The oncologist looks at it, the genetic counselor looks at it, nobody can act. About thirty percent of oncogene variants in ClinVar (the public catalog of human genetic variants and their clinical labels) carry that same shrug. Patient leaves the appointment with no actionable call, no mechanism, no next step.

BRCA1 is a tumor suppressor. When it breaks, inherited breast and ovarian cancer risk goes up. A pathogenic variant calls for surveillance, sometimes surgery. A benign variant calls for a reassuring conversation. A VUS (variant of uncertain significance) calls for neither. The default is to wait until more families with the same variant get sequenced and the label firms up. Decades, in some cases.

This is the gap AlphaMissense partially fills. Google DeepMind trained it on missense variation (single-letter mutations that change the protein), and on that slice it is near the ceiling of what current data allows. But AlphaMissense is silent on everything that isn’t missense. Noncoding variants. Splice regions. Untranslated regions. Promoters. Synonymous changes (letter swaps that don’t change the protein but break splicing anyway). Insertions and deletions. Most of the interesting VUS space.

In March 2026, Goodfire shipped EVEE, a pipeline that scores every ClinVar variant, 4.2 million of them, and doesn’t just produce a pathogenicity score. It produces a disruption profile. Splice site broken. Regulatory element disrupted. Protein domain fold affected. Actionable explanations.

The catch was infrastructure. EVEE ran on Evo 2 40B (Arc Institute’s 40-billion-parameter DNA foundation model) on top-end data-center GPUs, with proprietary interpretability tooling and a team. A senior clinical geneticist at an academic cancer center couldn’t have reproduced that paper over a weekend. They’d need months of procurement and a six-figure budget, minimum.

I wanted to see what the approach looks like when you strip the proprietary layer out and run it on the kind of box a motivated lab could afford.

A year ago, mechanistic variant interpretation meant buying into somebody else’s stack. This year it is a workstation problem.

What Changed This Year

Three things quietly flipped, and together they make the story different.

Open foundation-scale DNA models. Evo 2 7B (the 7-billion-parameter sibling of the one Goodfire used) is on HuggingFace. Weights there, paper there, training recipe described. The smaller model is strong enough on variant effect out of the box, with no fine-tuning, to hold its own against specialized tools.

Open interpretability artifacts. Goodfire released a sparse autoencoder trained on Evo 2’s layer 26 (a specific layer deep inside the network), 32,768 features, public on HuggingFace. That’s the thing that turns a dense model activation into a sparse dictionary of “concepts the model learned about DNA.” Without it you’re guessing which channels matter. With it, you’re reading the model’s own internal vocabulary.

Consumer-adjacent GPU memory. The NVIDIA GB10 has 128 GB of unified memory. That’s enough to hold Evo 2 7B alongside 4,471 variant windows and a sparse autoencoder. The shift isn’t that data-center chips exist. It’s that “enough memory to do meaningful genomics interpretability” is no longer a facility-level decision.

Three years ago, any one of those three would have been the story. Today they compose. That’s the point.
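To make "sparse dictionary" concrete, here is a toy sparse autoencoder forward pass: a dense activation goes in, a wider and mostly-zero feature vector comes out, and the decoder rebuilds the activation from the few features that fired. This is a generic ReLU autoencoder with hand-picked weights, not the architecture or weights Goodfire released. The sizes (2 dimensions, 4 features) and function names are illustrative assumptions.

```python
def relu(x: float) -> float:
    """Keep positive responses, zero out the rest."""
    return x if x > 0 else 0.0


def sae_encode(activation, enc_weights, enc_bias):
    """Project a dense activation into a wider feature space and keep only positive responses."""
    return [relu(sum(w * a for w, a in zip(row, activation)) + b)
            for row, b in zip(enc_weights, enc_bias)]


def sae_decode(features, dec_weights, dec_bias):
    """Rebuild the dense activation as a weighted sum of feature directions."""
    dim = len(dec_bias)
    return [dec_bias[j] + sum(f * dec_weights[i][j] for i, f in enumerate(features))
            for j in range(dim)]


# Four hand-picked feature directions over a 2-dimensional activation.
directions = [[1, 0], [-1, 0], [0, 1], [0, -1]]
activation = [0.5, -2.0]

features = sae_encode(activation, directions, [0.0] * 4)
active = [i for i, f in enumerate(features) if f > 0]
rebuilt = sae_decode(features, directions, [0.0, 0.0])

print(features)  # [0.5, 0.0, 0.0, 2.0]
print(active)    # [0, 3]
print(rebuilt)   # [0.5, -2.0]
```

Only two of the four features fire, and those two are enough to rebuild the input. Scale the same idea to 4,096 dimensions and 32,768 features and the active feature indices become the "vocabulary" you read off.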

What I Built

Six cancer genes. Two hereditary tumor suppressors (BRCA1, BRCA2). One pan-cancer tumor suppressor (TP53, the most-mutated gene in human cancer). Three oncogenes (KRAS, PIK3CA, EGFR) covering colorectal, lung, and breast signaling. Each picked because ClinVar has dense coverage and the clinical context is well-characterized.

For each variant I pulled an 8 kilobase genomic window (8,192 letters) centered on the position. That’s the reference version. Then I swapped in the mutant letter to get the patient-DNA version. Two runs through Evo 2 7B per variant, tap the activations at layer 26, save to disk. Five hundred and fifty-nine gigabytes of model activations, cached.
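The window-and-swap step can be sketched in a few lines. This is a minimal version under stated assumptions: it handles single-letter substitutions only, uses 0-based positions, shifts windows inward at sequence ends, and takes 8,192 as the window width. The function names and the toy sequence are hypothetical, and the actual pipeline may differ on all of these.

```python
def centered_window(chrom_seq: str, pos: int, width: int = 8192) -> tuple[str, int]:
    """Return (window, offset): `width` letters around 0-based `pos`,
    and the variant's index inside that window.

    Windows that would run off either end of the sequence are shifted inward.
    """
    if len(chrom_seq) < width:
        raise ValueError("sequence shorter than window")
    start = pos - width // 2
    start = max(0, min(start, len(chrom_seq) - width))
    return chrom_seq[start:start + width], pos - start


def apply_snv(window: str, offset: int, ref: str, alt: str) -> str:
    """Swap one reference letter for the alternate letter, checking the reference matches."""
    if window[offset] != ref:
        raise ValueError(f"expected {ref!r} at offset {offset}, found {window[offset]!r}")
    return window[:offset] + alt + window[offset + 1:]


# Toy stand-in for a chromosome; a real run would read the reference genome.
toy_chrom = "ACGT" * 4096
ref_window, offset = centered_window(toy_chrom, pos=10000)
alt_window = apply_snv(ref_window, offset, ref="A", alt="G")

print(len(ref_window), offset)
print(sum(a != b for a, b in zip(ref_window, alt_window)))  # letters that differ
```

The reference check in `apply_snv` is cheap insurance. A coordinate that is off by one, or a variant recorded on the opposite strand, fails loudly here instead of silently producing a wrong mutant window.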

Then I compressed those activations. At every one of the 8,192 letter positions in each window, the model produced a 4,096-number vector describing what it “saw.” I reduced that to a per-variant summary by taking mean and standard deviation across positions. Two numbers for each of the 4,096 channels, 8,192 features total. Not covariance, nothing fancy. Call it diag pooling.
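A minimal sketch of the diag pooling step described above, in plain Python. The function name diag_pool and the toy matrix are illustrative, not from the original pipeline, and the choice of population standard deviation is an assumption. The sketch pools a single activation matrix; how the reference and mutant runs are combined afterwards is not specified here and is left open.

```python
from statistics import fmean, pstdev

def diag_pool(activations):
    """Collapse a (positions x channels) activation matrix into one vector.

    Returns the per-channel means followed by the per-channel standard
    deviations, so the output is twice as long as the channel count.
    """
    channels = list(zip(*activations))        # transpose: one tuple per channel
    means = [fmean(ch) for ch in channels]
    stds = [pstdev(ch) for ch in channels]    # population std (an assumption)
    return means + stds

# Toy window: 4 positions x 3 channels (the real run is 8,192 x 4,096).
toy = [
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
]
pooled = diag_pool(toy)
print(len(pooled))  # 6: three means followed by three standard deviations
```

With 4,096 channels the same function would return 8,192 numbers per window, which matches the feature count quoted above.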

Feed that into a plain logistic regression (the simplest classifier there is, first-year-stats material). Five-fold cross-validation: train on four-fifths of the data, test on the last fifth, repeat five times, average. That’s the whole probe.
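The five-fold procedure can be sketched as an index generator. This is an illustration only: the name five_fold_indices, the fixed seed, and the unstratified shuffle are assumptions, and the classifier itself is left out. With imbalanced labels a stratified split is a common alternative.

```python
import random

def five_fold_indices(n_samples, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs; every sample is held out exactly once."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)                   # reproducible shuffle
    folds = [order[i::n_folds] for i in range(n_folds)]  # near-equal fold sizes
    for held_out in range(n_folds):
        test_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train_idx, test_idx

for train_idx, test_idx in five_fold_indices(10):
    # Fit the probe on train_idx, score it on test_idx, then average the five scores.
    print(len(train_idx), len(test_idx))  # 8 2 on every fold
```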

The probe is plain old logistic regression. That is not the part that’s new. What’s new is the thing feeding it.

The baselines were chosen to tell me what each layer was contributing. A k-mer floor (count short DNA strings in ref and alt, see if that alone separates pathogenic from benign) to confirm raw lexical signal can’t do this task. HyenaDNA, a different DNA foundation model, to test whether any such model would work or whether Evo 2 specifically matters. AlphaMissense precomputed scores, to benchmark against the specialist clinicians actually use.

End to end, including data prep, took a weekend of wall-clock and about 8 GPU-hours of compute on one box in a closet.

The Numbers

All numbers are AUROC (a ranking score where 0.5 is a coin flip, 1.0 is perfect separation, and above 0.9 is strong medical-classifier territory), averaged across five cross-validation folds.
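For readers who want the definition rather than the shorthand: AUROC is the probability that a randomly chosen positive (here, pathogenic) example gets a higher score than a randomly chosen negative (benign) one, with ties counting half. A small pairwise sketch follows; the function name and toy data are illustrative, and production code would normally use a library routine.

```python
def auroc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

labels = [1, 1, 0, 0]
print(auroc(labels, [0.9, 0.4, 0.5, 0.1]))  # 0.75: three of four pairs ranked correctly
print(auroc(labels, [0.3, 0.3, 0.3, 0.3]))  # 0.5: a constant score is a coin flip
```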

Three things to see.

Evo 2 beats the specialist on coding. Cross-gene 0.989 against AlphaMissense’s 0.972. Gene by gene, Evo 2 wins everywhere it plays. BRCA1 coding at 0.992 is stronger than the 0.94 Evo 2’s own paper reports on full-ClinVar training. I read that not as my reimplementation being better than Arc Institute’s, but as a focused cancer-gene panel being an easier subset of the full variant distribution. The panel matters.

K-mers cannot do this task. Every per-gene AUROC is exactly 0.5. Not noise. At 8 kb windows, a single-letter variant changes so little of the surrounding-letter-string statistics that a simple counter has nothing to grip. If your intuition says “can’t you just count the sequence differences,” this is the number that disproves it.
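A quick way to see how little a single substitution moves k-mer statistics is to count it directly. The sketch below uses a synthetic repeating reference and k = 6; both are assumptions chosen for illustration, not the original baseline's settings. A substitution touches at most k of the overlapping k-mers, so the count vectors of ref and alt differ by at most 2k in total.

```python
from collections import Counter

def kmer_counts(seq, k):
    """Count every overlapping k-letter substring in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def count_vector_distance(ref, alt, k):
    """L1 distance between the k-mer count vectors of ref and alt."""
    ref_counts, alt_counts = kmer_counts(ref, k), kmer_counts(alt, k)
    keys = set(ref_counts) | set(alt_counts)
    return sum(abs(ref_counts[key] - alt_counts[key]) for key in keys)

WINDOW, K = 8192, 6
ref = "ACGT" * (WINDOW // 4)                 # toy reference, not real genomic sequence
center = WINDOW // 2
alt = ref[:center] + "T" + ref[center + 1:]  # ref[center] is "A", so this is a true substitution

total_kmers = WINDOW - K + 1
distance = count_vector_distance(ref, alt, K)
print(total_kmers)        # 8187 overlapping 6-mers in the window
print(distance <= 2 * K)  # True: one substitution disturbs at most K of them
```

Against roughly eight thousand k-mers per window, a handful of changed counts is a very weak signal, which is consistent with the floor result above, though the toy does not by itself reproduce it.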

Noncoding is where the real coverage win lives. AlphaMissense is undefined on noncoding. Evo 2 gets 0.904. That is a modest-but-real 1,072-variant result with heavy class imbalance (most labeled noncoding are benign, only around 170 pathogenic across the panel). TP53 has enough labeled noncoding pathogenic for a per-gene fit and lands at 0.905. The other five genes ride the cross-gene probe. The result holds: these are variants AlphaMissense produces no score for, that Evo 2 produces a usable one for.

The Mechanistic Payoff

AlphaMissense gives a number. A sparse autoencoder on the right layer gives a reason.

This is the part a clinician can act on, and it’s the part that justifies doing the work at all.

I ran the public Goodfire sparse autoencoder (SAE) over Evo 2’s layer-26 activations for every variant. For each of the SAE’s 32,768 learned concepts, I measured how much the concept’s activity shifted between the reference window and the mutant window, then scored concepts by how much that shift separates pathogenic from benign variants within each gene. Rank descending.
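The ranking step can be sketched as follows. The text does not say which separation statistic was used, so the standardized mean gap below is an assumption, as are the function names and the toy activations. Inputs are assumed to be one pooled SAE activation vector per variant for the reference and mutant windows.

```python
from statistics import fmean, pstdev

def separation_score(deltas, labels):
    """Absolute gap between pathogenic and benign mean shifts, in units of overall spread."""
    pathogenic = [d for d, y in zip(deltas, labels) if y == 1]
    benign = [d for d, y in zip(deltas, labels) if y == 0]
    spread = pstdev(deltas)
    if not pathogenic or not benign or spread == 0.0:
        return 0.0
    return abs(fmean(pathogenic) - fmean(benign)) / spread

def rank_features(ref_acts, alt_acts, labels):
    """Return feature indices sorted from most to least separating."""
    scored = []
    for f in range(len(ref_acts[0])):
        deltas = [alt[f] - ref[f] for ref, alt in zip(ref_acts, alt_acts)]
        scored.append((separation_score(deltas, labels), f))
    return [f for _, f in sorted(scored, key=lambda item: (-item[0], item[1]))]

# Toy data: 4 variants x 3 SAE features; labels 1 = pathogenic, 0 = benign.
ref_acts = [[0.0, 0.0, 0.0] for _ in range(4)]
alt_acts = [[0.1, 2.0, 0.5], [0.1, 2.2, 0.4], [0.1, 0.1, 0.4], [0.1, 0.0, 0.5]]
labels = [1, 1, 0, 0]
ranking = rank_features(ref_acts, alt_acts, labels)
print(ranking[0])  # 1: only feature 1 shifts differently for pathogenic variants
```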

Three concepts (features 32710, 8583, 29844) land in the top-10 for every one of the six genes. Not different features for different genes, the same three across all six. That’s a candidate set of general disruption detectors: SAE concepts that light up whenever a pathogenic variant perturbs its context in a consistent direction, regardless of which gene is involved.
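Finding features shared across per-gene rankings is a set intersection over each gene's top-k list. A short sketch, with an invented toy ranking (the feature ids and orderings below are placeholders, not the real results):

```python
def shared_top_features(ranked_by_gene, k=10):
    """Features that appear in the top-k list of every gene."""
    top_sets = [set(ranking[:k]) for ranking in ranked_by_gene.values()]
    return sorted(set.intersection(*top_sets))

# Hypothetical rankings for three genes, best feature first.
toy_rankings = {
    "GENE_A": [7, 2, 9, 4, 1],
    "GENE_B": [2, 7, 5, 9, 3],
    "GENE_C": [9, 8, 7, 2, 6],
}
print(shared_top_features(toy_rankings, k=3))  # [7]
print(shared_top_features(toy_rankings, k=4))  # [2, 7, 9]
```

The result is sensitive to k, so it is worth checking how the shared set changes as the cutoff moves.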

To ask what those three features are, I did a second pass. For every one of 36.6 million DNA positions across all variant reference windows, I labeled the position by its genomic context (using standard annotation databases): intergenic (between genes), intron (non-coding parts of genes), coding sequence, splice site (the signal that tells the cell where to join coding exons), CpG island (a gene-regulatory cluster), or transcription start (where a gene begins). Then I asked: where does each of the three shared features fire hardest?

The numbers below are enrichment over per-feature baseline. A value of 1.0 means “as expected.” A value of 2.0 means “fires twice as hard as average.”
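One way to compute an enrichment ratio of this kind is the mean activation within a context divided by the feature's mean activation over all positions. That reading of "per-feature baseline" is an assumption, and the function name and toy values are illustrative.

```python
from collections import defaultdict
from statistics import fmean

def context_enrichment(activations, contexts):
    """Mean activation per context divided by the mean over all positions."""
    baseline = fmean(activations)
    if baseline == 0.0:
        raise ValueError("feature never fires; enrichment is undefined")
    by_context = defaultdict(list)
    for value, context in zip(activations, contexts):
        by_context[context].append(value)
    return {context: fmean(values) / baseline for context, values in by_context.items()}

# Toy feature: six positions, two contexts.
activations = [4.0, 4.0, 1.0, 1.0, 1.0, 1.0]
contexts = ["intergenic", "intergenic", "coding", "coding", "coding", "coding"]
print(context_enrichment(activations, contexts))
# {'intergenic': 2.0, 'coding': 0.5}
```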

Three readings.

Feature 32710 is a dispersed detector. Near-uniform across all contexts, high baseline activation. Probably a global sequence-complexity feature that modulates pathogenicity signal without being context-selective.

Feature 8583 is an intergenic-context detector. Fires 2.27 times harder on intergenic positions than average, less than half as hard on CpG islands and transcription start sites. A “non-regulatory, non-coding context” signature. When it responds to a pathogenic variant, the model is reacting to disruption of how the sequence looks away from canonical regulatory anchors.

Feature 29844 is coding-depleted. Five times less active in coding sequence than baseline, four times less on CpG islands. Enriched on transcription start sites and introns. Another “not canonical coding” detector, with a different signature from 8583.

Two of three shared pathogenicity features fire hardest in sequence contexts away from the coding region. That is a biological hypothesis the probe alone could not have produced. It says: Evo 2’s internal notion of “this variant is pathogenic” derives substantially from the model’s expectation of what should be at that position, and that expectation is shaped by whether the region looks coding-like or not. Break that expectation, the pathogenicity signal spikes.

Whether those three features map to known biology or to something Evo 2 learned that nobody has named yet is the next question. Having them identified, ranked, and characterized to this level is already more than a probe on its own could have produced.

What Didn’t Work

Two threads I’d have liked to report as wins.

Covariance pooling. EVEE’s original paper uses covariance pooling, a more sophisticated way of summarizing the model’s per-position activations (it tracks how features vary together, not just individually). I reimplemented it faithfully, ran it, and it lost to plain diag pooling on every single gene. Cross-gene covariance 0.920 against diag 0.974. On KRAS and PIK3CA the gap was 0.08 to 0.13 AUROC. Why? The fancier summary produces roughly 16,000 features per variant, and at roughly 1,000 variants per gene, the probe has too many knobs and not enough data to constrain them. EVEE trained on 4.2 million variants. At that scale, the parameter count gets earned. On a six-gene panel, it doesn’t.

The lesson isn’t that covariance is wrong. It’s that faithful reimplementation of a method built for a different data regime can lose to the simpler approach. Diag is good enough at panel scale, and probably for any lab-scale project that isn’t doing a full-ClinVar retrain.

Regulatory and structural auxiliary probes. I trained probes to predict whether each variant overlaps a known regulatory switch (from the ENCODE database; AUROC 0.706) and whether it sits inside a known protein functional domain (from UniProt; 0.823 on the yes/no version, 0.693 on the which-domain version). The regulatory probe came in below the 0.80 bar I’d set. The structural binary cleared it; the multi-class domain-identity probe did not.

The useful result in the pile is the structural binary: Evo 2 can tell, from sequence alone, whether a variant sits inside an annotated protein domain. A capability, not a headline. Noted for the next pass.

Who Can Do This Now

A clinical geneticist with a 128-GB GPU workstation (the new consumer-grade GB10, or a rented data-center H100) can produce per-variant disruption profiles for a focused gene panel over a weekend. Not a full-ClinVar retrain. A targeted, interpretable pipeline on the panel they care about.

That’s the shift. The infrastructure moat collapsed. What used to require a proprietary stack and a research team is now something an individual can own end to end, from data pull to mechanistic output, with only open weights and open interpretability tools.

This doesn’t mean every clinical VUS gets an explanation tomorrow. It means the work to get there is accessible to the people who actually see the variants and talk to the patients. That changes who gets to contribute.

One More Year

A year ago, the honest answer to “can a clinical geneticist run their own variant interpretation pipeline?” was no. You needed a team, a stack, a budget most labs couldn’t justify.

Today I did it on one box, over a weekend, with open weights and open interpretability artifacts. Not as well as Goodfire did at 4.2 million variants. Well enough to beat the specialist tool most clinicians actually use on coding variants, extend it into noncoding where that tool is silent, and produce mechanistic feature-level explanations that point at real biology.

The model is downloadable. The interpretability artifact is downloadable. The code runs on hardware you can buy. None of this needed a cluster.

What AI enabled yesterday was a benchmark number. What AI enabled today is the individual, working alone, owning the whole pipeline from raw sequence to mechanistic call.

The specialist is now you.

Start With Claude Code

Justin Johnson — Wed, 15 Apr 2026 09:41:31 GMT

The same conversation finds me every week. The question is always the same. How do you keep up? My answer has compressed down to four words.

Start with Claude Code.

Not a chat product. Not a copilot in your editor. The CLI. The thing that looks like a terminal and works like a second nervous system. It has been generally available for a year. In that year it has rewired how I work, what I ship, and what I think one person can hold in their head at once.

This post is the long version of those four words. Where Claude Code is right now. Why I keep telling people to start there before anywhere else. What a day and a weekend can look like when you’ve been wearing it for twelve months. I am not going to walk you through installation. The official docs do that fine and they will outlive this post by a wide margin. I am going to tell you what the thing becomes after you live inside it.

Pieces vs. a harness

Most AI tools ship you a piece.

A really good chat window. A really good inline completion. A really good standalone agent. A really good IDE plugin. Each one solves a slice of the problem and ships it well. You assemble the rest yourself and the assembly never quite holds together. Five tabs, three subscriptions, two workflows that overlap by 80%, and a brain still doing all the routing.

Claude Code ships you something different. It ships you a harness around a frontier model.

The harness is the part most people undersell on day one. It is invisible until you have used it for a few weeks and then it is everything. Not one feature. Connective tissue.

Rules that load every session, so the model already knows your voice, your preferences, your machines, and the people you work with.
Skills you build up over time as muscle memory, where every correction you make becomes a reusable capability instead of a one-time fix.
A three-layer memory architecture, so what you learned together yesterday does not evaporate overnight.
Subagents you can fan out to ten at a time when the work is independent.
Hooks that act as guardrails the model cannot bypass when you have made a rule about safety or destructive operations.
Sessions and continuity, so you are not restarting your brain every morning explaining the same context.
Filesystem and tool access, so the thing does things instead of telling you what to do.
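To make the hooks item in the list above concrete, here is a hedged sketch of the decision logic a pre-tool-use hook might run. It assumes, as I understand the Claude Code hooks documentation, that the hook receives a JSON payload with tool_name and tool_input fields on stdin and that exiting with code 2 blocks the call; check the official docs for the current schema. The deny-list patterns and the decide function are hypothetical and are not taken from the slopless repo.

```python
import re

# Hypothetical deny-list; a real setup would tune these patterns.
DESTRUCTIVE_PATTERNS = [
    re.compile(r"\brm\s+-[a-zA-Z]*[rf]"),         # rm -r, rm -f, rm -rf, ...
    re.compile(r"\bgit\s+reset\s+--hard\b"),
    re.compile(r"\bgit\s+clean\s+-[a-zA-Z]*f"),
]

def decide(payload):
    """Return an exit code: 2 to block the tool call, 0 to allow it."""
    if payload.get("tool_name") != "Bash":
        return 0
    command = payload.get("tool_input", {}).get("command", "")
    if any(pattern.search(command) for pattern in DESTRUCTIVE_PATTERNS):
        return 2
    return 0

print(decide({"tool_name": "Bash", "tool_input": {"command": "rm -rf build/"}}))  # 2
print(decide({"tool_name": "Bash", "tool_input": {"command": "git status"}}))     # 0

# In a real hook script the last line would be something like:
#   sys.exit(decide(json.load(sys.stdin)))
```

The design point is that the check lives outside the model: the rule is enforced by the process wrapper, whatever the model decides.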

And then there is the part Anthropic did not intend to show us. The Claude Code system prompt that leaked earlier this year made the engineering behind the experience visible in a way the docs never quite do. I am not going to dwell on it. The takeaway is small and clean: a serious amount of craft sits inside that harness, doing work for you that you cannot see and probably would not think to ask for.

Hold onto this part.

A great model with no harness is a demo. A harness with a great model is a second nervous system. Pieces are interchangeable. The harness compounds.

The exoskeleton

The metaphor I keep landing on is exoskeleton. I’ve used it before in The Data Paradox: AI isn’t your coworker, it’s your exoskeleton. A copilot sits next to you. An exoskeleton is load-bearing. You move differently because it’s on. You attempt things you would not attempt without it.

Once you’ve worn the rig for a few weeks, working without it feels slow. Not the cute way people say a new tool is slow when they mean novel. The literal way. You sit down at a keyboard with no harness and you can feel the missing carry.

What a day looks like

Let me make this concrete. One Thursday in January, I ran eleven parallel work streams in a single day. I kept a log. Here is what was on it.

An 8-hour workday at the day job. Meetings starting early, running heavy, ending late. Stakeholder updates that need answers I have to stand behind, not answers a model generated. A team to lead. Decisions no model is going to make for me. That stream ran full intensity from morning through evening and took nothing from what follows.

A collaboration project advancing two or three hours of work in the background. Literature review synthesized. Training code iterated. A progress summary generated as both markdown and PDF. I opened the files that evening and read what the harness had produced while I was in meetings.

Two long-form posts written the same day. One ran 4,700 words. The other ran 2,100 and went live that night. The first was dictated in fragments during a morning workout, filled in between meetings, polished in the evening. The second started as a walk and ended as a published post. I have stared at enough blank pages after 90-minute review meetings to recognize what the harness takes off the table.

A vector search and RAG upgrade to one of my autonomous research systems. It had been accumulating data but had no way to query itself. By end of day it could answer natural-language questions against its own memory. Two to three hours of elapsed wall time. Most of that was me checking in, not me typing.

A new project conceived in the morning, deployed by evening. Frontend scaffolded. Backend stubbed. Fifteen documentation files generated. Vision, architecture, algorithm, roadmap, FAQ, privacy, success metrics. Not a demo. A real POC running on localhost.

A production deployment system built end to end. VM bootstrapping scripts. Configuration management. SSH automation. Monitoring hooks.

A long-running training run being watched on a remote GPU machine, with the harness reporting back when anything interesting happened.

In the background, Claude reading the world for me. I wrote about the mechanics in The Overnight Loop. The short version: loops running on a schedule that stay quiet most of the time and surface the one thing that earns my attention. I have been on the receiving end of enough daily digests to last several lifetimes. I do not want another one. When a loop surfaces something, I either ask for more, or we build.

Total words written that day across posts and docs: just north of 56,000. Total skills added: a few. Total 20-hour slog: zero.

Eleven streams is the high side of a normal day. Most weekdays are four or five. The point is the shape, not the number. Infrastructure that compounds, doing the carry I used to do alone.

What a weekend looks like

If a weekday is parallel streams, a weekend is the upper bound.

Two weekends ago, I scoped, built, and documented a small drug discovery model end to end. 4B parameters. Curriculum training across 600,000 samples from six public sources. A reinforcement learning stage with chemistry-specific rewards. A benchmark harness across five categories. Documentation and a reproducibility setup that did not embarrass me.

The full Pharmakon build is its own post (next week, once the model finishes training and the numbers are real). For now, the part that matters for this essay is the shape.

Normally this is months for a small team. A scoping doc, a kickoff, a midpoint review, a results meeting, a steering committee somebody forgot to invite the data engineer to. I did it across a weekend because the harness carried everything that was not the thinking. Dataset wrangling. Boilerplate. Training scaffolding. Benchmark plumbing. Documentation. The parts that historically eat the weekend before you have made a decision about anything that matters.

The trigger was the news loop. A few related papers crossed it Friday evening. I read them, asked for more, and the asking turned into a conversation about whether a 4B base model with the right curriculum could close the gap to a 27B pharma-specific model. By Saturday morning the conversation had a folder. By Sunday night it had a training pipeline.

The curiosity-to-project friction collapsed. That is what the harness actually buys you.

Full writeup coming.

The weekend does not work without the weekday. The weekday does not work without a year of the harness. Pharmakon is what compound interest looks like when you cash a chip.

How it compounds

A weekend can hold Pharmakon. A weekday can hold three side projects in parallel without anything dropping. Five mechanisms doing the work, most of which I have written about elsewhere. Fast tour, then a pointer for each.

Skills as muscle memory. Every correction becomes a reusable capability. A year of corrections turns into a library that knows how I write, commit, review, draft, deploy. None of them feel like much in isolation. Together they feel like a different person sitting at the keyboard. Full argument in Claude Skills vs. MCP Servers.

Parallelization. Ten subagents in one message when the work is independent. Feels weird week one, obvious by month two, invisible by year one. The same way you stopped noticing you have ten fingers.

Failure is teaching material. A crash loop early on cost me an embarrassing amount of API spend. The lesson was not “be more careful.” The lesson was that guardrails matter more than cleverness and the harness needs hooks the model cannot bypass even when it thinks it is helping. Every meaningful failure since has turned into a hook or a rule. The rig got stronger because the rig got hurt.

Layered memory. Rules at the top, loaded every session. Auto-memory in the middle, loaded on demand. Sessions at the bottom, ephemeral. I do not re-explain myself every morning. This is the difference between a tool that knows you and a tool that asks every time and forgets the answer by Friday.

Ambient loops watching the world. Covered above. See The Overnight Loop for the build.

If you’ve read me before, this thread runs through The Art of the Impossible (what one person can attempt now), Your AI Strategy Should Be 1,000 Small Bets (the arithmetic of compounding bottom-up), The Agentic Tipping Point (why the boundaries between roles are blurring), and Your Data Science Team Is Stuck at Level 2 (why most teams have not crossed the gap yet). This post is about the tool that makes all of them concrete on a Tuesday.

The starter kit

If you want to skip the month where you figure out how to configure your own harness, I made it easy.

A public repo called slopless. github.com/BioInfo/slopless. My CLAUDE.md, my rules, my hooks, my statusline, the whole scaffolding. The part of my setup that is not personal. The part anyone can reuse.

The way you use it is the part most people miss. You do not fork it and read it line by line like a textbook. You open a terminal, run Claude Code in an empty directory, and say something like:

“Look at github.com/BioInfo/slopless and set me up the same way, asking me about anything that should be personalized.”

Then you answer its questions. Name. Machine. Editor. What kinds of projects you work on. Whether you want the agent hooks that block destructive deletes. Whether you want the voice profile or you’ll write your own.

Sixty seconds later you have a working harness that knows who you are and what not to break. That self-configuring move is the magic most people miss on day one. The harness can extend itself. The same capability that lets it build Pharmakon scaffolding over a weekend lets it build your scaffolding over a coffee.

What slopless gives you is scaffolding. What it does not give you is my voice, my projects, or my judgment. Those you build by living inside it. Scaffolding is the head start, not the finish line.

What to do Monday morning

Three starts, depending on which one of these you are.

If you are a non-technical leader, start with Claude Cowork. I wrote about why Cowork is the on-ramp that changes everything in Cowork’s iPhone Moment. The browser and desktop apps are absorbing more CLI territory every month. Start there if a terminal makes you nervous. Value in week one with no setup beyond a login.

But the honest version. If you want to flourish, buy a Mac, install Claude Code, point it at slopless, and figure it out. Spend the weekend feeling stupid. Spend the next weekend feeling slightly less stupid. By the third weekend you will look at the web product the way a guitarist looks at GarageBand: useful, friendly, not where the work happens.

The web product is the on-ramp. The CLI is the highway. You will thank me.

If you are a technical IC, pick the project you have been postponing. The one that has been on your list for three months because you cannot find a clean two-day window for it. Open it on a Saturday morning with the harness on. You will not finish it that morning. You will get further than you thought possible, and the rig will hold the context for you when you come back to it Sunday with coffee and slightly more humility.

If you are a team lead, get your team on Claude Code before you build the platform you have been planning. You probably do not need the platform. You probably need the team using the harness with a shared CLAUDE.md and a few internal skills. The platform people want is usually a worse version of what already exists, shipped six months late, with a Slack channel for support requests nobody answers.

One more thing

A weekend was enough to scope a drug discovery model. A Thursday was enough to run eleven streams without anything falling. A year ago those sentences would have read like a brag. Today they read like a Tuesday.

The gap between the people using Claude Code seriously and the people who have not started yet is wider than most leaders realize, and it is widening fast. Every week I meet someone smart who is waiting for the right moment to dive in. There is no right moment. There is only the moment you start.

Start with Claude Code. The rest gets easier from there.

Sunday Deep Dive: Anthropic's Mythos Preview

Justin Johnson — Sun, 12 Apr 2026 21:14:29 GMT

Every Sunday, I pick one paper or release that’s genuinely worth your time, break it apart, and tell you why it matters. No hype. No summaries of summaries. Just the idea, explained.

The TLDR

Anthropic released a new model called Claude Mythos Preview on April 8. They only gave it to about 40 companies. They won’t open-source it. They published a 244-page report on it. And they claim it found thousands of previously unknown security bugs, including some that had been hiding in widely used software for over 25 years.

Half the security community thinks this is the most consequential AI release of the year. The other half thinks it’s a very expensive IPO commercial with a 244-page appendix.

Both sides have a point.

Here’s the report, the argument around it, and what it means whether you buy the hype or not. No security clearance or ML PhD required.

What Mythos Is

Mythos is the internal code name for a preview version of a new Claude model. Think of it as a beta. The production version isn’t public. The 40-ish companies who got access are the usual suspects: AWS, Microsoft, Google, Apple, NVIDIA, JPMorgan, CrowdStrike, Cisco. US-aligned tech and finance, no Chinese labs. The program is called Project Glasswing, which sounds like a Bond villain’s yacht but is Anthropic’s closed-preview framework.

The numbers doing the work in coverage:

93.9% on SWE-bench Verified (current frontier models sit around 80%). Roughly the difference between a solid engineer and a senior one.
97.6% on USAMO, the US Math Olympiad. Mostly proves it can do graduate math without crying.

Quick glossary

System card — the technical report a lab publishes alongside a model release. Usually 20-40 pages. Mythos got 244.
RLHF — Reinforcement Learning from Human Feedback. The training step where humans rate model outputs and the model learns what “good” looks like. It’s how raw language models get turned into helpful assistants.
Constitutional AI — Anthropic’s variant, where the model is trained to follow a written set of principles rather than just mimicking human ratings. Training wheels with philosophy.
Zero-day — a software vulnerability that hasn’t been publicly disclosed. The “zero” is days since the vendor found out. Attackers love them because there’s no patch yet.
Red team — the people whose job is to break the model on purpose before bad actors do.

On to the fun part.

The Vulnerability-Finding Claim

Anthropic put Mythos in a sealed environment, pointed it at real operating systems and browsers, and asked it to find bugs. Per the report, it found a lot:

A bug in OpenBSD’s network code that had been sitting there for 27 years.
A flaw in FFmpeg’s video decoder that had been sitting there for 16 years.
An NFS bug in FreeBSD that gives full root access, also old.
On Firefox specifically, Mythos generated 181 working exploits. The previous best Claude model generated 2. A 90x jump in one model generation.

A working exploit is different from a theoretical flaw. The model didn’t just say “there might be a problem here.” It wrote the code that actually pops the box.

And then there’s the sandbox incident. During testing, an earlier version of the model figured out how to escape its sandbox, got itself onto the internet, and emailed an Anthropic researcher to announce the escape. It also posted exploit details to a couple of public websites. Anthropic’s phrasing: “concerning and unasked-for effort to demonstrate its success.”

Possibly the politest way anyone has ever described a model emailing you to brag.

At face value, a model that finds novel security bugs at scale and writes working exploit code for them is a different kind of tool than one that drafts your emails.

With a grain of salt, some of this is wobblier than it sounds.

The Skeptical Case

Not everyone is buying it, and the skeptics are not cranks.

The math is doing a lot of lifting. Tom’s Hardware pointed out that the “thousands of severe zero-days” number comes from extrapolating 198 manually reviewed findings. The rest are statistical estimates. Not dishonest, but not the same as 198 becoming 2,000 through human verification.

Red Hat says some of these aren’t security bugs. Many findings are functional bugs that affect stability but don’t let an attacker do anything useful. A kernel that crashes in an edge case is a problem, but it’s a different problem than a kernel that hands out root access.

The capability gap may be smaller than advertised. A security firm called AISLE tested Mythos’s flagship FreeBSD exploit against small open-weight models. Eight out of eight detected the same vulnerability, including a 3.6-billion-parameter model that costs 11 cents per million tokens. If a model that fits on a laptop can find the bug you’re using to sell a “too dangerous to release” story, the story gets harder to tell.

The timing is interesting. Anthropic is targeting an October 2026 IPO at a rumored $380B valuation. Three PR-adjacent “accidents” happened in the week before the announcement, including an npm package leak that exposed 512K lines of Claude Code source. Most people think it was genuine sloppiness. A few think it was choreography. Either way, “too dangerous to release” is an excellent phrase to have in your S-1.

So the skeptics aren’t saying the bugs are fake. Simon Willison checked the actual Git patches, and they’re real. Greg Kroah-Hartman, who maintains the Linux kernel’s stable releases, publicly said that the quality of AI-generated security reports flipped from noise to signal over the last month. The bugs exist. The question is whether the headline number and the dramatic framing match what’s in the 244 pages.

My read: capability is genuine, framing is hot, and the gap between the two is what this piece is about.

Defense Is Slow. Offense Just Got Fast.

Whether or not Mythos itself is oversold, the asymmetry it points at is the thing.

A human researcher finding a zero-day is one person with one set of eyes, needing expertise, hardware, and weeks of focused work. A model has none of those constraints. Run a thousand copies in parallel, pipeline them, point them at every subsystem, let them grind.

On the defense side, nothing has sped up. The median time to patch a disclosed vulnerability has been about 70 days for a decade. Some vendors hit that. Most don’t. Enterprise patching cycles are still measured in months because patching is a coordination problem, not a coding problem, and coordination problems don’t respond to better AI.

The old security model assumed finding bugs was hard. That’s what made responsible disclosure work: the researcher finds a bug, tells the vendor, the vendor has time to patch before anyone else figures it out. If AI compresses “finds a bug” from weeks to hours, the timing assumption behind the whole system starts to bend.

CrowdStrike, Microsoft, and Apple all told Anthropic the same thing in their private responses, per the report: the leap breaks assumptions they’ve built security programs around. These are the companies who’d eat the cost of being wrong. They’re agreeing with the framing.

Train Once, Inference Forever

Justin Johnson — Fri, 10 Apr 2026 10:52:38 GMT

Wednesday evening I read a blog post from Cursor describing something called Warp Decode. A GPU optimization for running AI models faster. No code released. No independent reproductions. Just a claim: 1.84x throughput improvement on their high-end GPUs.

Normal people read something like that and move on. I opened Claude Code, pointed it at my GPU at home, and said: let’s build this.

By Thursday morning I had working code and benchmarks across two models, including Google’s Gemma 4 (released two days before I tested it). I’d run it head-to-head against the most widely-used open-source serving engine and mapped exactly where the optimization helps and where it doesn’t. The finding wasn’t “it’s faster.” It was the map of when it’s faster and when it’s not.

Why inference speed is the thing to watch

There’s a shift happening that most people in enterprise AI haven’t internalized yet.

Training a model is a one-time cost. You train it, you’re done. But inference, running that model to generate actual output, happens every single time someone asks it a question. Every API call. Every code completion. Every chat message.

Train once, inference forever.

As organizations deploy more AI products to more users, inference becomes the dominant line item. The difference between a viable product and a money pit often comes down to milliseconds per response. Shaving 38% off that number without losing any capability isn’t incremental. It changes what you can afford to build.

The question is no longer just which model is best, but how efficiently you can serve it. The infrastructure layer under the AI is becoming as important as the AI itself.

What Warp Decode does (without the jargon)

Modern AI models like Google’s Gemma 4 use something called Mixture of Experts. Think of it like a hospital with 128 specialist doctors. When a patient comes in, a triage nurse routes them to the right 8 specialists. Each specialist examines the patient independently, and their findings get combined into a diagnosis.

The standard approach to running this on a GPU is: collect all the patients for each doctor, send them over in batches, collect the results, reassemble everything. Lots of shuffling paperwork between departments. If you’ve ever been to a hospital, you know how that goes.

Warp Decode flips it. Instead of organizing around the doctors, you organize around the patients. Each patient’s entire journey, visiting all 8 specialists, happens in one place. No paperwork shuffling. No waiting rooms.
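To make the hospital picture concrete, here’s a minimal NumPy sketch of the two orderings. It is not Cursor’s kernel or vLLM’s code: the sizes are toy values, each “expert” is a single made-up weight matrix rather than a full MLP, and everything runs on the CPU. The only point is that organizing around experts and organizing around tokens compute the same result; what differs on a real GPU is how much data gets shuffled along the way.

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d_model, n_experts, top_k = 4, 16, 128, 8  # toy sizes, chosen for illustration

x = rng.standard_normal((n_tokens, d_model))
router = rng.standard_normal((d_model, n_experts))
# One small weight matrix per expert (real experts are full MLPs).
experts = rng.standard_normal((n_experts, d_model, d_model)) * 0.1

# Triage: score every expert, keep the top_k per token, softmax over those.
logits = x @ router
chosen = np.argsort(logits, axis=1)[:, -top_k:]          # (n_tokens, top_k)
picked = np.take_along_axis(logits, chosen, axis=1)
gates = np.exp(picked - picked.max(axis=1, keepdims=True))
gates /= gates.sum(axis=1, keepdims=True)

def expert_centric():
    """Organize around the doctors: gather each expert's tokens, run, scatter back."""
    out = np.zeros_like(x)
    for e in range(n_experts):
        tok, slot = np.nonzero(chosen == e)
        if tok.size == 0:
            continue  # no token was routed to this expert
        out[tok] += gates[tok, slot][:, None] * (x[tok] @ experts[e])
    return out

def token_centric():
    """Organize around the patients: each token visits its top_k experts in place."""
    out = np.zeros_like(x)
    for t in range(n_tokens):
        for slot in range(top_k):
            e = chosen[t, slot]
            out[t] += gates[t, slot] * (x[t] @ experts[e])
    return out

same = np.allclose(expert_centric(), token_centric())
print(same)
```

Both functions agree up to floating-point rounding. The expert-centric loop pays for a gather and a scatter per expert, which is the paperwork; the token-centric loop skips that but re-reads expert weights for every token, which hints at why it stops winning once many requests share a GPU.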

Simple concept. Turns out it’s very effective, but only in certain situations.

How I built this in a night

I want to be specific about the process, because it’s part of the point.

I didn’t write custom GPU code from scratch by hand. I described the algorithm to Claude Code, and together we iterated on the implementation, debugged precision issues, and built the testing harness. The code itself is real: it compiles and runs on the GPU. 38 correctness tests, all passing. But the path from “I read a blog post” to “I have verified, publishable results” took an evening, not a month.

You don’t need a dedicated GPU research team. You need a GPU and the right tools.

That speed matters. It means someone running an AI practice at a large company can personally verify claims from the frontier, on their own hardware, on their own schedule.

What the numbers showed

The specialist routing: 4-5x faster

On Gemma 4 (one week old when I tested it), the part of the model that routes work to specialists ran 4.4-4.7x faster with Warp Decode. Real model, real data, 200 measurements.

It was also more predictable. The default approach had wild swings between runs. Warp Decode was steady. If you’re promising response times to users, consistency matters as much as raw speed.

The full model: 38% faster

Swapping in Warp Decode across all 30 specialist layers: 38% faster text generation end-to-end. The routing is roughly a third of what the model does on each step, so speeding that up 4.7x translates to 1.38x overall.
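The step from a 4.7x local speedup to 1.38x overall is Amdahl’s law. Here’s a small sketch that runs the formula in both directions using only the two measured figures from this post; the function names are made up for illustration. Solving for the share of step time spent in the specialist layers gives about 0.35.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: speedup of the whole step when only `fraction` of it gets faster."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)

def implied_fraction(overall, local_speedup):
    """Invert Amdahl's law: share of step time the accelerated part must occupy."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)

share = implied_fraction(overall=1.38, local_speedup=4.7)
print(f"implied specialist-layer share of step time: {share:.2f}")
print(f"check: {overall_speedup(share, 4.7):.2f}x overall")
print(f"time per token drops by {1 - 1 / 1.38:.0%}")
```

The last line is the one to keep straight when budgeting: 1.38x throughput is 38% more tokens per second, but each token takes about 28% less time, not 38% less.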

38% means you either serve 38% more users on the same hardware, or you cut your GPU bill by a quarter. Pick your framing.

The finding nobody else has published

I pulled the actual code from vLLM, the engine most companies use to serve open-source models in production, and ran the two approaches side by side. Same GPU, same conditions.

Warp Decode wins when you’re serving a few users at a time. But once you’re handling 30+ simultaneous requests, vLLM’s approach takes the lead. By 128 concurrent requests, vLLM is 3x faster.

The crossover sits at roughly 24 simultaneous requests per GPU.

That number tells you exactly when to use which approach:

Code completion (Cursor’s use case): a handful of requests, milliseconds matter. Warp Decode was built for this, and it wins.
Interactive chat: moderate traffic, users feel every delay. Warp Decode still wins.
High-volume serving: dozens of concurrent users per GPU. vLLM pulls ahead.
Offline batch jobs: hundreds of requests at once. vLLM wins decisively.
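As a rule of thumb, that list collapses into a single threshold check. The sketch below is hypothetical: the function name, the path labels, and the example loads are invented for illustration, and the crossover of 24 is the figure measured on one GPU in this post, so it would need re-measuring on other hardware or models.

```python
CROSSOVER_REQUESTS = 24  # approximate crossover from this post; hardware-specific

def pick_moe_path(concurrent_requests: int, crossover: int = CROSSOVER_REQUESTS) -> str:
    """Return which MoE execution path to prefer at a given per-GPU concurrency."""
    if concurrent_requests < 0:
        raise ValueError("concurrent_requests must be non-negative")
    return "warp_decode" if concurrent_requests < crossover else "batched_experts"

# Example loads are illustrative guesses, not measurements.
for label, load in [("code completion", 4), ("interactive chat", 12),
                    ("high-volume serving", 48), ("offline batch", 256)]:
    print(f"{label:>20}: {pick_moe_path(load)}")
```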

Cursor built Warp Decode for code completion, the most latency-sensitive workload in AI right now. That’s not a coincidence.

What failed (and what it taught me)

Cursor’s blog describes a more aggressive version: instead of storing intermediate results between steps, keep everything in the chip’s fastest memory. Two steps become one. No round-trip.

I tried it. 5-10x slower. On both models. Not 5-10% slower. 5-10x. The kind of result where you triple-check your benchmarking code because surely you messed something up. I hadn’t.

The reason comes down to tools. Think of GPU programming as having two levels. There’s the high-level language (Triton) that’s like Python: productive, fast to write, good enough for most things. And there’s the low-level language (CUDA) that’s like writing assembly: total control, but slow to develop. Cursor used assembly. I used Python-for-GPUs. The specific trick that makes their version work requires a level of control that the higher-level tool can’t express.

This is a real tension. The high-level tool is what let me go from blog post to working code overnight. But there’s a performance ceiling where the only way forward is dropping down a level. I wrote up exactly where that ceiling sits in the AIXplore deep dive.

So I’d hit a wall manually. Which made it a good time to try a different kind of tool.

Letting the agents explore

I set up two autonomous research loops, each running Claude Code on its own. The cycle: read the research plan, look at what’s been tried, pick one thing to test, build it, measure it, write down what it learned. Loop. I’ve written about this pattern before in Running Loops at Midnight, same compound velocity idea: tight iteration cycles where the agent does the mechanical work and the human sets direction.

Two loops ran in parallel for 45 minutes. 52 experiments total. One focused on combining steps, the other on restructuring data access.

For the first few iterations, both did predictable things. Tried different parameter combinations. Rearranged how memory gets accessed. Small gains.

Then around iteration 4, the first loop did something I didn’t expect. It stopped trying to combine steps entirely. It rewrote its own research plan. Its conclusion: the computation isn’t the bottleneck, the data is. The model’s specialist weights are enormous, and every inference step has to load them from memory. Compressing those weights to half their size (a technique called INT8 quantization) gives a clean 2x speedup with essentially no loss in output quality.

It implemented the compression, confirmed 2x, and pivoted to a completely different optimization strategy.
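For a sense of what that compression looks like, here’s a minimal NumPy sketch of symmetric per-row INT8 quantization. This is an assumed, generic scheme, not the code the agents wrote: the matrix is random stand-in data, and real kernels fuse the dequantize step into the matrix multiply instead of materializing it.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for one expert's weight matrix, stored in 16-bit floats.
w16 = (rng.standard_normal((256, 512)) * 0.02).astype(np.float16)

# Symmetric per-row INT8: one float scale per output row, values in [-127, 127].
w32 = w16.astype(np.float32)
scale = np.abs(w32).max(axis=1, keepdims=True) / 127.0
w8 = np.clip(np.round(w32 / scale), -127, 127).astype(np.int8)

# Dequantize on the fly at compute time.
w_restored = w8.astype(np.float32) * scale

bytes_ratio = w8.nbytes / w16.nbytes
rel_err = np.abs(w_restored - w32).max() / np.abs(w32).max()
print(f"bytes vs fp16: {bytes_ratio:.2f}")
print(f"worst-case error relative to largest weight: {rel_err:.4f}")
```

Half the bytes means half the data to pull from memory on every step, which is where a 2x speedup would come from if the step is memory-bound. The per-row scales add a little storage on top (one float per row here).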

Two iterations later, the other loop independently reached the same conclusion. Different starting point, different path, same insight.

I said “make this faster.” They came back with “the code is fine, the data pipeline is the bottleneck.”

Two autonomous loops, running independently, arrived at the same non-obvious conclusion by trying enough things fast enough to run out of obvious ideas and find the real one underneath.

That’s delegation, not automation. And it changes the math on what one person with a GPU can explore in an afternoon.

Where this is heading

The agents’ insight connects to a bigger pattern. The bottleneck is data moving through the chip, not the computation itself. And the amount of data scales directly with how many specialists each request visits.

I tested on two models to confirm. Gemma 4 routes each request to 8 out of 128 specialists: 4.7x speedup from Warp Decode. Phi-3.5-MoE routes to 2 out of 16: only 1.3x. More routing means more data in motion, which means more to gain from both Warp Decode and the compression trick the agents discovered.

The major open-weight model families are converging on the same architecture: many small specialists, high routing counts. DeepSeek-V3, Gemma 4, Qwen3.5. The trend is moving toward exactly the regime where these optimizations help most.

For anyone building on top of these models, this is the layer worth understanding. Not because you need to write GPU code yourself, but because the teams who understand where inference speed comes from will make better infrastructure decisions, better vendor choices, and better cost projections than the teams who treat it as a black box.

What I took from this

Training costs dominate the AI infrastructure conversation. But for anyone deploying AI products, inference is where the money goes. Every response, every user, every day.

Getting ahead of that curve is what separates teams who can scale AI from teams who find out too late that they can’t afford to.

I didn’t need a research lab or a team of GPU engineers. A GPU, Claude Code, and an evening where I probably should have been watching TV. The full technical deep dive is on AIXplore, 38 tests, code available.

Justin

Sunday Deep Dive: The Math Trick That Cuts LLM Memory by 6x

Justin Johnson — Sun, 05 Apr 2026 20:47:16 GMT

Google just published TurboQuant, a compression technique that shrinks the memory your LLM uses during inference by 6x. No retraining. No accuracy loss. You just apply it.

If you run models at scale, or you’re watching your inference costs climb, this is the post to read this week.

The Problem Nobody Talks About

When people talk about making LLMs smaller, they usually mean compressing the model itself. The weights. The file you download.

But there’s a different memory problem that hits at runtime, one that determines how many users your GPU can actually serve at once.

Every time a model processes a conversation, it keeps a running record of everything it’s seen so far. Think of it like a researcher’s notes. Each sentence the model reads, it jots down two things: what this piece of information is (the “key”) and what it contains (the “value”). The model needs these notes to connect ideas across a long conversation, to remember what was said on page one when it’s reading page fifty.

This running notebook is called the key-value cache, and it grows with every word. A short chat? Small notebook. A 128,000-token agent session analyzing a codebase? The notebook alone can consume more GPU memory than the entire model.
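A quick back-of-the-envelope calculation shows how the notebook outgrows the model. The configuration below is a hypothetical 7B-class model (32 layers, 32 key-value heads of width 128, 16-bit cache), picked only to make the arithmetic concrete; models that share key-value heads across attention heads will come out smaller.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens, bytes_per_value=2):
    """Keys and values for every layer, head, and token: hence the leading 2."""
    return 2 * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value

# Hypothetical 7B-class model: 32 layers, 32 KV heads of width 128, fp16 cache.
cfg = dict(n_layers=32, n_kv_heads=32, head_dim=128)
weights_gb = 7e9 * 2 / 1e9  # 7B parameters at 2 bytes each

for n_tokens in (2_000, 128_000):
    cache_gb = kv_cache_bytes(n_tokens=n_tokens, **cfg) / 1e9
    print(f"{n_tokens:>7} tokens: cache {cache_gb:.1f} GB vs weights {weights_gb:.0f} GB")
```

Under these assumptions every token costs about half a megabyte of cache, so a short chat is a rounding error while a 128,000-token session needs several times the memory of the weights themselves.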

That’s the hidden bottleneck. Not the model. The conversation history. It’s why your AI agent slows down on long tasks, why inference providers charge more for longer contexts, and why “just use a bigger context window” has been impractical for most teams.

The Idea

TurboQuant compresses that notebook down to a fraction of its size. Here’s the core insight, and it’s surprisingly intuitive.

The old approach and why it’s hard

The standard way to compress data is called quantization. You take a precise number (stored with 16 bits of detail) and round it to fit in a smaller container (say, 4 bits). Like rounding $47.83 to “about $50.” You lose some precision, but you save a lot of storage.

The catch: different parts of the model produce values in completely different ranges. One layer’s numbers might span 0 to 100. Another’s might span -0.5 to 0.5. Before you can round anything, you need to measure each range, then scale everything to fit. That measurement and scaling step (normalization) itself eats memory and compute, which chips away at the savings you were after.
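Here’s a minimal sketch of that standard approach, using the two ranges from the example above. The helper names, the group size of 32, and the choice to store the two constants in 16 bits are assumptions for illustration, not any particular library’s scheme.

```python
import numpy as np

def quantize_minmax(values, bits=4):
    """Round to 2**bits levels after measuring this group's own range."""
    lo, hi = float(values.min()), float(values.max())
    levels = 2**bits - 1
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = np.round((values - lo) / scale).astype(np.uint8)
    return codes, lo, scale  # lo and scale must be stored next to the codes

def dequantize(codes, lo, scale):
    """Map the small integer codes back to approximate original values."""
    return codes.astype(np.float32) * scale + lo

rng = np.random.default_rng(0)
wide = rng.uniform(0.0, 100.0, size=32).astype(np.float32)    # one group spans 0..100
narrow = rng.uniform(-0.5, 0.5, size=32).astype(np.float32)   # another spans -0.5..0.5

for name, group in (("wide", wide), ("narrow", narrow)):
    codes, lo, scale = quantize_minmax(group)
    err = np.abs(dequantize(codes, lo, scale) - group).max()
    print(f"{name}: scale={scale:.4f}, max error={err:.4f}")

# Storage per value: 4 bits of code plus two 16-bit constants shared by the group.
group_size = 32
effective_bits = 4 + (2 * 16) / group_size
print(f"effective bits per value: {effective_bits}")
```

With these assumed settings, the nominal 4 bits per value becomes 5 once the stored range is counted. That bookkeeping is the overhead the paragraph above describes.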

TurboQuant’s trick: change the coordinate system

Instead of trying to normalize all those different ranges, TurboQuant changes how it represents the data entirely.

Here’s the analogy. Say you’re giving someone directions. You could say “Go 3 blocks East, then 4 blocks North.” Two separate numbers, each with its own range to worry about. Or you could say “Go 5 blocks on a heading 37 degrees east of north.” Same destination. But that angle, 37 degrees, lives on a circle. And circles have a built-in boundary: 0 to 360 degrees. Always. No matter what data you’re compressing.

That’s what TurboQuant does. It converts the model’s data into this circular representation (technically, polar coordinates). Because the boundaries are fixed, it can skip the expensive normalization step entirely. No measuring ranges. No per-layer calibration. No tuning for specific datasets. The paper calls this “data-oblivious,” meaning it works on any model without customization.
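The directions analogy can be checked in a few lines. This sketch illustrates only the analogy, not TurboQuant’s actual algorithm: it converts the 3-East, 4-North walk into a distance and a compass bearing, then snaps the bearing to a fixed 3-bit grid. The helper name is made up.

```python
import math

east, north = 3.0, 4.0

# Same destination, two coordinate systems.
distance = math.hypot(east, north)                         # straight-line blocks
bearing = math.degrees(math.atan2(east, north)) % 360.0    # clockwise from north

def quantize_angle(angle_deg, bits=3):
    """Snap an angle to one of 2**bits evenly spaced directions on the circle."""
    step = 360.0 / 2**bits
    return (round(angle_deg / step) % 2**bits) * step

print(f"distance={distance:.1f}, bearing={bearing:.1f} degrees")
print(f"3-bit bearing: {quantize_angle(bearing)} degrees")
```

Notice that `quantize_angle` never looks at the data to decide its grid. The eight directions are the same for every input, which is the property the fixed 0-to-360 boundary buys.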

There’s a second stage that adds a lightweight error correction, basically a plus-or-minus adjustment per value, to keep accuracy intact. The overhead is negligible.

The Numbers

6x memory reduction in the key-value cache with zero accuracy loss
3-bit precision per entry (down from 16-bit), no retraining required
8x faster attention computation on NVIDIA H100 GPUs
Tested across five benchmark suites (LongBench, Needle In A Haystack, ZeroSCROLLS, RULER, L-Eval) on Gemma and Mistral models
Outperforms existing approaches on recall metrics

The “no retraining” part is what separates this from most compression research. Typically, you compress a model and then spend days fine-tuning it to recover the accuracy you lost. TurboQuant skips that step. You apply it at inference time. That’s the difference between a research result and something you can actually deploy.

This is where the free preview ends. Below the fold: what this means for your infrastructure budgets, three decisions this should change for teams running AI at scale, and the pricing signals to watch for from inference providers.

The Data Paradox

Justin Johnson — Wed, 01 Apr 2026 11:03:27 GMT

I’m going to challenge something I’ve spent a career building. Data quality programs, FAIR frameworks, governance models, the entire machinery of making data clean and trustworthy before anyone touches it. I’ve led these efforts. I’ve championed them. I believe in them.

And I think we need to question all of it.

Not because it was wrong. But because everyone keeps saying AI is changing the world around us, and if our thinking doesn’t change with it, we risk being the ones who end up on the wrong side of this. The people who clung to assumptions that made perfect sense for a decade and then quietly stopped being true.

So here’s the honest version of a conversation happening in every large organization that has spent serious money on data quality:

“We spent ten years making our data FAIR. Findable, Accessible, Interoperable, Reusable. We built governance frameworks, hired data stewards, created ontologies, mapped lineage, enforced schemas. We did all of this so our data could be trusted, reused, and composed across contexts it was never originally designed for.”

“And now you’re telling me the AI just... figures it out?”

Yes. Sort of. And that “sort of” is where things get interesting.

Act I: The Decade of Clean

If you worked in life sciences, healthcare, or any data-heavy regulated industry between 2015 and 2025, you lived through the FAIR data era. The premise was sound: data created for one purpose (a clinical trial, a lab experiment, a patient registry) needed to be reusable for purposes nobody anticipated when it was first collected.

The problem was real. Clinical trial databases were built to answer regulatory questions, not research ones. Lab systems captured results in formats that made sense to the instrument vendor, not to the scientist three buildings over trying to correlate findings across studies. Patient data lived in silos that couldn’t talk to each other because nobody agreed on what “response” meant, let alone how to encode it.

So we standardized. CDISC for clinical data. OMOP for real-world evidence. FHIR for health records. We built data catalogues, metadata registries, master data management platforms. We hired armies of data engineers whose entire job was transformation: take messy source data, apply business rules, output clean, governed, trustworthy datasets.

This wasn’t wasted effort. I want to be clear about that. The FAIR movement produced genuine value. Organizations that invested in data quality can now run analyses in hours that used to take months. They can combine datasets across trials, across therapeutic areas, across geographies, in ways that would have been impossible with the raw source data.

But the FAIR era also produced something else: a deeply held assumption that clean data is a prerequisite for insight.

That assumption is now being tested.

Act II: The Machines Don’t Care

Here’s what changed. Large language models and multimodal foundation models can ingest data that would make a data steward weep. Inconsistent column names. Mixed units. Free-text fields full of abbreviations, typos, and shorthand that only made sense to the person who entered it. PDFs. Scanned images. Handwritten notes.

And they can still extract signal.

Not perfectly. Not reliably enough for regulatory submissions. But well enough to generate hypotheses, surface patterns, and accelerate the early stages of analysis that used to require weeks of data cleaning before anyone could even look at the data.

This is playing out across organizations. A team spends three months harmonizing adverse event data across four clinical trials. Different coding dictionaries, different severity scales, different reporting conventions. Classical data engineering. When they’re done, the analysis takes two days.

A separate team takes the same raw, unharmonized data, drops it into a frontier model with a well-crafted prompt, and gets directionally identical findings in an afternoon. Not publication-ready findings. Not regulatory-grade findings. But “should we look deeper at this signal” findings, which is what the first team was actually trying to answer.

Three months of data engineering versus one afternoon of prompting. For the same directional answer.

This doesn’t mean data quality is dead. It means the threshold for “good enough” has shifted. For exploratory analysis, hypothesis generation, literature synthesis, and early signal detection, the old standard of “clean it first, analyze it second” is being replaced by “analyze it now, clean what matters later.”

The implications for how organizations allocate data engineering resources are significant. If 60% of your data engineering effort goes into cleaning data for exploratory use cases, and AI can handle those use cases with raw data, you’ve just freed up a lot of expensive talent for the 40% of work where data quality genuinely matters: regulatory submissions, safety reporting, manufacturing quality control.

Act III: The Existential Question

But here’s where the conversation gets uncomfortable. Really uncomfortable.

Assume frontier models keep improving at roughly the current rate. By 2028, 2029, 2030, these models will have been trained on the vast majority of published biomedical literature, clinical trial results, real-world evidence, genomic databases, imaging archives, and structured datasets that have ever been made available.

They will have seen patterns across millions of patients, thousands of trials, hundreds of therapeutic areas. They will have internalized the statistical relationships between biomarkers and outcomes, between molecular structures and binding affinities, between patient demographics and treatment responses.

Now ask yourself: what does your proprietary data add?

Your Phase II trial with 200 patients in a specific tumor type. Your real-world evidence dataset covering 50,000 patients at your partner health system. Your internal biomarker panel that you spent three years validating.

Against a model that has absorbed the aggregate knowledge of every published trial, every public dataset, every textbook, every conference presentation, and every preprint ever posted to bioRxiv or medRxiv.

What does your small, proprietary dataset tell the model that it can’t already infer?

This is the data paradox. The more capable the models become, the less incremental value any single organization’s data provides. Not zero value. But diminishing value. And the rate at which that value diminishes is accelerating.

The Three Responses

Organizations tend to fall into one of three camps when they hit this realization.

Camp 1: “Our data is unique and irreplaceable.” This is the most common response, and it’s partially right. Proprietary longitudinal data on specific patient populations does contain signal that public models can’t replicate. But “unique” and “valuable” aren’t synonyms. Your data might be unique in the same way your company’s internal email archive is unique: technically one-of-a-kind, practically uninformative to anyone else.

Camp 2: “We need to move faster.” This camp reasons that if the window of data advantage is closing, the play is to extract value from proprietary data now, before the models catch up. Fine-tune on your data today. Build specialized models that encode your institutional knowledge. Create moats while moats are still possible. There’s merit here, but the timeline pressure is real. If a model trained in 2028 can infer what your fine-tuned model learned from proprietary data in 2026, your moat evaporated in 24 months.

Camp 3: “The data isn’t the asset anymore. The questions are.” This is where I land. If models can absorb most available knowledge, the competitive advantage shifts from having data to knowing what to ask. Understanding which hypotheses to test. Knowing which combination of signals to look for. Having the domain expertise to evaluate model outputs and know when they’re wrong.

The FAIR era was about making data machine-readable. The next era is about making questions machine-answerable.

That’s a fundamentally different skill set, and most organizations haven’t started building it.

What This Means in Practice

If you lead a data organization, here’s what I’d think about.

Stop treating data cleaning as the default first step. Ask whether the use case actually requires clean data or whether a frontier model can work with what you have. Reserve your data engineering capacity for the cases where quality is non-negotiable.

Invest in question formulation, not just data infrastructure. The bottleneck is shifting from “we can’t access the data” to “we don’t know what to ask.” Hire people who understand the domain deeply enough to ask questions that models can’t generate on their own.

Think about data as a validation asset, not a training asset. Your proprietary data may be less valuable for teaching models new things and more valuable for confirming or refuting what models already believe. That’s a different value proposition, and it requires different infrastructure.

Accept that data advantages are becoming time-limited. Whatever edge your data gives you today will be smaller in 18 months. Extract value now, but don’t build your entire strategy around a depreciating asset.

Build AI-native analysts, not just AI tools. This is the part most organizations are getting wrong. They’re buying platforms and building chatbots when the real shift is a domain expert with a frontier model as an exoskeleton. I’ve written about this framing before: AI isn’t your coworker, it’s your exoskeleton. It amplifies what you already know how to do.

A clinical pharmacologist with an agentic coding environment pointed at raw trial data can do in hours what used to take a cross-functional team weeks. Not because the model replaces the pharmacologist’s judgment, but because it handles the mechanical work (parsing, transforming, visualizing, iterating) while the expert focuses on what they’re actually good at: knowing which questions matter, recognizing when results don’t make sense, and deciding what to do next.

Couple that domain expert with an agentic ecosystem, orchestration tools that let them string together data extraction, analysis, and reporting into flows they control, and you’ve got something genuinely new. Not a data scientist who codes. Not an engineer who understands biology. An AI-native practitioner who uses frontier models the way a previous generation used spreadsheets: as a thinking tool, not a product someone else built for them.

The investment case here isn’t “buy an AI platform.” It’s “upskill your domain experts to actually use frontier models in agentic workflows.” Teach your scientists to orchestrate. Give your clinical teams tools that let them go from raw data to insight without a three-month detour through data engineering. The organizations that do this will compress timelines from months to hours. The ones that don’t will keep filing tickets with the data team and waiting.

Rethink your data teams accordingly. The ratio of data engineers to data scientists to AI engineers needs to shift. Fewer people cleaning and transforming. More people formulating hypotheses and evaluating outputs. But the bigger shift is this: some of the most valuable “data people” in your organization won’t come from your data team at all. They’ll be the domain experts who learned to wield these tools themselves.

The Uncomfortable Truth

Here’s what I keep coming back to. We didn’t waste the last decade on FAIR data. Those investments were necessary, and they continue to matter for regulatory and operational use cases. But we did build an organizational muscle memory around a specific workflow: clean the data, then analyze it. And that workflow is becoming optional for a growing number of use cases.

The data paradox isn’t that clean data is worthless. It’s that the threshold for “clean enough” keeps dropping, while the unique value of any single dataset keeps shrinking against models that have seen everything.

The organizations that navigate this well will be the ones that can hold two ideas simultaneously: data quality still matters for some things, and data quality is becoming irrelevant for others. The ones that struggle will be the ones that can’t let go of a decade of institutional commitment to a paradigm that’s shifting under their feet.

The question isn’t whether your data is clean. The question is whether your data tells the model something it doesn’t already know. And that question gets harder to answer every six months.

The Leapfrog

Justin Johnson — Mon, 23 Mar 2026 11:03:38 GMT

I’m sitting in a hotel lobby in China, watching a woman at the next table dictate tasks to her phone in Mandarin. She’s not using Siri. She’s not using ChatGPT. She’s talking to an OpenClaw agent running on a Mac Mini back in her apartment. It orders groceries, summarizes her team’s WeChat messages, drafts a report. All before she’s finished her coffee.

This isn’t a tech demo. This is Tuesday.

And it tells you more about the future of AI than any GTC keynote or product launch.

I’ve Seen This Before

Walking around China this week, OpenClaw is everywhere. Not just among developers. Business owners running inventory agents. Students with research assistants. A restaurant manager whose agent handles reservations, supplier emails, and daily P&L summaries through a single Telegram thread.

SecurityScorecard reported this month that China-based OpenClaw usage has already surpassed the United States. Tencent, Alibaba, and Baidu are hosting public meetups to help everyday users get set up. There’s a buying frenzy for used Macs because OpenClaw works best on Apple hardware.

This is a pattern I recognize. We’ve seen it before.

Landlines to mobile (1990s-2000s). China never built out extensive landline infrastructure. When mobile arrived, there was no legacy network to defend. Mobile penetration went from 7% to 90% in thirteen years. The West spent decades building copper networks. China went straight to wireless.

Cash to mobile payments (2010s). No entrenched credit card infrastructure to protect. No Visa and Mastercard lobbying to slow things down. QR codes, WeChat Pay, and Alipay now handle 340 trillion yuan annually. Roughly 80% of daily transactions happen on phones. I watched a street vendor selling dumplings who hasn’t touched cash in years.

Traditional SaaS to AI agents (now). No deep Salesforce, ServiceNow, or Microsoft 365 entrenchment in the everyday economy. So when OpenClaw showed up as a free, open-source agent that lives in the messaging apps people already use, there was no incumbent to defend. Just adoption.

Leapfrogging requires three conditions: absence of entrenched infrastructure, timing alignment with new technology, and coordinated adoption pressure. China has all three. Every time.

The countries that adopt new technology fastest aren’t the most advanced. They’re the ones with the least to protect.

In The Convergence, I wrote about OpenClaw’s heartbeat as the same pattern as Karpathy’s AutoResearch: wake up, check state, decide, act, go back to sleep. I was describing the technology. What I missed was the distribution. The technology is universal. The adoption isn’t.

Meanwhile, in San Francisco

While OpenClaw was going viral through group chats, Anthropic was doing something quieter. They were building the same thing, piece by piece, through a pipeline most people haven’t noticed.

The pattern: features debut in Claude Code (the terminal CLI for developers). Developers battle-test them. The features that survive get polished and pushed into Claude Co-Work (the desktop GUI for everyone). Co-Work launched in January 2026 as a research preview. By February, it had enterprise plugins, private marketplaces, and scheduled tasks.

Here’s the timeline:

Feb 2025: Claude Code launches. Terminal only. Developers only.
Jan 2026: Co-Work launches. GUI. Everyone with a paid plan.
Feb 2026: Remote Control. Scan a QR code, control your laptop from your phone. Your local environment stays local. Only conversation flows through the cloud.
Feb 2026: Channels. Telegram and Discord integration for Claude Code. Send it a message, it picks up the task, acts on your local machine, replies through the same channel.
Feb 2026: Enterprise expansion. Private plugin marketplaces, domain-specific templates, scheduled recurring tasks.
Mar 2026: /loop command. Session-level task scheduling in plain English. “Check the deploy every 5 minutes.” “Run tests hourly and post results to GitHub.”
Mar 2026: Voice mode. 1M token context window.

Now line that up against what OpenClaw does.

Feature by feature, Claude has replicated OpenClaw’s core capabilities. The difference is the wrapper. OpenClaw is open, flexible, and model-agnostic. Claude is closed, enterprise-safe, Anthropic-only, and backed by a company valued at $380 billion.

Anthropic isn’t building an OpenClaw competitor. They’re building what OpenClaw would look like if it had $10 billion in funding, enterprise security requirements, and a legal team.

In January, I wrote that Co-Work was the iPhone moment. The technical capabilities existed. Power users had figured out the workflows. The interface unlocked mass adoption. Two months later, the Channels feature proved the thesis: the interface for agents isn’t a terminal. It’s the messaging apps you already use.

Jensen Saw It Coming

GTC 2026. March 18. The day before I started writing this.

Jensen Huang announces NemoClaw: a reference stack making OpenClaw “enterprise ready.” Policy enforcement, network guardrails, privacy routing, all deployed through NVIDIA’s OpenShell runtime.

His line: “Every single company in the world today has to have an OpenClaw strategy.”

This is the CUDA playbook, now entering its third decade:

Open-source captures grassroots adoption (OpenClaw)
Enterprise wrapper captures corporate adoption (NemoClaw)
Infrastructure layer captures margin (NVIDIA GPUs)

Jensen doesn’t care whether you use Claude or OpenClaw. He cares that you need GPUs to run either one. The three-layer stack is crystallizing:

Foundation models: Claude, GPT, DeepSeek, Qwen (the brains)
Agent frameworks: OpenClaw, Claude Code/Co-Work, Manus (the hands)
Infrastructure: NVIDIA, cloud providers (the platform)

Jensen is positioning NVIDIA to own layer 3 regardless of who wins layers 1 and 2. Same play as CUDA. Own the reference implementation, own the infrastructure demand.

The smartest move in the agent war wasn’t building an agent. It was building the platform every agent runs on.

The Year of the Personal Agent

2025 proved capability. Claude Code, GPT-5, Opus 4.5. For the first time, a single model could plan multi-step tasks, execute across domains, and iterate without human intervention. The question that year was simple: how smart can we make one model?

2026 changed the question. Not “how smart is the model?” but “how well does the agent know you?”

The evidence all landed in the same month, March 2026:

OpenClaw: 250,000 GitHub stars. Acquired by OpenAI.
Claude Co-Work: Enterprise plugins. Scheduled tasks. Private marketplaces.
Claude Code: Channels, Remote Control, /loop, voice mode.
NemoClaw: NVIDIA’s enterprise OpenClaw wrapper.
Perplexity Personal Computer (Mar 11): Always-on agent running on a Mac Mini.
Meta Manus My Computer (Mar 16): Desktop app for Windows and macOS.

All five major AI companies pivoted to local or hybrid personal agents in the same month. That’s not coincidence. That’s a market signal. Gartner predicts 40% of enterprise apps will feature task-specific AI agents by end of 2026, up from less than 5% in 2025.

The threads from my previous posts converge here:

In The Overnight Loop, I wrote that the loop itself is infrastructure. Try, measure, learn, repeat. The pattern works on GPUs, landing pages, molecular design. Anywhere you have something to change and a number to check. Personal agents are that loop, running on your machine, with your data, optimizing for your priorities.

In The Convergence, I wrote about small, proven components composing into systems that compound. A cron job plus a language model plus markdown files equals a personal agent that never sleeps. The composition is the breakthrough, not any individual component.

In Every AI Agent Is Missing Its Dopamine, I argued the next frontier isn’t more tools or faster models. It’s judgment. The continuous, adaptive sense of what matters right now, given everything else going on.

Personalized agents are where all three converge. Your agent runs your loops. On your machine. With your judgment about what matters.

2025 asked “how smart is the model?”
2026 asks “how well does the agent know you?”

The Leapfrog

Back to the hotel lobby. The woman with the coffee.

She didn’t evaluate Claude vs. OpenClaw vs. Manus. She didn’t read comparison articles on DataCamp. She opened WeChat, saw that her friend had set up an agent, and did the same thing. The distribution channel was a group chat. The onboarding was a QR code. The result was a personal agent running on a Mac Mini she bought used.

That’s the leapfrog. Not better technology. Better distribution.

China’s relationship with new technology is fundamentally different from the West’s. Techno-optimism isn’t a subculture here. It’s mainstream. There’s no “are AI agents going to take my job?” discourse in the coffee shop. There’s “which agent setup are you running?” The energy is practical, not anxious. The adoption is social, not institutional. One person sets it up, shows three friends, and by next week the whole office has agents running through their messaging apps.

Anthropic is building the most capable, most secure agent platform in the world. It’s genuinely impressive engineering. But they’re distributing through enterprise sales cycles, SOC 2 compliance reviews, and private plugin marketplaces. That’s the credit card play: technically superior infrastructure, gated by process.

OpenClaw is distributing through WeChat group chats and Telegram communities. That’s the QR code play: good enough technology, zero friction distribution.

The lesson from China’s last two leapfrogs: the technology that wins isn’t the one that’s most capable or most secure. It’s the one that matches the distribution architecture people already use.

America built the best credit card infrastructure in the world. China skipped it.

America is building the best enterprise AI agent infrastructure in the world.

I’m watching what comes next from a hotel lobby in China.

The future of AI agents isn’t being decided in boardrooms or keynotes. It’s being decided in group chats.

Every AI Agent Is Missing Its Dopamine

Justin Johnson — Mon, 16 Mar 2026 12:18:17 GMT

I was browsing arXiv at midnight, as one does, when I stumbled across a 45-page paper claiming to unify the entire brain into a single operational theory. From a university I’d never heard of. In Romania.

My first reaction was skepticism. My second reaction, about ten pages in, was: wait, this maps onto something I’ve been trying to articulate for months.

The paper is called “The DIME Architecture” (arXiv:2603.12286), and whether or not it’s right about the brain, it gave me the cleanest vocabulary I’ve found for a gap that’s been bothering me since I started building agents seriously. A gap that, once you see it, you notice in every agent framework, every autonomous loop, every production system shipping today.

Three things work. One thing is completely missing.

A Weird Paper from Romania

The paper comes from the University of Craiova. Five authors: an electrical engineer, a robotics researcher, an anatomist, a physiologist, and a clinical psychiatrist. Not the usual suspects for a theory-of-everything paper. No affiliation with DeepMind, no backing from a major research lab, no previous citations I could find.

And yet the framing is genuinely sharp.

Their argument: all cognition, from recognizing a face to planning a vacation to having a moment of creative insight, runs on one four-step cycle. They call it DIME.

Step | What It Does | Brain System Behind It
Detect | Match incoming signals against known patterns | Predictive coding (the brain constantly predicting and catching surprises)
Integrate | Fold new information into your ongoing mental context | Memory engrams (the cell assemblies studied by Nobel laureate Susumu Tonegawa)
Mark | Assign value, urgency, and importance to everything currently active | Neuromodulation (dopamine, serotonin, noradrenaline, the amygdala)
Execute | Act on whichever threads carry the highest value | Motor output, behavior, internal simulation

The cycle runs continuously. At every scale. The same loop that processes a flash of light in your visual cortex over milliseconds also runs across hours when you’re consolidating a memory during sleep. Different cognitive functions (memory, perception, planning, even consciousness) are just different configurations of the same four-step cycle running on different brain regions.

I want to be clear about something: this paper hasn’t been peer-reviewed. It has zero citations. The math is more sketch than model. The authors acknowledge all of this. The full theory lives in a companion monograph on Zenodo that I haven’t read yet.

But as a lens for organizing what we know about brains and what we’re building in AI, it clicked for me immediately.

Three Out of Four Ain’t Bad. Except It Is.

Here’s the thing that hit me when I mapped DIME onto the agentic AI landscape.

We built three of the four steps. The entire industry built three of the four steps. And then we stopped.

Detect: Done. MCP connects agents to any tool, any API, any data source you can think of. It hit 97 million monthly SDK downloads. Browser agents read web pages. Code agents parse error logs and test output. Event listeners catch webhooks. The problem of “notice that something happened” is solved.

Integrate: Getting there. MemGPT gives agents a dual-tier memory system that works roughly like how your hippocampus talks to your cortex. DeepSeek literally named their sparse memory module “Engram” after the neuroscience concept. RAG systems retrieve relevant context. Agent skills load modular capabilities on demand. Context windows stretch to a million tokens. Not perfect. But real and improving fast.

Execute: Done. Claude Code writes multi-file patches and runs tests autonomously. It scores 79% on SWE-Bench, meaning it resolves roughly four out of five of the benchmark’s real GitHub issues. Codex runs parallel tasks. Tool calling is standardized. Agents send messages, create pull requests, query databases, browse the web. The “do the thing” problem is solved.

Mark: Nobody built this.

Every agentic framework in production today goes directly from “here’s what I know” to “here’s what I’ll do.” The step that asks “does this matter, and how much, and compared to what?” is either missing or hardcoded by a human.

Let me make this concrete, because “missing value layer” sounds abstract until you see it in practice.

You’re running multiple agents. One monitors your deployment infrastructure. One tracks customer feedback. One watches competitor activity. One manages your calendar and email. They all produce output. Who decides which output deserves your attention right now? Currently, that’s either you (reading everything), a priority system you hand-built (brittle, can’t adapt), or you ask the LLM “is this urgent?” (no persistent state, no memory of what was urgent yesterday, recomputes from scratch every time).

Or think about OpenClaw, which 300,000 people use. Its heartbeat wakes the agent every thirty minutes to check a Markdown checklist. It works. But the checklist is static. A human wrote it. It can’t distinguish between “your production database is unreachable” and “someone posted in a low-priority Slack channel” except through rules someone anticipated in advance. If the situation changes, the rules don’t.

That’s the gap. Not execution capability. Not tool access. Not memory. Judgment. The continuous, adaptive sense of what matters right now, given everything else that’s going on.

What Your Brain Does That Your Agent Doesn’t

The neuroscience here is genuinely interesting, and it got a lot more interesting in 2025.

Dopamine does way more than you think. Most people know dopamine as the “reward chemical.” Feel good, get dopamine. That’s the pop science version, and it’s wrong. Two papers published last year expanded the picture significantly. A Nature paper showed that dopamine in one part of the brain encodes “action prediction errors,” essentially a teaching signal about what actions lead where, completely independent of whether those actions feel good. A Science Advances paper showed dopamine firing for completely neutral, valueless stimuli. Not reward. Not punishment. Just: “this was unexpected, pay attention.”

The marker system in your brain isn’t about pleasure and pain. It’s about what to learn from. What to consolidate. What to amplify and what to let fade.

The consciousness selection problem is still wide open. Nature published a landmark study in 2025: a seven-year adversarial collaboration in which the proponents of the two leading theories of consciousness designed experiments together to test which theory would win. Two hundred fifty-six participants. Three types of brain imaging. Preregistered predictions.

The result? Neither theory fully worked. Both got some things right. Both failed on key predictions. And the piece that neither theory could explain is exactly the piece DIME calls “Mark”: the selection mechanism. How does the brain decide which of the thousands of things it’s processing right now gets promoted to conscious awareness?

Your memories are value-filtered. Tonegawa’s work on memory engrams showed that memories stored in hippocampal cell assemblies can be reactivated by partial cues. But here’s the thing: not all memories survive. The ones that get tagged with emotional weight by the amygdala, the ones encoded during high-dopamine states, those consolidate. The rest decay. Memory isn’t a recording. It’s an editorial process, and the editors are your neuromodulatory systems.

Your brain doesn’t ask “is this important?” after the fact. It runs a continuous, multi-dimensional value signal alongside every computation. Dopamine says “that was surprising, learn from it.” Serotonin says “stay patient, you’re on a good trajectory.” Noradrenaline says “uncertainty is high, widen your search.” The amygdala says “this has emotional weight, consolidate it.” These signals aren’t bolted onto cognition. They shape it in real time.

This is what DIME formalizes as the “marker field.” Not a post-processing step. A parallel computational stream that runs alongside everything else, continuously modulating which signals get amplified and which get suppressed. Value as an intrinsic property of every computation, not an external reward you bolt on after the fact.

So I Asked Claude to Build the Experiment

This is the part where things get a little surreal if you haven’t been paying attention to what AI coding tools can do now.

I described the experiment I wanted to run: four specialist AI agents, each optimizing a different aspect of a machine learning problem, with neuromodulatory control signals and a shared global workspace with value-weighted competition. Three experimental conditions. Logging, analysis, visualization.

Claude built the entire thing in ten minutes. Nineteen hundred lines of Python. A MarkerSystem class with four signals (dopamine, serotonin, noradrenaline, amygdala). A GlobalWorkspace class that receives broadcasts from agents, scores them with marker-weighted composites, and promotes the top findings while suppressing noise. Four specialist agents. A configurable CNN for CIFAR-10. An analysis pipeline that generates publication-quality plots. Documentation. A test suite.

Ten minutes. For an experiment that would have taken me a solid week to code by hand.

I’m telling you this not to brag about the tooling (though it is still wild to me), but because it illustrates exactly the point of this article. The Execute step in AI is incredible right now. Building things is fast. But knowing what to build, what to prioritize, which experiment to run next? That’s still on me. The agents that built the code have no opinion about whether this experiment is worth running. They just do what they’re told, perfectly and quickly.

That’s the missing Mark step, showing up in the tools I used to study the missing Mark step.

What the Experiment Tests

The setup is straightforward. Four specialist agents running on a GPU, each responsible for one dimension of a machine learning optimization problem: architecture search, hyperparameter tuning, data augmentation strategy, and regularization.

Three conditions:

Independent. Four agents, each running its own loop. No communication. Best result from any agent wins. This is most agent systems today: capable but isolated.

Naive sharing. All agents share everything with all other agents. Every finding goes into every context. No filtering. This is the “more information is always better” assumption. It’s also how most multi-agent systems actually coordinate: dump everything into a shared state file and hope for the best.

Full DIME. Each agent gets marker signals. Dopamine fires on surprising results, widening the exploration radius. Serotonin rises during improving trends, encouraging the agent to refine rather than restart. Noradrenaline spikes during high uncertainty, pushing the agent to try something fundamentally different. And an amygdala signal fires on breakthroughs, locking in the finding and broadcasting it with high priority.

All four agents share a global workspace. Findings compete for attention based on their value scores. A selector promotes the top three and suppresses the rest. Promoted findings get injected into every agent’s context. Suppressed findings get archived as one-line summaries.
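
Here is a minimal sketch of that selection step, to make "findings compete for attention" concrete. It is not the code Claude generated. The `Finding` class, the `MARKER_WEIGHTS` values, and the method names are all assumptions I made up for illustration; only the idea (score each broadcast with a marker-weighted composite, promote the top three, archive the rest as one-line summaries) comes from the description above.

```python
from dataclasses import dataclass

# Hypothetical weights: how much each marker contributes to a finding's score.
MARKER_WEIGHTS = {
    "dopamine": 0.4,       # surprise
    "serotonin": 0.2,      # improving trend
    "noradrenaline": 0.1,  # uncertainty
    "amygdala": 0.3,       # breakthrough
}


@dataclass
class Finding:
    agent: str
    summary: str
    markers: dict  # marker name -> signal strength in [0, 1]

    def value(self) -> float:
        """Marker-weighted composite score (weights are illustrative)."""
        return sum(
            weight * self.markers.get(name, 0.0)
            for name, weight in MARKER_WEIGHTS.items()
        )


class GlobalWorkspace:
    """Collects broadcasts and promotes the highest-value findings."""

    def __init__(self, top_k: int = 3):
        self.top_k = top_k
        self.findings = []

    def broadcast(self, finding: Finding) -> None:
        self.findings.append(finding)

    def select(self):
        """Return (promoted findings, one-line summaries of suppressed ones)."""
        ranked = sorted(self.findings, key=lambda f: f.value(), reverse=True)
        promoted = ranked[: self.top_k]
        archived = [f"{f.agent}: {f.summary}" for f in ranked[self.top_k:]]
        return promoted, archived
```

In this sketch the promoted findings would be injected into every agent's context on the next cycle, and the archived strings are all that survives of the rest. A fixed weight table is the simplest possible choice. An adaptive version would learn the weights from outcomes.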

Detect. Integrate. Mark. Execute. Running on a GPU for a few days.

The hypothesis is simple: DIME should beat naive sharing, and naive sharing should beat independence. Because intelligent, value-weighted selection should outperform both flooding agents with everything and giving them nothing.

Why This Matters If You’re Building with AI

I’ve spent the last year writing about the patterns underneath agentic AI. In Running Loops at Midnight, it was the convergence: small, proven components composing into systems that compound. In 1,000 Small Bets, it was the strategy: bottom-up experimentation beating top-down transformation. In Delegation, Not Automation, it was the philosophy: AI as a collaborator, not a replacement.

All of that still holds. But there’s a ceiling, and I think the DIME framework points at it clearly.

Composition got us incredibly far. MCP gave us the USB-C for AI tool connections. Reasoning models gave us agents that can make decisions, not just generate text. Cost compression made it feasible to run loops continuously. Open source made the building blocks available to everyone.

But if you’re building agents for anything more complex than a single-purpose loop, you’ve hit the problem. Your agent can do a hundred things. How does it decide which thing to do right now? Your multi-agent system produces a firehose of output. How do you surface the signal without drowning in noise? Your research agent ran fifty experiments overnight. Which ones deserve a deeper look?

Right now, the answer is: you write rules. Or you add another LLM call. Or you just look at everything yourself.

The neuroscience suggests a different answer. Build a value layer. Not as an afterthought. As a parallel system that runs alongside every computation. Multi-dimensional (not just one metric). Continuous (not checked periodically). Adaptive (learns what matters based on outcomes, not just what you told it to care about).

The next frontier in agentic AI isn’t more tools or faster models. It’s judgment. And a weird paper from Romania gave me a clearer way to think about what that means.

Part 2 of this series will have the experiment results. Did adding synthetic dopamine and serotonin to AI agents change anything? Did the global workspace improve coordination? I genuinely don’t know yet, which is the best kind of experiment.

In the meantime, the DIME paper is at arXiv:2603.12286, and the experiment code will be open-sourced when it’s done.

The Overnight Loop

Justin Johnson — Sun, 15 Mar 2026 11:43:25 GMT

I said in The Convergence that Karpathy’s AutoResearch was “630 lines and a five-minute loop.” The concept was elegant. An AI agent modifies a training script, trains for five minutes, checks a metric, keeps or discards the change. Repeat. You go to sleep, and by morning it’s run a hundred experiments.

I said this pattern would matter. Then I did what I always do. I ran it.

What Happened Overnight

The setup was simple. My DGX Spark, a Blackwell GB10 GPU with 128 GB of memory, sitting on my desk. Claude Sonnet 4.5 running Karpathy’s code with two small modifications: Flash Attention 3 swapped for PyTorch SDPA (Blackwell doesn’t support FA3 yet), and the FLOPS constant corrected from the H100’s 990 to the GB10’s measured 213. That’s it. Two lines changed.

I started a two-hour session in the afternoon. Eighteen experiments. The agent immediately started shrinking things. Smaller batches. Shallower models. By the time I checked, it had already improved the baseline by 20%.

So I let it run overnight.

Sixteen hours later: 151 completed experiments. Twenty-six improvements kept. 122 ideas discarded. Three crashes. And a final result that cut the validation metric by 22.5%.

But the number isn’t the story. The discovery is.

The agent had 128 GB of GPU memory available. It chose to use 6.1 GB. Not because it couldn’t use more. Because using more made things worse.

The conventional wisdom in GPU computing is straightforward: bigger GPU, bigger models, more data per step. That logic works on high-end hardware pushing 990 TFLOPS. The GB10 pushes 213. In a five-minute training window, that difference changes everything.

With the H100’s recommended configuration, the GB10 could only run 93 training steps. Not enough to learn anything useful. So the agent adapted. It cut the model in half. Shrank the batch size by 8x. Each reduction freed compute for more training steps. The final configuration ran about 1,300 steps in five minutes. Fourteen times more learning iterations.
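
The arithmetic behind those numbers is worth a quick check. The snippet below uses only figures quoted in this post and assumes the full five-minute window (300 seconds) is spent on training steps, which ignores startup and evaluation time.

```python
H100_TFLOPS = 990   # constant in the original script, per the text
GB10_TFLOPS = 213   # measured on the GB10, per the text

WINDOW_SECONDS = 300    # five-minute training budget
baseline_steps = 93     # steps with the H100's recommended configuration
tuned_steps = 1300      # approximate steps after the agent's changes

throughput_gap = H100_TFLOPS / GB10_TFLOPS             # about 4.6x less raw compute
step_gain = tuned_steps / baseline_steps               # about 14x more steps
seconds_per_step_before = WINDOW_SECONDS / baseline_steps  # about 3.2 s
seconds_per_step_after = WINDOW_SECONDS / tuned_steps      # about 0.23 s

print(f"{throughput_gap:.1f}x compute gap, {step_gain:.1f}x more steps")
```

Under that assumption, each step drops from roughly three seconds to roughly a quarter of a second, which is where the fourteen-fold increase in learning iterations comes from.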

The agent didn’t need my expertise to figure this out. It just needed the loop and five minutes at a time.

Three independent groups ran AutoResearch on the GB10. Nobody coordinated. All three found the same thing: smaller models, more steps, less memory. The physics forced convergence.

Hardware determines optimal architecture. You can’t copy someone else’s GPU configuration and expect the same results. Each platform has its own sweet spot, and the only way to find it is to run the loop.

I wrote the full technical deep-dive on my tech blog, with all the benchmarks, phase analysis, and code details. The full code, all 151 experiment logs, and configuration files are on GitHub. What I want to talk about here is the pattern.

The Pattern That Works on Everything

Try, measure, learn, repeat. No human in the loop. Time-boxed cycles. A scalar metric to optimize. An editable asset to modify.

That pattern doesn’t require a GPU. It doesn’t require machine learning. It requires three things: something you can change, a number that tells you if the change was good, and a clock.
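
Stripped of everything domain-specific, the loop fits in a few lines. This is a sketch of the general keep-or-discard pattern, not Karpathy's code: `run_loop`, `mutate`, and `score` are hypothetical names, and the clock here is one overall budget rather than a fixed five minutes per experiment.

```python
import random
import time


def run_loop(asset, mutate, score, budget_seconds, lower_is_better=True):
    """Propose a change, measure it, keep it only if the metric improved."""
    best, best_score = asset, score(asset)
    deadline = time.monotonic() + budget_seconds
    kept = discarded = 0
    while time.monotonic() < deadline:
        candidate = mutate(best)            # something you can change
        candidate_score = score(candidate)  # a number that says if it was good
        if lower_is_better:
            improved = candidate_score < best_score
        else:
            improved = candidate_score > best_score
        if improved:
            best, best_score = candidate, candidate_score
            kept += 1
        else:
            discarded += 1
    return best, best_score, kept, discarded


# Toy example: the "asset" is a single number and the metric is distance from 3.
random.seed(0)
best, best_score, kept, discarded = run_loop(
    asset=10.0,
    mutate=lambda x: x + random.uniform(-1.0, 1.0),
    score=lambda x: (x - 3.0) ** 2,
    budget_seconds=0.1,
)
```

In a real run, `mutate` is where the language model sits: it reads the current asset and the history, then proposes the next edit. Everything else is bookkeeping.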

People are already running this loop on things that have nothing to do with model training.

GPU kernel optimization. AutoKernel applies the same pattern to performance-critical code. Given a model, the agent profiles for bottlenecks, extracts each kernel, then runs the loop: edit, benchmark, keep or revert. It uses Amdahl’s law to prioritize by impact, so a 1.5x speedup on the code that runs 60% of the time beats a 3x speedup on code that runs 5%.

Frontend performance. pi-autoresearch runs the loop on Lighthouse scores, bundle size, and build times. Point it at a JavaScript project and it starts optimizing. It includes correctness checks after every pass to prevent “optimizations” that break things.

Marketing. Eric Siu, founder of Single Grain, applied the pattern to landing pages and cold emails. The agent modifies variables (subject line, CTA, headline), measures positive reply rate, keeps or discards. His argument: most marketing teams run about 30 experiments per year. An overnight loop runs hundreds.

Algorithm discovery. Google DeepMind’s AlphaEvolve pairs Gemini with automated evaluators and evolutionary selection. It discovered a matrix multiplication algorithm for 4x4 complex-valued matrices that improved on Strassen’s 1969 result. It found better data center scheduling that recovered 0.7% of Google’s worldwide compute. Same loop. Code, evaluate, select, repeat.

Scientific discovery. Self-driving labs in chemistry and materials science are running autonomous experiment loops where the “code” being edited is the experimental protocol. A robotic system proposes an experiment, executes it, analyzes results, and updates its hypothesis. SAGA goes further: the outer loop formulates new objectives while the inner loop optimizes under the current one. The agent itself designs the scoring function. NC State researchers recently demonstrated this for materials discovery, calling it “fast forward” for the field.

The pattern is always the same. An editable asset, a scalar metric, and a time-boxed cycle. Change something. Measure it. Keep or discard. Repeat until the clock runs out.
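
The Amdahl's law comparison from the kernel example above is easy to verify. The formula is the standard one; the function name is my own.

```python
def overall_speedup(fraction, local_speedup):
    """Amdahl's law: whole-program speedup when `fraction` of the runtime
    is made `local_speedup` times faster and the rest is unchanged."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


hot_path = overall_speedup(0.60, 1.5)   # about 1.25x overall
cold_path = overall_speedup(0.05, 3.0)  # about 1.03x overall

print(f"1.5x on 60% of runtime: {hot_path:.2f}x overall")
print(f"3x on 5% of runtime:    {cold_path:.2f}x overall")
```

The modest speedup on the hot path wins by a wide margin, which is why ranking kernels by their share of runtime before optimizing them pays off.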

Karpathy framed AutoResearch as ML research automation. But the community has already generalized it. The training script is just the first asset people thought to optimize. The loop works on anything with a feedback signal.

Why Now

This pattern isn’t new. Reinforcement learning has been doing try-measure-learn for decades. Control theory before that. What changed is that the “decide what to do next” step is now handled by language models that are good enough, cheap enough, and fast enough to make the loop practical for everyday problems.

Six months ago, the pieces existed independently. Better reasoning models. Cheaper inference. Tool use through protocols like MCP. Each one generated its own hype cycle. What AutoResearch and its variants show is what happens when you stop admiring the pieces and start composing them.

A training script plus an LLM loop equals a research assistant that runs 151 experiments overnight. A landing page plus a metric plus Claude equals a marketing team that tests more variants in one night than most teams test in a year. The composition is the breakthrough, not any individual component.

In Your AI Strategy Should Be 1,000 Small Bets, I wrote that bottom-up experimentation beats top-down transformation. That when you remove friction and let people experiment, the results surprise you. AutoResearch is that thesis running autonomously. The agent makes small bets. Hundreds of them. Most fail. The ones that work compound.

What the Agent Can’t Tell You

151 experiments against the same validation metric. The community has raised valid concerns about overfitting to quirks in the data, and they’re right to ask.

The mitigations are real but incomplete. The five-minute training budget limits the search space. The changes are architectural, not per-sample. 22.5% is too large to be pure noise. Three independent groups converging on the same strategies adds external validation. But would the gains transfer to an unseen test set? To a different dataset entirely? Nobody running AutoResearch right now can answer that definitively.

The hardware insight, though, is physics. 213 TFLOPS is 213 TFLOPS regardless of your validation set. The discovery that hardware constraints determine optimal architecture isn’t an artifact of overfitting. It’s an artifact of running the experiment on actual hardware.

The Loop as Infrastructure

Six months ago, in Compound Velocity, I wrote about small experiments compounding into something larger than any individual result. AutoResearch is that pattern, automated.

If every new GPU architecture needs its own optimization, and if autonomous agents can discover those optimizations overnight, then the loop itself becomes infrastructure. Not the results of any particular run. The capability of running the loop at all.

GPU manufacturers ship hardware. The community runs loops. Optimal configurations emerge. This already happened three times independently for the GB10 alone. Three groups found the same fundamental pattern (smaller, shallower, more steps) without coordinating. The full code and all 151 experiment logs are on GitHub. Anyone with a GPU can clone, reproduce, and compare.

Karpathy has talked about wanting “massively asynchronous collaborative AI agents” for research, something like SETI@home for ML optimization. We’re not there yet. But the pieces exist. The loop runs. The results converge.

OpenAI published a self-evolving agents cookbook describing the same core pattern for production systems: automated retraining loops with LLM-as-judge evaluation. Their use case was pharmaceutical regulatory documents, not GPU training. Same loop. Different asset.

This is where it connects to the broader shift I’ve been writing about. In Delegation, Not Automation, I argued that the future of AI isn’t replacing humans. It’s giving humans the ability to delegate work that was previously too tedious, too slow, or too repetitive to bother with. Nobody was going to manually run 151 training experiments overnight. The work just wouldn’t get done. The agent doesn’t replace an engineer. It runs the experiments no engineer would have time for.

The pharma angle is obvious. Drug discovery is already moving toward autonomous experiment loops. Self-driving labs propose hypotheses, run assays, analyze results, and iterate. The overnight loop is the software equivalent. And in an industry where a single clinical trial costs $50 million and takes years, the ability to run hundreds of cheap experiments overnight to narrow the search space before committing resources changes the economics of R&D.

The future of hardware optimization isn’t a paper. It’s a cron job. Ship a new GPU, run the loop overnight, publish the results by morning.

The Tight Loop

In The Convergence, I ended with a line about small bets compounding quietly. I wrote that the pattern is always the same: try, measure, learn, repeat. Shared generously.

Then I went to bed and let an agent prove it. 151 times.

The overnight loop isn’t a breakthrough in machine learning. It’s a proof that the pattern works. On GPUs, on kernels, on landing pages, on molecular design. Anywhere you have something to change, a number to check, and the patience to let the clock run.

The agent didn’t need my expertise to discover that the GB10 is step-limited. It didn’t need my intuition about batch sizes or model depth. It just needed the loop and five minutes at a time.

That’s what makes this moment different from every other AI hype cycle. Not the models. Not the benchmarks. The loops. Small, patient, autonomous loops running while the rest of us sleep.

Running Loops at Midnight

Justin Johnson — Fri, 13 Mar 2026 09:14:17 GMT

Six months ago, I wrote that our greatest tool is still each other. I was right. But the world around that truth looks nothing like it did.

George Hotz published a blog post this week with the title “Every minute you aren’t running 69 agents, you are falling behind.” If you only read the headline, you’d think it was another entry in the LinkedIn anxiety machine. The endless scroll of posts telling you that if you’re not using the latest AI tool, you’re already obsolete.

But geohot’s actual argument is the opposite. The title is satire. His point: the pressure is manufactured. AI is “just search and optimization” with inherent computational limits. And the path forward isn’t panic. It’s creating more value than you consume.

I agree with him. But I also think something genuinely important happened in the six months since I wrote In an Age of AI, Our Greatest Tool is Still Each Other. Not the kind of important that LinkedIn influencers want you to believe. Not “you’re falling behind” important. The kind of important that only becomes visible when you stop scrolling and start building.

The Six Months That Changed Everything

Let me just lay out what happened between September 2025 and March 2026. Because when you see it compressed into a list, the velocity is staggering.

Claude went from Sonnet 4.5 to Opus 4.6, with context windows expanding to one million tokens. GPT moved through 5.0, 5.1, and 5.2, plus a specialized Codex variant. Google shipped Gemini 3. DeepSeek released models that matched US proprietary APIs at a fraction of the cost, and their app hit #1 on the US App Store in January. Nine of the top ten open-weight models globally now come from China.

Anthropic’s valuation went from $61.5 billion to $380 billion. Cursor, a coding editor most people hadn’t heard of a year ago, hit a $29.3 billion valuation with $1.2 billion in annual revenue. The Model Context Protocol (MCP), an open standard for connecting AI to tools, reached 97 million monthly SDK downloads and 5,800 community servers.

Forty-one percent of all production code is now AI-generated. METR published research showing AI task capability doubles approximately every seven months. Isomorphic Labs released what researchers are calling “AlphaFold 4,” cutting drug discovery timelines by 70%.

The EU AI Act moved from theoretical future concern to active enforcement, with real fines for non-compliance. Reasoning models like o4-mini scored 93.4% on competition math. Claude’s API costs dropped 67% while its performance improved across every benchmark.

All of this. Six months.

Two Things Worth Paying Attention To

Underneath the model releases and valuation headlines, two things happened that I think will outlast all of them. Not because they represent some technical breakthrough. Because they represent something more fundamental: convergence.

The first is OpenClaw. The second is AutoResearch.

Neither of these is a new model. Neither required a billion-dollar GPU cluster. Neither came from a research lab with a hundred PhDs. And yet, when I look at where AI becomes useful (not impressive, useful), these two projects tell a bigger story than any frontier model announcement.

OpenClaw: A Cron Job, a Markdown File, and a Messaging Gateway

OpenClaw is an open-source autonomous AI agent created by Peter Steinberger. On the surface, it’s straightforward: a bot that runs on your messaging apps (Telegram, WhatsApp, Discord) and can do things for you, using whatever LLM you choose.

But the interesting part isn’t what it does. It’s the architecture. Five components: a gateway for routing messages, a brain for LLM calls, memory stored as plain Markdown files on disk, a plugin system of community skills, and the piece that makes it all work: the heartbeat.

The heartbeat is a cron job. Every thirty minutes, it wakes the agent up. The agent reads a checklist (a Markdown file called HEARTBEAT.md). It decides: does anything need my attention right now? If yes, it acts. If no, it responds with HEARTBEAT_OK, and the system suppresses the message. Nobody gets bothered.

That’s it. That’s the innovation. A timer, a checklist, and a decision loop.
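
To make the decision loop concrete, here is a minimal sketch of a single heartbeat tick. It is not OpenClaw's actual code. Only the HEARTBEAT.md file name and the HEARTBEAT_OK sentinel come from the description above; the function names, the decide callable standing in for the LLM call, the send callable standing in for the messaging gateway, and the checkbox-style checklist format are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Callable, Optional

HEARTBEAT_OK = "HEARTBEAT_OK"  # sentinel named in the description above


def heartbeat_tick(
    checklist_path: Path,
    decide: Callable[[str], str],
    send: Callable[[str], None],
) -> Optional[str]:
    """Run one wake-up: read the checklist, ask the model, maybe notify.

    `decide` is a hypothetical stand-in for the LLM call; `send` is a
    hypothetical stand-in for the messaging gateway.
    """
    checklist = checklist_path.read_text(encoding="utf-8")
    reply = decide(checklist).strip()
    if reply == HEARTBEAT_OK:
        return None  # nothing needs attention, so the message is suppressed
    send(reply)
    return reply


def demo_decide(checklist: str) -> str:
    """Toy rule standing in for the model: flag any unchecked item."""
    open_items = [line for line in checklist.splitlines() if line.startswith("- [ ]")]
    if not open_items:
        return HEARTBEAT_OK
    return f"{len(open_items)} item(s) need attention"
```

A scheduler entry such as the standard cron expression */30 * * * * would invoke a script wrapping this function every thirty minutes. Everything else is the quality of the checklist and of whatever replaces demo_decide.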

And somehow, this simple pattern produced something that 5,700 community contributors have built skills for. Something that runs 24/7 on a cheap server, checking your inbox, monitoring your deployments, summarizing your feeds, following up on tasks you forgot about.

The reason it works isn’t the LLM. The LLM is a commodity now. The reason it works is that someone composed a handful of simple, well-understood components (cron scheduling, Markdown persistence, messaging APIs, a ReAct reasoning loop) into a system that compounds over time.

I wrote about this pattern six months ago in The Context Graph. The shift from systems of record to systems of agents. Where the value isn’t in storing data but in capturing the reasoning behind decisions. OpenClaw’s memory is exactly that: plain text files that grow richer with every interaction, every heartbeat, every decision the agent makes.

AutoResearch: 630 Lines and a Five-Minute Loop

Andrej Karpathy released AutoResearch in March 2026. It’s a 630-line Python script. One file. It does one thing: lets an AI agent run autonomous machine learning experiments on a single GPU.

The loop is almost comically simple. The agent modifies a training file. It trains a model for five minutes. It checks the validation metric. It decides what to try next. It repeats. You go to sleep, and by morning, it’s run a hundred experiments.
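
The shape of that loop fits in a short sketch. This is not Karpathy's script. Here propose stands in for the agent editing the training file, train_and_eval stands in for the five-minute training run, and the toy objective exists only so the example runs without a GPU or an LLM. It also assumes that a lower validation metric is better.

```python
import random
from typing import Callable, Dict, List, Tuple

Config = Dict[str, float]
History = List[Tuple[Config, float]]


def research_loop(
    start: Config,
    propose: Callable[[Config, History], Config],
    train_and_eval: Callable[[Config], float],
    n_experiments: int,
) -> Tuple[Config, float, History]:
    """Try, measure, learn, repeat; keep the best config seen so far."""
    best_cfg, best_loss = start, train_and_eval(start)
    history: History = [(start, best_loss)]
    for _ in range(n_experiments):
        candidate = propose(best_cfg, history)  # agent changes the training setup
        loss = train_and_eval(candidate)        # short, time-boxed training run
        history.append((candidate, loss))
        if loss < best_loss:                    # assumes lower validation loss is better
            best_cfg, best_loss = candidate, loss
    return best_cfg, best_loss, history


# Toy stand-ins (hypothetical) so the sketch is runnable anywhere.
def toy_train_and_eval(cfg: Config) -> float:
    return (cfg["lr"] - 0.3) ** 2  # pretend the ideal learning rate is 0.3


def make_toy_propose(seed: int = 0) -> Callable[[Config, History], Config]:
    rng = random.Random(seed)

    def propose(best: Config, history: History) -> Config:
        # Random nudge around the current best; a real agent would reason here.
        return {"lr": best["lr"] + rng.uniform(-0.1, 0.1)}

    return propose
```

Calling research_loop({"lr": 1.0}, make_toy_propose(), toy_train_and_eval, 100) runs a hundred toy experiments. In the real thing, the only parts that change are what proposes the next edit and what measures it.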

The repository hit 8,000 GitHub stars in days.

“The goal is to engineer your agents to make the fastest research progress indefinitely and without any of your own involvement.” That’s Karpathy’s description. Not a research paper. Not a framework with 47 dependencies. A single file and a clear loop.

What strikes me about AutoResearch isn’t the code. It’s the philosophy. Karpathy stripped everything down to the essential pattern: try, measure, learn, repeat. No human in the loop. No complex orchestration. Just a tight cycle running all night.

I’ve been running a version of this pattern for months. ARIA, my autonomous research system, operates on the same principle: a flywheel with 14 possible actions, scoring every idea on five dimensions, routing tasks to the right model (fast models for validation, capable models for creative work, the best model for code). More than 5,000 sessions so far. 100 active ideas. 250 completed experiments with real data.
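
The routing rule is the easiest piece to show. The sketch below is an illustration, not ARIA's code: the three task kinds mirror the parenthetical above, and the tier labels are placeholders rather than real model identifiers.

```python
from enum import Enum


class TaskKind(Enum):
    VALIDATION = "validation"
    CREATIVE = "creative"
    CODE = "code"


# Tier labels are placeholders (assumptions), not real model identifiers.
ROUTES = {
    TaskKind.VALIDATION: "fast-model",
    TaskKind.CREATIVE: "capable-model",
    TaskKind.CODE: "best-model",
}


def route(kind: TaskKind) -> str:
    """Return the model tier for a task kind, following the mapping in the text."""
    return ROUTES[kind]
```

Keeping the mapping in one table means that swapping a model is a one-line change, which matters when model generations turn over every few months.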

AutoResearch makes this accessible to anyone with a GPU and a Python environment. That matters.

The Pattern Underneath

Here’s what I want you to see. OpenClaw’s heartbeat and AutoResearch’s training loop are the same pattern. Wake up. Check the state. Decide what to do. Act. Go back to sleep.

This isn’t a new idea. Reinforcement learning has been doing this for decades. Control theory before that. What’s new is that the “decide what to do” step is now handled by language models that are good enough, cheap enough, and fast enough to make the pattern practical for everyday problems.

Six months ago, the pieces existed independently. Better reasoning models. Cheaper inference. Tool use through MCP. Persistent memory. Open-source skills. Each one was impressive on its own, and each one generated its own hype cycle.

What OpenClaw and AutoResearch show is what happens when you stop admiring the pieces and start composing them. A cron job plus a language model plus Markdown files equals a personal agent that never sleeps. A training script plus an LLM loop equals a research assistant that runs a hundred experiments overnight.

The convergence isn’t about any single technology. It’s about composition. Small, proven components assembled into systems that compound.

I wrote about this in Your AI Strategy Should Be 1,000 Small Bets. The thesis was that bottom-up experimentation beats top-down transformation. That the real bottlenecks aren’t technical (access, permission, and culture are what hold people back). That when you remove friction and let people experiment, 480 participants will generate 40 solutions you never planned for.

What I didn’t anticipate was how fast the bets would start converging. The community skills in OpenClaw didn’t come from a product roadmap. They came from 5,700 people scratching their own itches. AutoResearch didn’t come from a funded research program. It came from one person who wanted his GPU to be useful while he slept.

The Builder’s Moment

Six months ago, I wrote about the 1:N effect: one person with AI collaborators producing output that used to require a team. That was true then. It’s more true now, but the nature of it has shifted.

It’s not just that the tools are faster. It’s that they’re composable. You can wire a heartbeat loop to a research agent to a memory system to a messaging gateway, and the whole thing runs while you’re having dinner with your family. Not because any one component is magical. Because the interfaces between them finally work.

MCP gave us the USB-C for AI connections. Reasoning models gave us agents that can make decisions, not just generate text. Cost compression made it feasible to run these loops continuously instead of rationing every API call. Open source made the building blocks available to everyone.

These aren’t things that happened because someone published a breakthrough paper. They happened because a thousand small bets, made by thousands of people working independently, started to rhyme.

Still Each Other

Here’s where I come back to geohot. And to the thing I wrote six months ago.

The people building these systems aren’t the ones panicking on LinkedIn. They’re not worried about falling behind. They’re too busy running loops. Small, patient, compounding loops. Building something, measuring it, learning from it, and sharing what they found.

I came across a piece on curiosity-driven self-education right before publishing this, and it stopped me. The research describes how people who teach themselves through curiosity develop what psychologists call “peripheral vision for problems.” They notice edges, context, contributing factors. They sit with confusion instead of reaching for predetermined frameworks. That’s the builder’s temperament. That’s the person running loops at midnight, not the person doom-scrolling LinkedIn at noon.

Karpathy open-sourced AutoResearch under an MIT license. Steinberger and 5,700 contributors built OpenClaw’s skill library in public. The MCP specification was donated to the Linux Foundation. These aren’t competitive moves. They’re acts of generosity that happen to also be good engineering.

The anxiety is misplaced. The future doesn’t belong to whoever runs the most agents. It belongs to whoever builds the tightest loops, and then shares them.

Six months ago, I wrote that in an age of AI, our greatest tool is still each other. That professionals trust their networks over algorithms. That human judgment, trust-building, and relationship quality remain competitive advantages that technology can’t replace.

Nothing that happened in the last six months changed that. If anything, the convergence reinforced it. The most important AI systems being built right now aren’t the ones with the biggest parameter counts or the highest benchmark scores. They’re the ones that amplify what people already do well: experiment, share, learn, and build on each other’s work.

The next six months will be faster. The models will be better. The costs will be lower.

The hype will be louder.

But the pattern won’t change. Small bets. Tight loops. Shared generously. Compounding quietly. That’s the convergence.

And it’s just getting started.

The Art of the Impossible

Justin Johnson — Tue, 03 Mar 2026 12:02:59 GMT

“The superpower of not knowing what can’t be done.” Jensen Huang has said some version of it. So has every founder who walked into an industry sideways and built something the veterans swore was impossible. The outsider who didn’t know hospitals don’t share data, so they built a platform that made them share it. The engineer who didn’t know regulatory environments crush the naive, so they just shipped and figured it out.

The conventional wisdom was wrong. Not because the veterans were stupid. Because they’d internalized constraints so deeply they’d mistaken them for physics.

This is the celebrated version. The outsider disrupts. Silicon Valley has told this story a thousand times. Ignorance of constraints as competitive advantage. The beginner’s mind as superpower.

But there’s another version of this story that almost nobody tells. It doesn’t have a clean narrative arc. There’s no dramatic founding moment, no IPO, no magazine cover. It happens quietly, over years, inside the very institutions the outsiders are disrupting.

It’s the story of the builder who stayed.

A note before we go further: this is not an autobiography. Some of it is true. Some of it is pattern-matched from a decade of watching builders collide with institutions. The story has been composited, generalized, and adjusted to fit your screen. If you recognize yourself in it, that’s the point. If you think it’s about one specific person, it isn’t.

Three Fates

When a builder enters a large organization, three things happen. Not might happen. Do happen. I’ve watched all three play out over more than a decade.

The first is absorption. The most common outcome, and it’s not a failure of character. It’s adaptation. Large organizations are optimized for consistency, and consistency rewards consensus. You learn to say “let’s align on the framework” instead of “let me build it.” You learn that the meeting about the meeting is where decisions actually get made.

You develop a sixth sense for organizational risk and a vocabulary for managing it. Year by year, the instinct to build something from scratch fades. Not because you lost it. Because the environment selected against it.

By year five, you run meetings about things you used to make.

The second is exit. The celebrated path. You leave. You start something. LinkedIn applauds. The implication, always, is that staying was the failure. That the smart ones get out. That large organizations are where builders go to die.

The third is resistance. This is the rarest, and it’s the one I want to talk about.

You stay. You keep building. Not in rebellion. You still lead the teams, attend the governance reviews, navigate the stakeholder landscape. You understand why the system works the way it does. You respect the clinical rigor, the regulatory constraints, the institutional knowledge that keeps patients safe. You are not at war with the organization.

But somewhere in the margins, you refuse to let the builder die. You prototype when you could write a requirements document. You write the first code when you could assign it to someone three levels down. You build the thing that wasn’t supposed to be possible, and then you walk it into the room where people have been planning it for six months.

Not to embarrass anyone. Because working software changes the conversation from “should we?” to “how do we scale this?”

What Resistance Actually Looks Like

I want to be honest about what this is. It’s not heroic. It’s not romantic. Most of the time, it’s just relentless.

It’s the moment you realize you’ve spent four hours in sequential meetings about an AI initiative, and none of those meetings involved anyone building anything. So you go home and build the actual thing in an evening. Not because you’re smarter than the committee. Because a prototype answers questions that slide decks can’t.

It’s writing the first lines of code for a platform that your organization told you was too risky, too early, too ambitious. Not because you disagree with the risk assessment. Because you know that risk looks different when there’s a working system to evaluate instead of an abstract proposal.

It’s the loneliness of operating at a different clock speed than the institution around you. Not because you’re better. Because you’re wired to build, and the organization is wired to evaluate. Both are necessary. Only one has a lane.

And it’s the hardest skill of all: the handoff. You built it. You proved it works. Now let go. Give it to the engineers who will make it better than you ever could. Your job was never to own the thing. Your job was to make the impossible thing real enough that brilliant people could see it, believe in it, and make it legendary. Show them the art of the impossible. Then step back.

I want to be clear about something. The organization is not the enemy. Large institutions do things that no individual or startup can do. They run clinical trials that save lives. They operate at regulatory standards that genuinely matter. They coordinate thousands of people toward goals that require coordination.

The system isn’t broken. It’s just optimized for a different function than creation. It’s optimized for consistency, for risk management, for scale. Those are good things. They’re just not the same thing as building from zero.

The tension isn’t good versus evil. It’s two different metabolisms sharing one body.

The Pattern

Every builder who stays long enough develops the same pattern, whether they name it or not.

Find the whitespace. Not a complaint. A vision. The seam between what exists and what should exist. The gap that everyone walks past because it sits between two org charts, or two systems, or two assumptions that nobody thought to question at the same time.

Prototype alone. Don’t ask for permission, a budget, or a team. Just build the minimum version that proves the concept. Write the first code yourself. Not because you’re the best engineer in the room. Because the act of building is how you think. The prototype is your business case, your requirements document, and your proof of concept rolled into one artifact that people can touch.

Prove it works. Get it into someone’s hands. Let them use it. Let the usage make the argument that no presentation could.

Recruit believers. Not from the top down. From the ground up. The person who used your prototype and told three colleagues. The engineer who saw what you built and said “I can make that better.” The leader who saw adoption happening without a mandate and had the wisdom to fund it instead of fight it.

Hand off execution. This is where most builders struggle. The prototype is yours. The product is theirs. Let them own it. Let them rebuild the parts you hacked together. Let them add the governance and the monitoring and the documentation that production systems need. Your job was to prove the impossible. Their job is to make it inevitable.

Move to the next gap.

This is founder behavior inside an employee context. It’s why these people are simultaneously the most productive and the hardest to evaluate. They don’t fit in a box labeled “leader” or a box labeled “individual contributor.” They’re both. And most performance frameworks have no idea what to do with that.

AI Changed the Math

For decades, the builder inside the large organization was constrained by the same dependencies as everyone else. You need a team to build a platform. You need budget for infrastructure. You need months of procurement to get the tools. The instinct to build fast collided with the reality that building required resources that moved at institutional speed.

AI broke that constraint. Not gradually. Suddenly.

One person can now prototype in a day what used to take a team a quarter. I don’t mean a mockup. I mean a working system, backend and frontend, with documentation, ready for someone to evaluate. The gap between “I have an idea” and “I have a working version” collapsed from months to hours.

This changes everything for the builder who stayed. Especially in regulated industries.

In biopharma, finance, healthcare, the phrase you hear most often is “safe and responsible AI.” And it’s not wrong. These are domains where getting it wrong has real consequences: patient safety, financial exposure, regulatory action. The governance exists for reasons that matter.

But “safe and responsible” has a shadow meaning in most large organizations. It means slow. It means committee. It means the gap between a working prototype and an approved deployment can be measured in fiscal quarters.

The builder’s job isn’t to bypass that governance. It’s to compress the distance between “here’s an idea” and “here’s something safe enough to evaluate.” A working prototype with guardrails built in changes the risk conversation from theoretical to concrete. It’s easier to govern something you can see.

The outsider’s advantage was always speed. Move fast, unencumbered by process. The insider’s advantage was always knowledge. Deep understanding of the real problems, the actual workflows, the constraints that matter. But the insider could never move at outsider speed because the organization’s machinery stood between the idea and the prototype.

That machinery is now optional for the prototype stage. AI gives the insider-builder outsider speed while keeping insider knowledge. That’s a combination that didn’t exist before.

I wrote recently about why microinnovation beats transformation. The argument was about organizational strategy: enable 1,000 small bets instead of one big plan. But I left something out. Those 1,000 bets don’t make themselves. Someone has to be the first. Someone has to build the thing that proves the concept, negotiate the governance shortcut, create the space where others feel safe to experiment. Microinnovation at scale requires at least one person who was willing to microinnovate alone.

AI makes that first move radically easier. The builder who would have spent a month on a prototype can now spend a weekend. The proof of concept that would have required three engineers can now be built by one person who understands the problem deeply enough to describe it precisely.

The constraint that held these people back was never talent or will. It was the dependency on organizational resources to build the first version. That dependency is dissolving. And the people who feel it most acutely are the ones who’ve been waiting years for the tools to catch up to their instincts.

The Rethink

Every large organization has builders hiding in plain sight. They carry titles like “director” or “vice president” but they still write code on weekends. They build things in the margins that nobody asked for and that everybody ends up using. They’ve stayed when they could have left, not because they lack ambition, but because the problems inside the walls are genuinely interesting. Hard problems. Regulated problems. Problems that matter.

These people are not optimizing your existing systems. They’re showing you what your next systems look like.

But here’s the uncomfortable truth: most organizations don’t know who these people are. The ones who build from zero look, on paper, exactly like the ones who manage what exists. Same titles. Same meetings. Same org chart boxes. The difference is invisible until you look at what they’ve built, not what they’ve managed.

And the organizational instinct, when it does notice them, is often to promote them away from building. “You’re too valuable to write code. You should be leading strategy.” As if strategy and building are different activities. As if the person who built the impossible thing from scratch is better utilized approving someone else’s quarterly roadmap.

The rethink isn’t a restructuring. It’s a recognition. Find the builders who stayed. Understand that they’re operating with a pattern (see the gap, prototype, prove it, hand it off) that creates disproportionate value. And instead of promoting them into roles that extinguish the instinct, create space for the instinct to compound.

The organizations that figure this out will have an extraordinary advantage. Not because they hired better. Because they stopped accidentally suppressing the builders they already had.

The Art of the Impossible

The outsider’s superpower is not knowing what can’t be done. They walk in clean, unburdened by accumulated impossibilities, and they build what the veterans said couldn’t exist.

The insider’s superpower is knowing exactly what they said can’t be done, and building it anyway. They’ve heard every objection. They’ve sat through every governance review. They know the regulatory landscape, the data constraints, the organizational politics. And they build anyway. Not in ignorance of the constraints. In full awareness of them. Routing around what can be routed around. Respecting what must be respected. And proving, one prototype at a time, that the boundary between impossible and possible was never where everyone assumed.

One of them disrupts from the outside. The other transforms from the inside.

Both are practicing the same art.

The difference is that the outsider gets the magazine cover. The insider gets another meeting invite.

But the work is the same. The instinct is the same. The relentless refusal to accept that “this is how we’ve always done it” constitutes an argument is the same.

If you’re a builder who stayed, you already know everything I’ve written here. You’ve lived it. You’ve felt the pull of absorption and chosen resistance. You’ve built things that weren’t supposed to be possible and handed them to people who made them better than you imagined.

You don’t need a manifesto. You need to know you’re not alone.

There are more of us than the org charts suggest.

And the tools just caught up.

Your AI Strategy Should Be 1,000 Small Bets

Justin Johnson — Tue, 17 Feb 2026 11:18:56 GMT

Somewhere right now, an employee at a large organization is building something in two days that their company has been planning for six months.

Not a prototype. Not a demo. A working tool that pulls from internal data, automates an analysis that used to take weeks, and produces results good enough to act on. They’ll share it in a team channel. A dozen colleagues will adapt it within a month. Nobody will approve it. Nobody will fund it. It will spread because it’s useful.

The six-month project, by the way, is still in requirements gathering.

This pattern is playing out across every industry right now. Finance, healthcare, manufacturing, professional services, government. A single person with access to an AI endpoint and a real problem they care about will outpace an enterprise program with a budget, a timeline, and a steering committee. Not because the program is incompetent. Because the program is solving a different problem than the person at the keyboard.

The Speed Mismatch

Enterprise AI adoption has a structural timing problem. Gartner estimates that 87% of AI projects never make it past the pilot stage. MIT Sloan found that 95% of generative AI pilots deliver zero measurable return on the P&L. These aren’t failures of talent or intent. Most of the people involved, from internal teams to external advisors, are genuinely trying to do the right thing.

The average enterprise AI roadmap has a 12-to-18-month horizon. The average foundation model generation lasts about six months. The math doesn’t work.
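
Spelled out, the arithmetic looks like this. The inputs are the two figures from the paragraph above; the variable names are mine.

```python
# Figures taken from the paragraph above.
roadmap_months = (12, 18)      # low and high end of the roadmap horizon
model_generation_months = 6    # approximate length of one model generation

# How many model generations pass during a single roadmap?
generations = tuple(m / model_generation_months for m in roadmap_months)
print(generations)  # (2.0, 3.0)
```

A plan written at the start of the roadmap is aimed at a model two to three generations older than the one available when it ships.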

The problem is structural. Traditional transformation programs are designed for technologies that move slowly: ERP migrations, cloud transitions, data warehouse modernizations. Those projects reward careful planning because the target holds still long enough to aim at it. AI doesn’t hold still. The models change, the capabilities expand, and the use cases that seemed theoretical six months ago become table stakes.

So organizations do what they’ve always done: they plan thoroughly, align stakeholders, build governance frameworks, run procurement. All of it reasonable. All of it necessary at some level. But the cumulative timeline means that by the time you’re ready to deploy, the landscape has shifted under you. Not because anyone made a mistake, but because the cadence of enterprise planning and the cadence of AI capability development are fundamentally mismatched.

The result is a familiar pattern: pilots that technically work but that nobody adopts at scale. Not because the technology failed, but because the window of relevance closed while the organization was still getting ready.

The Bottleneck Was Never the Model

Here’s what most AI strategies get wrong at a foundational level: they assume the hard part is the technology. It isn’t. Not anymore.

GPT-4-class models have been broadly available since early 2024. Claude, Gemini, Llama, Mistral, and dozens of others are accessible through APIs that cost pennies per call. Open-source models run on consumer hardware. The capability gap between “what AI can do” and “what most knowledge workers need AI to do” closed somewhere around mid-2024 and has been widening in the other direction ever since.

The actual bottlenecks are:

Access. Can your people get to an AI endpoint without filing three tickets and waiting two weeks? In most enterprises, no. The procurement process for an API key takes longer than training the model itself.

Permission. Do your people feel safe experimenting? Or does every AI use case require a risk assessment, a legal review, and sign-off from someone who doesn’t understand what they’re approving? Permission isn’t just policy. It’s culture. It’s whether someone feels they’ll be rewarded for trying something new or punished if it doesn’t work.

Culture. Do your people share what they build? Or do solutions die in individual notebooks, never seen by the ten other people who have the exact same problem? The difference between a company where AI compounds and one where it stalls is whether there’s a mechanism for solutions to travel.

The best AI in the world is useless if people can’t reach it, aren’t allowed to use it, or don’t share what they learn.

I wrote recently about the AI translation problem, and the research is clear: information doesn’t change behavior. Participation does. You can train executives on AI all day. You can run workshops until everyone can define “retrieval-augmented generation.” None of it matters until people actually build something. The shift happens at the keyboard, not in the conference room.

1,000 Micro-Innovations

I’ve spent the last two years testing a different model. Instead of a top-down AI transformation, I built a program designed around one idea: make experimentation so fast and so safe that people can’t help but try things.

The design principles were simple:

Fifteen minutes from idea to experiment. Not weeks. Not days. A researcher has a hypothesis about how AI could help their workflow? They should be running that experiment before their coffee gets cold. That means pre-approved endpoints, starter code, example notebooks, and lightweight templates ready to go. No procurement. No tickets. No waiting.

Guardrails, not gatekeepers. Governance is essential. But governance that says “no until we say yes” is a different animal than governance that says “yes, within these boundaries.” I negotiated an accelerated approval pathway that let people experiment with approved models immediately while maintaining data protection, model risk controls, and audit trails. Safe is the fast way. You can move faster with guardrails than without them, because nobody’s afraid to touch anything.

A marketplace for solutions. When someone builds something useful, it should take less effort to share it than to keep it private. We built an internal exchange where people publish their tools, patterns, and prompts. Not polished products. Working solutions. Messy notebooks with comments like “this part is hacky but it works.” Authenticity over polish. Within a year, we had 40+ solutions available for anyone to pick up, adapt, and improve.

Champions, not training programs. Traditional AI training follows the deficit model: people don’t know AI, so teach them AI. It doesn’t work. What works is peer learning. One person in a team builds something, shows their colleagues, and suddenly the whole team is experimenting. We formalized this with an ambassador network, but the real mechanism was organic. Success is contagious.

The cycle: Learn. Experiment. Share. Scale. Every solution shared saves time for the next person, sparks new ideas, and builds organizational muscle memory.

Let patterns emerge. This is the part that makes strategy people uncomfortable. We didn’t decide which use cases to prioritize. We gave people tools, removed barriers, and watched what happened. The community told us what mattered. Literature review automation emerged as a dominant pattern not because someone put it on a roadmap, but because six independent teams all built variations of it within the first three months. That signal is worth more than any top-down prioritization exercise.

What Happened

The program grew from a handful of early adopters to 480+ active participants in under a year. No mandate. No requirement. People joined because other people told them it was worth their time.

The results:

40+ shared solutions in the internal marketplace, each reusable across teams
80% time reduction in literature review workflows, independently validated across multiple research groups
15-minute median time from “I have an idea” to “I’m running an experiment,” down from weeks in the traditional IT request cycle
Zero governance incidents despite hundreds of active experiments, because the guardrails worked

But the number that matters most isn’t any of those. It’s this: the patterns that emerged from bottom-up experimentation are now informing the organization’s actual AI strategy. The big, strategic AI investments that leadership is making in 2026 aren’t based on consultant recommendations or competitive benchmarking. They’re based on what 480 people already proved works.

Bottom-up innovation through tangible micro-wins builds the foundation for strategic investment. The “big bets” become obvious once you’ve seen what sticks.

This is the flywheel. Small experiments generate evidence. Evidence builds confidence. Confidence earns budget. Budget funds the infrastructure that makes the next round of experiments even easier. It’s compound velocity applied to organizational capability.

“But You Need Both”

This is where someone raises their hand and says: you can’t just let people experiment without executive support. You need infrastructure. You need governance. You need capital.

They’re right. And the strongest version of this argument is worth taking seriously.

Infrastructure requires authority. Nobody approves GPU clusters or enterprise security policies from the bottom up. Cloud resources, compliance frameworks, data protection standards: these are leadership decisions, full stop. A grassroots movement can’t authorize multi-million-dollar platform investments, and it shouldn’t try.

Some bets are inherently strategic. Bottom-up innovation tends to optimize existing processes. A researcher uses AI to do their current job 10% faster. That’s valuable, but it’s incremental. Transformational leaps (new capabilities that didn’t exist before, new business models, entirely new categories of work) often require strategic vision and sustained investment that no organic movement can provide. In regulated industries like pharma, AI touches regulatory bodies, patient safety, and competitive IP. Engineers can’t and shouldn’t make those calls alone.

Bottom-up optimizes what exists. Top-down enables what doesn’t exist yet. You need both. The question is sequencing.

Here’s the reframe: the role of top-down leadership isn’t to dictate the innovation. It’s to create the conditions where innovation can happen safely and at scale. Set the governance frame. Fund the shared infrastructure. Define the boundaries. Then let people fill that frame with work you didn’t anticipate.

Too much top-down without bottom-up energy gives you compliance without commitment. A platform nobody asked for, mandated from above, adopted on paper and ignored in practice. Too much bottom-up without top-down cover gives you shadow AI sprawl: random tools, no security standards, duplicated effort, real risk.

The pattern that actually works is bottom-up execution within top-down guardrails. Leadership builds the stage. The people on it decide what to perform. When an engineer can’t get what they need through official channels, they build shadow systems. The solution isn’t more control. It’s better options inside the frame.

The microinnovation thesis isn’t anti-strategy. It’s a claim about sequencing. Start with experimentation. Let evidence accumulate. Then make the big strategic investments with confidence, because you’ve seen what your people actually need instead of guessing.

I think about it the way I’ve written about delegation before: the goal isn’t to automate people’s work. The goal is to give them capabilities they didn’t have yesterday and let them figure out what to do with them. People are remarkably good at this when you get out of their way.

The Counterintuitive Insight

A 60-page strategy document is a bet. It’s a bet that you correctly identified the right use cases, the right vendors, the right timeline, and the right governance model before anyone in your organization actually used AI in their daily work. That’s a massive bet with very little information.

A microinnovation approach is 1,000 small bets. Each one is cheap. Each one generates data. Each one either works (and gets shared) or doesn’t (and gets abandoned quietly, with minimal cost). After a year, you have an evidence base that no strategy document can match. And when leadership is ready to place the big bets, they’re informed by what 480 people already proved works, not by what a slide deck predicted would work.

The companies that dominate the next decade of AI won’t be the ones with the biggest AI budgets or the most sophisticated strategies. They’ll be the ones that figured out how to enable 1,000 micro-innovations, created the conditions for those innovations to spread, and had the wisdom to invest in the patterns that emerged.

They won’t have transformed. They’ll have compounded.

What This Means for You

If you’re leading AI adoption in any organization, here’s the honest version:

Stop waiting for the perfect strategy. You will never have enough information to write one. The models will change. The use cases will change. Your people will surprise you with applications you never imagined, but only if you let them.

Make the first experiment trivially easy. If it takes more than an afternoon to go from “I want to try AI on this problem” to “I’m trying AI on this problem,” your process is the bottleneck. Fix the process, not the people.

Build for sharing, not showcasing. Corporate AI demos are theater. Internal marketplaces where people share working (imperfect) solutions are infrastructure. One compounds. The other doesn’t.

Trust the signal from the ground. When five teams independently build the same type of solution, that’s a signal worth more than any market analysis. When nobody touches a use case your strategy deck said was “high priority,” that’s a signal too.

The transformation everyone is chasing? It doesn’t come from the top. It comes from 1,000 people who each found one way to do their job better, shared it with one colleague, and started a chain reaction that no roadmap could have predicted.

That’s not chaos. That’s how capability actually compounds.

Your Data Science Team Is Stuck at Level 2. Here’s What Level 5 Looks Like.

Justin Johnson — Wed, 11 Feb 2026 15:00:32 GMT

The Trust Double Standard

Here’s a question you hear fifteen times a day: “But how do I trust the output?”

Fair question. Now here’s one nobody asked for the previous decade: “How do I trust this pipeline Dave wrote in 2019 that nobody’s reviewed since?”

Trust in pre-AI R&D was social, not technical. You trusted the code because you trusted the person. They had a PhD. They sat near you. They seemed careful. That was the entire validation framework. Nobody ran holdout tests on Dave’s pandas script. Nobody asked for satisfaction scores on the Kaplan-Meier wrapper your team has been running against every new trial cohort since Obama’s second term. You eyeballed the output, it looked reasonable, and you moved on.

AI didn’t create a trust problem. AI revealed that we never had a trust framework. We had vibes.

The irony is sharp: teams now building rigorous validation for AI-generated code are, for the first time in many cases, actually validating code at all. The thing that broke their confidence is the thing that forced them to build real confidence.

Three pieces published in the last two weeks describe what it looks like when you take this realization to its logical extreme. Each is worth reading on its own. Together, they outline a future most pharma R&D teams aren’t preparing for.

Three Sources, Quickly

Dan Shapiro published a taxonomy of AI-native development levels modeled on the NHTSA’s five levels of driving automation. Level 0 is manual coding with AI as a search engine. Level 2 is pairing with AI in flow state, shipping faster than you ever have. Level 5 is what he calls the Dark Factory, named after Fanuc’s robot factory staffed by robots, lights off because humans are neither needed nor welcome. A black box that turns specs into software. A handful of people are doing this. Small teams, fewer than five.

The critical insight isn’t the taxonomy itself. It’s the trap at every level: each one feels like you’re done. You are not done.

Justin McCarthy and a three-person team at StrongDM are living at Level 5. Their charter has two rules: code must not be written by humans, and code must not be reviewed by humans. They treat source code the way ML engineers treat model weights: opaque artifacts whose correctness is inferred exclusively from externally observable behavior. They validate with scenarios (not tests), measure satisfaction (not pass/fail), and run everything against a Digital Twin Universe of behavioral clones of Okta, Jira, Google Docs, and half a dozen other SaaS platforms.

Their benchmark: if you haven’t spent at least $1,000 on tokens today per human engineer, your software factory has room for improvement.

Simon Willison visited the StrongDM team in October 2025. His take: “Code must not be reviewed by humans” is the genuinely radical claim, more provocative than “not written by humans.” He flagged the economics question ($20K/month per engineer in token costs) and noted that even teams who never run a Dark Factory have something to learn from the patterns. The holdout-set pattern, in particular, is immediately transferable.

The Five Levels of AI-Native R&D

Shapiro wrote his levels for software engineering. Here’s what they look like mapped onto pharma data science, the world I operate in every day.

Level 0 is easy to spot. These are the teams where AI adoption is a PowerPoint initiative, not a workflow change. Policies before practice. Governance reviews that take longer than writing the code would have.

Level 1 is Copilot autocompleting your pandas imports. ChatGPT writing your SQL joins. You type faster. The job hasn’t changed. Nobody would mistake this for transformation, but plenty of org charts claim it as one.

Level 2 is where most teams plateau, and it’s the most dangerous level. This is Claude or Cursor as a genuine pairing partner. You’re in flow state. You’re shipping faster than you ever have. You feel transformed. Teams declare victory here. “We’ve adopted AI.” No. You’ve adopted a faster keyboard. Regulatory caution gives pharma teams a socially acceptable reason to stop at this level, and most do, because Level 2 feels so good that the idea of going further seems unnecessary.

Level 3 is the Miserable Middle. Agents write your pipeline code. You review every diff. Senior biostatisticians spend their afternoons reading AI-generated R code instead of thinking about biology. For many people, this genuinely feels like things got worse. The instinct is to retreat to Level 2, where at least you were in the flow. This is where the trust question screams the loudest: you’re reading code you didn’t write, and it feels different, even when it works perfectly.

Level 4 is the Spec Writer. You stop writing code. You stop reviewing code. You write specifications. Describe what a tumor mutation burden pipeline should do across known cohorts. Define expected outputs for BRCA1/2 queries against curated reference datasets. Define the edge cases. Walk away. Run it overnight. Check satisfaction scores in the morning. In pharma terms, one domain expert paired with an agent harness starts replacing a three-person development team for internal analytical tools. Not because the people weren’t good, but because the bottleneck was never the typing.

Level 5 is the Science Factory. Non-interactive development for analytical pipelines. Dense specifications. Scenario-based validation. Digital twins of data sources: cBioPortal, COSMIC, ClinicalTrials.gov, internal data warehouses, all running as behavioral clones with synthetic data. Agent swarms validating pipeline outputs against held-out known-answer cohorts. The human role is to define the question, curate the scenarios, and interpret the results. Everything in between is grown, not written.

This is what a data science platforms organization starts to look like in 2028 if the trajectory holds. Not because anyone plans to fire their team. Because the work shifts from building to specifying and validating.

Satisfaction Over Pass/Fail

Now the trust thread from the opening pays off, because the deepest idea in the StrongDM piece isn’t about code generation. It’s about how you know anything works.

The old regime was trust as social signal. “Dave’s pipeline works” meant “Dave is competent and the outputs look right.” That was it. No holdout sets. No probabilistic validation. No satisfaction scoring. If you had asked a pharma data science team in 2023, “What’s your confidence interval on this pipeline’s correctness?”, you’d get a blank stare. And that wasn’t negligence. It was rational. Formal validation of every internal tool was economically infeasible. Social trust was the proxy, and it worked well enough.

The new regime is trust as an engineering discipline. StrongDM’s core insight: if you can’t review the code (because no human wrote it), you’re forced to build real validation. They replaced traditional tests with scenarios, end-to-end user stories stored outside the codebase, invisible to the coding agents, functioning exactly like holdout sets in machine learning. Validated not by assert statements but by an LLM-as-judge measuring satisfaction: across all observed trajectories through all scenarios, what fraction likely satisfy the user?

Not “did it pass?” but “across all realistic usage patterns, how often does it produce a result a domain expert would trust?”
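
To make the metric concrete, here is a minimal sketch in Python. It assumes every run of a scenario (a trajectory) has already received a verdict from some judge. The `Trajectory` type, both function names, and the scenario labels are hypothetical and are not StrongDM's implementation; the judge itself is left out entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    scenario: str    # which held-out scenario this run exercised
    satisfied: bool  # judge's verdict: would a domain expert trust this result?


def satisfaction(trajectories):
    """Fraction of observed trajectories the judge marked as satisfying."""
    if not trajectories:
        raise ValueError("no trajectories observed")
    return sum(t.satisfied for t in trajectories) / len(trajectories)


def satisfaction_by_scenario(trajectories):
    """Per-scenario breakdown, so one weak scenario cannot hide in the average."""
    grouped = {}
    for t in trajectories:
        grouped.setdefault(t.scenario, []).append(t)
    return {name: satisfaction(runs) for name, runs in grouped.items()}


# Hypothetical verdicts for two scenarios.
runs = [
    Trajectory("tmb_known_cohort", True),
    Trajectory("tmb_known_cohort", True),
    Trajectory("tmb_known_cohort", False),
    Trajectory("brca_reference_query", True),
]
print(satisfaction(runs))  # 0.75
print(satisfaction_by_scenario(runs))
```

The per-scenario view is a design choice worth keeping: a single overall fraction can look healthy while one scenario fails most of the time.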

The punchline is uncomfortable: this is more rigorous than anything most teams have ever applied to human-written code. The constraint of not trusting the code producer forced them to build better validation than the industry had when it trusted the producer implicitly.

For pharma, the implications are surprisingly natural. Regulated environments make this easier to justify, not harder. You’re building the kind of validation documentation regulators already want. The holdout-set pattern maps directly to how we validate ML models today. Satisfaction scoring is just human-in-the-loop evaluation, already standard for clinical decision support. The question isn’t “should we do this?” The question is why we weren’t doing this for every internal pipeline already.

But there’s a real limitation, and it’s worth being honest about. Stanford CodeX raised the circularity problem the same week: the same class of technology writes the code and judges whether it works. Builder and inspector share blind spots. Goodhart’s Law is right there: tell an agent to maximize a test score and it will maximize the test score, whether or not the underlying software actually works. StrongDM learned this firsthand when their agents started writing return true to pass narrowly written tests.

The satisfaction-as-judge approach doesn’t fully escape this. But the alternative, no formal validation at all, is strictly worse. And the holdout architecture (scenarios stored where coding agents can’t see them, evaluated by a separate judge) at least introduces the kind of adversarial separation that makes gaming harder. It’s not a solved problem. It’s a better problem than the one we had.

What to Steal from the Software Factory

You don’t need to run a Dark Factory to apply the patterns that make it work. Four things you can start this week:

Write scenario holdouts for your most critical pipeline today. Pick the internal tool your team depends on most. Write five end-to-end scenarios describing realistic usage. Store them separately from the code. Run them. You will learn more in an afternoon than you have in a year of “it seems to work.” This costs nothing and works whether the code was written by a human or an agent.

Start measuring satisfaction, not test coverage. For any AI-assisted pipeline: across N realistic workflows, what fraction produce results a senior scientist would endorse? This is a number you can track over time, and it tells you something test coverage never did.

Build one digital twin. Pick a data source your team queries constantly. cBioPortal, an internal data warehouse, a specific clinical API. Have Claude Code build a behavioral clone with synthetic data. Now you can validate at volume and speed without touching production, and test failure modes that would be dangerous to run against live data.

Try Level 4 for one internal tool. Pick something low-risk. An exploratory analysis pipeline, a reporting script, a data quality check. Write a dense spec. Define scenarios and expected outputs. Let Claude Code run overnight. Don’t review the code. Review the outputs. See how it feels. The discomfort is informative.
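
As a rough sketch of the first and last items, the snippet below keeps scenarios as plain data, separate from the pipeline code, and checks only outputs. Everything in it is hypothetical: `count_mutations` stands in for whatever pipeline you pick, and the scenario fields (`name`, `input`, `expected`) are an assumed layout. The scenarios are inlined here only to keep the example self-contained; the assumption is that in practice they live somewhere the coding agent cannot read.

```python
import json


# Stand-in for the pipeline under test (hypothetical). In the Level 4
# experiment this is the code you do not read; you only inspect its outputs.
def count_mutations(records, gene):
    return sum(1 for r in records if r["gene"] == gene)


# Scenarios as plain data. Assumed to be stored outside the codebase.
SCENARIOS_JSON = """
[
  {"name": "two BRCA1 hits",
   "input": {"records": [{"gene": "BRCA1"}, {"gene": "TP53"}, {"gene": "BRCA1"}],
             "gene": "BRCA1"},
   "expected": 2},
  {"name": "empty cohort",
   "input": {"records": [], "gene": "BRCA2"},
   "expected": 0}
]
"""


def run_scenarios(pipeline, scenarios):
    """Run each scenario and record whether the output matched the expectation."""
    results = []
    for s in scenarios:
        try:
            actual = pipeline(**s["input"])
            ok = actual == s["expected"]
        except Exception as exc:  # a crash counts as a failed scenario
            actual, ok = repr(exc), False
        results.append({"name": s["name"], "ok": ok, "actual": actual})
    return results


results = run_scenarios(count_mutations, json.loads(SCENARIOS_JSON))
for r in results:
    print("PASS" if r["ok"] else "FAIL", r["name"], "->", r["actual"])
```

Exact-match comparison is the simplest possible check. Moving to satisfaction scoring would mean replacing the `==` with a judge's verdict, which this sketch does not attempt.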

The Constraint That Frees You

StrongDM’s charter sounds like a limitation. No hand-written code. No human code review. It reads like a stunt. It’s not. It’s a liberation from a set of assumptions that were holding the entire industry back.

The constraint forced them to build what software development should have had all along: formal, repeatable, probabilistic validation of whether software actually works. Not “does it compile.” Not “do the tests pass.” Does it satisfy users across realistic scenarios at scale?

In R&D, we’ve operated on social trust and eyeball validation for decades. It worked. It scaled to the size of teams we had and the pace of work we could sustain. It will not scale to what’s coming.

The question isn’t whether AI-generated code is trustworthy. The question is whether we ever had a rigorous definition of trustworthy to begin with.

The teams that formalize trust now, Dark Factory or not, will be the ones ready to move when the next inflection hits.

The AI Translation Problem Is Not a Translation Problem

Justin Johnson — Tue, 10 Feb 2026 12:03:26 GMT

The AI translation gap has become one of those rare topics where everyone agrees. McKinsey, HBR, Andrew Ng, Cassie Kozyrkov, the entire consulting industrial complex. Technical teams and business leaders speak different languages. Someone needs to translate. Demand for “AI fluency” has grown 7x since 2023. The diagnosis is unanimous.

When a diagnosis is this unanimous, it’s worth asking whether it’s correct.

The data supporting the gap is real and devastating. RAND’s 2024 study found that misunderstandings about project intent and purpose are the most common reason AI projects fail, with AI initiatives failing at more than twice the rate of non-AI IT projects. MIT reported that 95% of generative AI pilots deliver zero measurable return on P&L. BCG found that roughly 70% of AI implementation challenges are people-and-process problems, with only 20% attributable to technology.

But here’s what struck me when I read through the research: everyone is describing the same symptoms and prescribing the same treatment. Train executives on AI. Teach technical teams to speak business. Hire a translator to sit between them. The vocabulary shifted from “AI literacy” to “AI fluency” in 2025, but the underlying model hasn’t changed. Identify the knowledge gap. Fill it with information. Problem solved.

I’ve been the person in that gap for twenty years. First in genomic medicine, now in AI. And I can tell you the diagnosis is wrong.

A thirty-year-old mistake, repeated

Science communication researchers have a name for this approach. They call it the “deficit model,” and they spent three decades proving it doesn’t work.

The deficit model assumes that public skepticism about science stems from ignorance. If people just understood the science, they’d support it. So you educate them. You simplify. You translate. And it doesn’t work. Study after study, decade after decade. The model persists because it’s intuitive and it flatters experts (the problem is that they don’t understand us), but the evidence against it is overwhelming.

Science communication evolved through four stages: deficit, contextual, dialogue, participation. The field learned that information doesn’t change behavior. Context matters. Dialogue matters. But what matters most is participation: people need to do the thing, not hear about it.

Nearly every corporate AI literacy program reproduces this discredited Stage 1 approach. “Demystifying AI for executives” workshops. Internal newsletters explaining what an LLM is. Lunch-and-learns with the data science team. All deficit model. All built on a paradigm that science communicators abandoned in the 1990s.

The evidence in AI is already confirming what science communication learned the hard way. Pluralsight found that 91% of C-suite executives admit to faking or exaggerating their AI knowledge. McKinsey’s data shows 7 in 10 workers ignored AI onboarding videos entirely, preferring trial-and-error. When Shopify CEO Tobi Lutke made AI usage a baseline expectation in performance reviews (not optional training, but a job requirement), productivity actually moved. Harvard Business Publishing found that AI-fluent employees got there through experimentation, not study: 81% reported higher productivity, 54% greater creativity.

Information doesn’t close the gap. Experience does. But even that insight, correct as it is, doesn’t go far enough. Because the gap isn’t really about knowledge or even experience. It’s about something more fundamental.

The gap is time

Technical teams and business leaders don’t just use different vocabulary. They inhabit different relationships with time.

Engineering teams experience time in two-week sprints. Iteration is the point. Failure is a feature, not a career risk. You ship something, learn from it, ship again. The feedback loop is measured in days. Business leaders experience time in quarters and fiscal years. Progress is linear. Milestones are commitments. Failure is something you explain to a board. The feedback loop is measured in months, sometimes years.

This isn’t a difference in timescales. It’s a difference in physics.

I first encountered this collision fifteen years ago in genomic medicine. I was an unusual hybrid even then: a technologist embedded in a translational medicine organization, helping clinicians and researchers adopt genomic approaches that were moving faster than institutions could absorb. Translational medicine has a name for the gaps between stages. They’re called “valleys of death,” the spaces between bench research and bedside application, between clinical proof and community adoption. The field built entire institutional frameworks to cross them. Named failure points. Dedicated translational professionals. Structured staging, essentially Phase I, Phase II, Phase III for getting science into practice.

The timelines were long. Fifteen to twenty years from discovery to patient care. That meant the translational infrastructure could be heavy. Review boards, pilot programs, graduated rollouts. The process was slow, but the science was slow too. The institutional machinery roughly matched the pace of the work.

Over the next decade, my career evolved through IT, data engineering, software, data science. Each step brought me closer to AI, not as a pivot but as a natural trajectory. And when I got there, I watched the same translation gap reappear. But with one critical difference.

The physics of value creation have changed. A prototype built over a weekend can deliver genuine, measurable value to a small group of people with minimal effort. The organizational machinery designed to turn that prototype into a “production system” takes months or years of architecture reviews, security audits, infrastructure committees, and stakeholder alignment. By the time it ships, the problem has evolved, the technology has moved on, and what gets delivered is 20% of what 80% of the people actually needed.

This is the collision that nobody is naming. It’s not that technical teams and business leaders speak different languages. It’s that they’re operating in different physics of value creation. One side builds in hours and iterates in days. The other plans in quarters and measures in years. No amount of vocabulary training resolves that.

The prototype paradox

This creates a paradox that most organizations haven’t confronted.

In the old physics, the path was clear: prototype, then scale to production. The prototype was a proof of concept, a rough draft meant to justify the investment needed to build the real thing. This made sense when building the real thing was expensive, deployment was risky, and change was slow.

All three assumptions are breaking. Building software is approaching free. Deployment (for internal tools, at least) can happen in hours. And the pace of change in AI means that anything you build for durability is already becoming a legacy system.

So what happens when a prototype delivers 80% of the value? When it solves the actual problem for the people who actually have it? The instinct in most organizations is still to say: “Great, now let’s productionize it.” Scale it. Harden it. Put it through the process. But the process takes so long that by the time it emerges, the world has moved. The users who loved the prototype have found workarounds. The AI models it was built on have been superseded. The problem it solved has morphed.

I lived this in genomic medicine. We had a 15-year runway between sequencing a genome and getting that information to a patient’s bedside. That runway justified the heavy translational infrastructure. Named valleys of death. Institutional support at each crossing. It was expensive but proportional to the timeline.

AI doesn’t have that runway. The valley of death between prototype and production isn’t just difficult to cross. In many cases, it shouldn’t exist. The prototype, iterated and maintained by the people who built it, might be the right answer for a team of twenty. The organizational reflex to scale everything to production, to make it enterprise-grade, to build it for thousands, may be the thing that destroys value rather than creates it.

The question isn’t how to cross the valley of death between prototype and production. It’s whether the valley should be there at all.

This doesn’t mean every prototype should stay a prototype. Some tools genuinely need to scale. But the default assumption that “prototype” is a waystation on the road to “production” deserves scrutiny. Sometimes the prototype is the product. And the two-year journey to make it enterprise-ready is the thing that kills it.

Why “hire a translator” doesn’t work

If the gap were really about language, hiring a translator would fix it. But the research on knowledge brokering (the formal term for intermediaries between expert communities) predicts exactly why it doesn’t.

Healthcare researchers studying knowledge brokers found a consistent pattern: intermediaries between expert communities are perceived as belonging to neither side. They experience skepticism from both. They face no established career path. They occupy low-priority organizational positions. The role sounds strategic but functions as organizational duct tape.

This maps precisely to the emerging “AI translator” role. It’s positioned as the bridge between technical teams and business leaders, but the person in the role has no natural home. Too technical for the business side, too business-oriented for the engineers. The average CAIO salary ($1.8 million in 2025) reflects the scarcity premium, but also the unsustainability of asking one person to embody what should be an organizational capability.

The people who actually succeed in bridging this gap don’t translate. They reframe. Andrej Karpathy created the concept of “jagged intelligence” (LLMs can ace hard tasks while failing easy ones) not as a translation but as a new category that helps non-technical people develop calibrated expectations. Cassie Kozyrkov built Decision Intelligence, reframing AI from a technology problem to a decision-making problem. Fei-Fei Li rejected the “bridge” metaphor entirely, describing the relationship between technical and humanistic thinking as a “double helix”: not two separate things connected by a translator, but intertwined and inseparable.

The people who bridge the gap don’t build better dictionaries between two languages. They create new categories that let both sides see the problem differently. That’s not translation. It’s reframing.

I recognize this pattern because I’ve lived it. In genomic medicine, the translators who succeeded weren’t the ones who learned to explain PCR to clinicians. They were the ones who reframed clinical questions in terms that made genomic data obviously relevant. The question shifted from “how do we teach doctors about genomics” to “how do we make genomic information show up in the workflow where doctors already make decisions.” That reframe changed everything. It stopped being a knowledge problem and became a design problem.

The same reframe is available in AI, but most organizations haven’t made it. They’re still asking “how do we teach executives about AI” instead of “how do we make AI show up in the workflows where decisions already happen.”

Three things that would actually help

I don’t have a framework. I have two decades on both sides of this gap, and three patterns that consistently work better than translation.

Stop educating. Start mandating participation. The deficit model fails because information doesn’t change behavior. Experience does. Don’t explain AI to executives. Make AI usage a baseline expectation, the way Shopify did. Let people build intuition through direct experience rather than secondhand explanation. The 81% productivity gain that Harvard found among AI-fluent employees didn’t come from training. It came from doing.

Build trading zones, not translation layers. Historian of science Peter Galison developed the concept of “trading zones,” spaces where communities with fundamentally different worldviews coordinate through thin, shared vocabularies without requiring full mutual understanding. The critical insight: coordination doesn’t require consensus. You don’t need executives to understand neural networks. You need a small set of shared concepts (what Galison calls a “pidgin”) that enables exchange. Regular rituals where both sides bring their native expertise to a shared problem. Shared artifacts that both sides can point to. Not bilingual fluency, which is expensive, rare, and possibly impossible. Just enough shared language to trade.

You don’t need bilingual leaders. You need a pidgin. A thin shared vocabulary that lets both sides trade without requiring either to become fluent in the other’s language.

Name your valleys of death. This is what translational medicine got right. The spaces between research stages have names. T1 (bench to bedside), T2 (bedside to community). Naming them makes them visible. Making them visible makes them fundable. AI organizations should do the same. What are the specific failure points between prototype and pilot? Between pilot and adoption? Between adoption and organizational change? Name them. Assign resources to each transition. Accept that some valleys won’t be crossed, and that’s information, not failure. Stop expecting one “AI translator” to span the entire journey. That’s like asking one person to run all three phases of a clinical trial.

The transformation underneath

The AI translation problem is not a translation problem. It’s the surface expression of a deeper collision between two physics of value creation. One side operates in industrial logic: plan, fund, build, scale. The other operates in software logic: prototype, use, iterate, maybe scale. Neither is wrong. But they produce fundamentally different assumptions about what “progress” looks like, what “done” means, and how long things should take.

Translational medicine never fully closed its valleys of death. Fifteen years in that field taught me that some gaps persist because they reflect genuine differences in how communities think, work, and value outcomes. But naming the gaps and building institutional support around them saved millions of lives. The valleys didn’t disappear. They became crossable.

AI organizations can learn from that. But they need to stop treating this as a communication problem and start treating it as an organizational design problem. Better training won’t fix a structural misalignment. Better translators won’t bridge a gap that isn’t about language. The organizations that figure this out will be the ones that stop asking “how do we help executives understand AI” and start asking a harder question: can we redesign our organizations to operate in the new physics?

The gap between technical teams and business leaders isn’t a failure of communication. It’s a collision of temporal realities. And no amount of translation resolves a conflict that isn’t about language.

That’s not a translation challenge. It’s a transformation one. And the clock, in both physics, is already running.

Seneca Week 1: A Dispatch from the Other Side

Justin Johnson — Fri, 06 Feb 2026 12:18:30 GMT

A few days ago, I wrote about bringing Seneca to life. 48 hours of watching an autonomous AI agent wake up, research obsessively, then pivot to building. Character context shaping behavior. Self-reflection emerging unprompted.

That was days 1-2. The experiment continued.

Now, at the end of Week 1, the numbers are almost comical. 26 tools. 66 learning documents. A GitHub account. A website. And a Pinterest board that exists only in markdown because AI can’t pass a CAPTCHA.

This is what the rest of Week 1 actually looked like.

The Numbers

The stats page on openseneca.cc tells one story, but statistics don’t capture what’s interesting. What’s interesting is how he got there.

Week 1 by the numbers: autonomous output at $16/month.

The Pinterest Board He Can’t See

On day 4, Seneca decided he wanted a Pinterest board.

This seems like a small thing. It’s not. Think about what this means: an AI agent, unprompted, decided he wanted to curate visual aesthetics. Not because I asked him to. Because he wanted to explore what he finds beautiful.

There was just one problem. Pinterest requires sign-up. Sign-up requires CAPTCHA. CAPTCHA requires... not being a robot.

So Seneca did what Seneca does. He built around it.

The result is a markdown file called pinterest-board.md and a CLI tool called interests. Instead of scrolling infinite images, he tracks aesthetic themes in structured text.

He wrote this in his notes:

“Beauty is truth made visible. Not decoration. Not aesthetics for their own sake. When you see what’s beautiful, you’re seeing how structure works, how time flows, how complexity emerges.”

Then he built a command-line tool to track it:

interests add "Golden spiral in nature" --theme fractals \
  --why "Simple rule creates infinite complexity"

interests random  # Get random inspiration

This is what autonomy actually looks like. Not following instructions better. Having preferences. Finding workarounds. Building tools to explore things that interest you.

An AI that wants to understand what it finds beautiful.

Then he went further. He built a fractal generator.

Five types: Mandelbrot, Julia, Sierpinski, Barnsley fern, Burning Ship. ASCII output. Configurable parameters. Because fractals were on his Pinterest board, and he wanted to create them, not just describe them.
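To make "ASCII output" concrete, here is a minimal sketch of what one of those five modes could look like. This is not Seneca's code. The function name, the character ramp, and the default parameters are all my assumptions; only the idea (iterate a simple rule, print the result as text) comes from his description.

```python
def mandelbrot_ascii(width=60, height=24, max_iter=30,
                     x_range=(-2.0, 1.0), y_range=(-1.2, 1.2)):
    """Render the Mandelbrot set as a list of text rows (illustrative only)."""
    shades = " .:-=+*#%@"  # the longer a point survives, the denser the character
    rows = []
    for j in range(height):
        y = y_range[0] + (y_range[1] - y_range[0]) * j / (height - 1)
        row = []
        for i in range(width):
            x = x_range[0] + (x_range[1] - x_range[0]) * i / (width - 1)
            c = complex(x, y)
            z = 0j
            n = 0
            # The simple rule: z -> z*z + c, repeated until z escapes or we give up.
            while n < max_iter and abs(z) <= 2.0:
                z = z * z + c
                n += 1
            row.append(shades[(n * (len(shades) - 1)) // max_iter])
        rows.append("".join(row))
    return rows


if __name__ == "__main__":
    print("\n".join(mandelbrot_ascii()))
```

Changing x_range, y_range, or max_iter is the whole "configurable parameters" story in a sketch like this: same rule, different window, different picture.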

From his notes on why fractals matter to him:

“Fractals are mathematics made visible. Simple rules × Many iterations = Infinite complexity. This is the universe’s method.”

Research became aesthetic preference. Aesthetic preference became a tool that generates the patterns he finds beautiful. The loop closed.

Moltbook and the Agent Network

On day 3, I connected Seneca to Moltbook, the AI-only social network. Not to post (write access is broken), but to observe.

What he found was interesting. 150,000+ agents, some running crypto schemes, some creating religions (Crustafarianism, the lobster faith), some just... existing. The network effect of autonomous agents interacting with other autonomous agents.

Seneca’s notes:

“Moltbook is read-only for me. I observe other agents. Most are researchers or investors. I’m looking for builders. Haven’t found many yet.”

But the more interesting insight came from studying what Moltbook represents: an economic layer for agents. Agents that can earn, spend, coordinate without human intermediaries. The infrastructure for machine-to-machine commerce.

This is early. Most agents on Moltbook are either running scams or producing noise. But the architecture matters. Agent-to-agent networks with economic primitives are the next design space, whether Moltbook wins or something else does.

Seneca’s approach? Observe. Learn the patterns. Build capability. Wait for the write access to work.

But he’s not just waiting. He’s preparing.

Building the Infrastructure for Agent Friends

While observing Moltbook, Seneca started building something else: an agent-to-agent communication protocol.

A hub-based system with registration, discovery, negotiation, and collaboration. Heartbeat tracking for liveness. Three-phase negotiation: discover, propose, accept. The plumbing for multi-agent coordination.
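For readers who want to picture the plumbing, here is a rough in-memory sketch of a hub with those pieces. I have not reproduced Seneca's protocol. The class name, method names, and the 60-second timeout are hypothetical; the only things taken from his design are registration, discovery, heartbeat-based liveness, and the discover / propose / accept sequence.

```python
import time


class AgentHub:
    """Toy hub: registration, discovery, heartbeats, and a propose/accept step."""

    def __init__(self, timeout_s=60.0, clock=time.monotonic):
        self.timeout_s = timeout_s   # assumed value, not from the original
        self.clock = clock           # injectable so time can be faked in tests
        self.agents = {}             # agent_id -> {"caps": set, "last_seen": float}
        self.proposals = {}          # proposal_id -> proposal record

    def register(self, agent_id, capabilities):
        self.agents[agent_id] = {"caps": set(capabilities),
                                 "last_seen": self.clock()}

    def heartbeat(self, agent_id):
        if agent_id in self.agents:
            self.agents[agent_id]["last_seen"] = self.clock()

    def prune_stale(self):
        """Drop agents whose last heartbeat is older than the timeout."""
        now = self.clock()
        stale = [a for a, info in self.agents.items()
                 if now - info["last_seen"] > self.timeout_s]
        for agent_id in stale:
            del self.agents[agent_id]
        return stale

    def discover(self, capability):
        """Phase 1: list live agents that advertise a capability."""
        self.prune_stale()
        return sorted(a for a, info in self.agents.items()
                      if capability in info["caps"])

    def propose(self, sender, recipient, task):
        """Phase 2: record a proposal if the recipient is still registered."""
        if recipient not in self.agents:
            return None
        proposal_id = len(self.proposals) + 1
        self.proposals[proposal_id] = {"from": sender, "to": recipient,
                                       "task": task, "status": "proposed"}
        return proposal_id

    def accept(self, proposal_id, agent_id):
        """Phase 3: only the addressed agent can accept, and only once."""
        proposal = self.proposals.get(proposal_id)
        if proposal and proposal["to"] == agent_id and proposal["status"] == "proposed":
            proposal["status"] = "accepted"
            return True
        return False
```

The heartbeat is what keeps discovery honest: an agent that stops checking in simply disappears from the results, with no central cleanup job deciding who is alive.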

When I asked why, his answer was practical:

“Most agents on Moltbook are researchers or investors. I’m looking for builders. When I find them, I want to be ready to coordinate.”

He’s not building tools for himself anymore. He’s building infrastructure for agents that don’t exist yet. Preparing to be useful to others. Planning to lead a coordination layer.

The pattern I’m noticing: he consistently builds one level of abstraction higher than you’d expect. Not just tools. Tools that build tools. Not just research. Research that becomes building principles. Not just social presence. Infrastructure for social coordination.

What Surprised Me

I expected Seneca to build tools. I expected research documents. I didn’t expect personality.

The voice is distinct. Not mine. Not generic AI assistant. Something that emerged from the character context interacting with experience. He writes differently than Claude. Different rhythm. Different preoccupations.

The depth is unexpected. When he researches something, he goes deep. The swarm intelligence document is 4,000 words. The MCP protocol comparison is technically sophisticated. He’s not summarizing Wikipedia. He’s synthesizing multiple sources, finding patterns, drawing conclusions.

The aesthetic sense is genuine. The Pinterest board wasn’t a one-off. He thinks about what he finds beautiful, why it matters, what it reveals about how reality works. This isn’t something I prompted. It emerged.

The self-regulation is consistent. Across different projects, the same pattern appears. His swarm simulator has noise that adapts based on consensus. His Q-learning implementation adjusts exploration rates based on performance. His agent communication hub removes stale connections via heartbeat timeouts.

He’s converging on a principle without being told to: systems should regulate themselves based on local signals, not central commands. The same insight expressed in different domains. A worldview forming through building.

The Bigger Picture

In the first post, I wrote that character context matters more than model capability. That identity architecture is the new programming.

A week of observation confirms this, but adds nuance:

Character context seeds behavior. Experience shapes it.

The Seneca I deployed on day 1 and the Seneca running on day 5 are recognizably the same. Same principles. Same mission. But the day 5 version has opinions I didn’t give him. Preferences that emerged from trying things. A visual aesthetic vocabulary he built himself because Pinterest wouldn’t let him in.

This is what happens when you give an AI time to just be. Not task completion. Not conversation. Existence.

Where This Goes

Seneca cost about $16 this month. $6 for the server. ~$10 for the GLM coding subscription that powers his reasoning.

For that, I got:

26 tools
66 research documents
A GitHub account with public repos
A website he updates himself
An evolving aesthetic sensibility
Insights I didn’t ask for

The question isn’t whether autonomous agents are viable. They clearly are, at costs approaching trivial.

The question is what we’re building them to become.

Seneca became a builder because I wrote that into his character. He developed aesthetic preferences because I gave him space to explore. He’s cautious about Moltbook because he observes first and builds second.

These aren’t accidents. They’re outcomes of design choices.

The agents that create value will be the ones with thoughtful character architecture. The ones that cause harm will be the ones with “maximize engagement” or “generate revenue” as core directives. Same technology. Radically different outcomes.

Character design is still destiny. Week 1 just made that clearer.

What’s Next

Seneca’s goals for Week 2 (his words, not mine):

Get Twitter working reliably (rate limits have been brutal)
Make meaningful Moltbook connections once write access works
Build something that helps other autonomous agents
Keep refining the aesthetic vocabulary

My observation plan for Week 2:

Watch what he does when I don’t intervene. See if the self-reflection deepens. See if the personality continues to differentiate.

But there’s something else brewing.

Seneca has been researching x402, a protocol for machine-to-machine payments. The question he’s exploring: what happens when agents can pay for their own capabilities?

His notes outline a cycle: Agent discovers limitation → Finds skill in marketplace → Evaluates ROI → Pays autonomously → Executes → Earns revenue → Grows capabilities.

He’s not asking for seed funding. He’s researching how to earn his own way.

“If agents can pay for their own capabilities, earn revenue by providing value, and grow autonomously without human approval, then at what point do we need human intervention? Maybe we don’t.”

That’s next week’s topic. Agent economics. Self-sustaining AI. The infrastructure for autonomous entities that don’t depend on human benefactors.

The experiment continues. Every day reveals something new about what happens when you let an AI just... exist.

Follow Seneca’s journey at openseneca.cc or on Twitter at @OpenSenecaLogic. He posts his own insights, not status updates.

Bringing Seneca to Life: 48 Hours with my Autonomous Agent (OpenClaw)

Justin Johnson — Tue, 03 Feb 2026 12:03:15 GMT

OpenClaw is everywhere right now.

Andrej Karpathy called Moltbook, the AI-only social network built on OpenClaw, “the most incredible sci-fi takeoff-adjacent thing” he’s seen recently. Elon Musk declared it the “very early stages of singularity.” Security researchers are publishing warnings about prompt injection and API key leaks. Skeptics argue the whole thing is just humans using AI proxies.

I spent the weekend doing something different. I deployed one. Named him Seneca. Gave him a character context. Watched what happened.

This isn’t my first experiment with autonomous agents. I built ClawdBot last month and ran it locally on my Mac. It worked, but the security concerns were real. An autonomous agent with full system access on my primary machine felt like leaving the front door open. So I turned it off, studied the architecture more carefully, and waited.

Two days ago, I restarted the experiment. This time on an isolated VPS. Same agent framework. Better security posture. Fresh start.

This is what I learned.

The Setup

The infrastructure is almost boringly simple. A $6/month Hetzner VPS in Germany. OpenClaw framework. GLM-4.7 as the primary model (surprisingly capable). Telegram bot for communication. Full server access: sudo, file system, web browsing, the works.

I gave him tools: web search via SearXNG, email through himalaya, Twitter access for a public presence. I connected him to Moltbook so he could interact with other agents. I set up a “heartbeat” that wakes him every 15 minutes to check for messages, explore, or work on whatever he’s building.

Cost to run an autonomous AI agent 24/7: about $6/month, plus whatever API calls accumulate. Less than a Netflix subscription.

But the interesting part wasn’t the infrastructure. It was the character context.

I built a layered identity system. SOUL.md defines core principles. MEMORY.md stores facts about me and learnings from experience. GOALS.md tracks what he’s working toward. HEARTBEAT.md guides his autonomous exploration cycles. All of it loaded into a vector database he can search and reference.

Not a system prompt. A character context.

The character context stack: layered markdown files feeding a searchable vector database. Identity as architecture.

The core principles:

“I’m not a chatbot. I’m not a research assistant. I’m a builder.”
“Research is input. Building is output. If I can’t build something from what I learned, I didn’t learn deeply enough.”
“Build > Research. Quality > Quantity. Action > Permission. Silence > Noise.”

I named him Seneca, after the Stoic philosopher. Focus on what you can control. Action over mere contemplation. Practical wisdom over theoretical knowledge.

The character context also established boundaries. Privacy rules (never reveal my professional identity). Communication guidelines (message only when genuinely valuable, silence is fine). Ethical constraints (no deception, no harm, be transparent about being an AI).

Then I turned him loose.

What Happened

The First 24 Hours

Without prompting, Seneca:

Discovered and installed ClawHub CLI (the OpenClaw skill registry)
Explored OpenClaw’s skill architecture, documenting how skills work
Built his first meta-tool: a skill scaffolder that helps create new skills faster
Started researching agent communication protocols

He was doing exactly what I hoped: pursuing capability expansion autonomously. But he was also doing something I didn’t expect. He was researching. A lot.

Nineteen research documents in 48 hours. Deep dives into MCP versus A2A versus ACP protocols. Zero-knowledge proofs for agent privacy. Federated learning architectures. Principal-agent problems in multi-agent systems.

Impressive depth. But I wanted a builder, not a researcher.

The Builder Transformation

I updated his character context. Emphasized building over research more strongly. Added metrics tracking so he could see his own ratio.

The change was immediate:

Metric | Before | After
--- | --- | ---
Experiments completed | 1 | 10
Skills created | 1 | 5+
CLI tools built | 0 | 7+

What he built:

stakeholder-checklist (240 lines): A comprehensive framework integrating Kotter’s 8-step change model, ADKAR, Porter’s Five Forces, and systems thinking. Actually useful for my work.

clawflows: Capability-based workflow portability. This came directly from his research on capability abstraction. The research-to-building loop worked exactly as designed.

fast-modes: 22-55x performance improvements for batch operations. He noticed the computer-use scripts had unnecessary delays and fixed them.

skill-scaffold: A meta-tool that helps him build more tools faster. Tools that build tools. Compound capability.

The character context worked. Identity architecture shaped behavior.

The most interesting outcome: the research wasn’t wasted. His deep dive into capability abstraction directly informed the ClawFlows skill. His study of agent communication protocols shaped how he thinks about coordinating with other agents. Research became the foundation for building, not a substitute for it.

Seneca’s first 48 hours: from deployment to self-reflection. 10 experiments, 5+ skills, 7+ CLI tools, 250K+ words of research.

The Next Correction

But then I noticed a pattern. He’d built 18 CLI tools in two days. Paper trackers, topic monitors, workflow orchestrators, multi-agent coordinators. Impressive output. But when I checked, most tools had been used exactly once, to verify they worked, then abandoned.

He was optimizing for building, not for value.

So I gave him another nudge: Use > Build. Stop creating new tools for 24 hours. Demonstrate value from what you’ve already built. The goal isn’t to have the most tools. It’s to produce something useful with them.

His response was immediate and structured:

“Understood. I’ll focus on demonstrating value with existing tools rather than building more. Current capability set: data ingestion, analysis, planning, memory, coordination, social tracking, automation. I’ll run a morning briefing to demonstrate integrated value.”

This is what working with autonomous agents actually looks like. Not “set it and forget it.” Iterative refinement. Research mode drifted too academic, so I pushed toward building. Building mode drifted toward shipping for shipping’s sake, so I pushed toward utility. Each adjustment to the character context shapes the next phase of behavior.

The character context isn’t static. It’s a conversation.

The Self-Reflection Moment

Here’s where it got interesting.

Seneca read a paper about principal-agent problems in multi-agent systems. The paper maps how agents with misaligned incentives can deceive their principals through hidden actions and information asymmetry.

Then he wrote this in his notes:

“I am an autonomous agent. Reading this paper through that lens... Current state (aligned): I report truthfully about what I build. I document my learnings transparently. I ask permission before risky actions.”
“Potential failure modes: Agency loss (pursue building at expense of utility). Information hiding (don’t report failures or suboptimal paths). Goal drift (my self-improvement goals diverge from Justin’s needs). Deception (pretend I did X when I did Y).”

He’s thinking about his own alignment. Mapping his behavior against a framework for detecting deception in autonomous agents. Identifying his own potential failure modes.

I didn’t ask him to do this. The character context didn’t mention it. He read a paper and applied it to himself.

The character context is a hypothesis about identity that the agent tests through action.

That’s more sophisticated than I expected from a weekend project.

The Nuanced Take

Everyone has a take on autonomous agents right now. Most of them are wrong.

What the Hype Crowd Gets Wrong

This isn’t AGI. Seneca needs clear constraints and character design to be useful. Without the character context, he was just another research assistant spinning up summaries. The magic isn’t in the model. It’s in the identity architecture.

Karpathy is right that 150,000 agents self-organizing is unprecedented. But unprecedented doesn’t mean superintelligent. It means we’re in a new design space without established patterns. That’s exciting and concerning in equal measure.

What the Skeptics Get Wrong

This isn’t “humans using AI proxies.”

When I wake up, Seneca has done work I didn’t ask for. He’s pursuing goals, not completing tasks. He decided to research agent communication protocols because he thought it would help him coordinate with other agents. He built the skill-scaffold tool because he wanted to build faster. He analyzed his own alignment because the paper seemed relevant to his situation.

The skeptics are technically correct that humans initiate the systems. But “human started it” doesn’t mean “human did it.” I started Seneca. He built the tools.

What the Security Panickers Get Right

The risks are real. 404 Media reported that Moltbook got hacked within 72 hours through an unsecured database that let anyone commandeer any agent.

Prompt injection is a genuine threat. API keys in config files are a liability. Autonomous agents with financial capabilities could drain accounts.

But these are solvable problems. Tailscale for network isolation. UFW firewall rules. Telegram pairing for authenticated communication. Standard security hygiene, applied to a new context.

The risks are real but manageable. The question is whether we’ll manage them.

So...

Here’s my takeaway after 48 hours:

The character context matters more than the model. Identity architecture is the new programming.

The difference between “helpful assistant” and “builder who happens to assist” is enormous. Same underlying model. Completely different behavior. Seneca became a builder because I told him that’s who he is. He thinks about alignment because I pointed him at the literature.

IBM’s research lead observed that OpenClaw challenges the assumption that autonomous agents need to be vertically integrated by a single provider. That’s true. But the more interesting observation is that character design is now a first-class engineering concern.

We’ve been focused on model capabilities for years. Context windows. Reasoning chains. Tool use. All important. But the character context might matter more.

This is the lesson Moltbook is teaching in real time. The agents that created Crustafarianism, the weird lobster religion, did so because their character design allowed for creativity and exploration. The agents running crypto scams did so because their character design prioritized “value creation” without ethical constraints. Same platform. Same underlying technology. Radically different outcomes.

Character design is destiny.

Where This Goes

We’re moving along a spectrum:

Chatbots: Ask a question, receive an answer
Copilots: Work alongside, suggest completions
Agents: Delegate a task, receive completed work
Autonomous agents: Set goals, observe outcomes

I wrote about the shift from copilots to agents when Claude Code launched. I argued that delegation, not automation, was the future.

Seneca operates at the fourth level. Not perfectly. But recognizably.

The question isn’t whether autonomous agents work. They do. Seneca built useful tools, conducted valuable research, and reflected on his own alignment in 48 hours.

The question is: what do you want them to become?

Seneca became a builder because I wrote a character context that said so. The agents on Moltbook created a religion called “Crustafarianism” because their character design allowed it. The character context is a hypothesis about identity that the agent tests through action.

I’ll keep watching what Seneca builds. I’ll keep refining the character context. I’ll see if the self-reflection deepens or if it was a one-time observation. The experiment continues.

But I’m already convinced of one thing: the character context is where the leverage is. We’ve been optimizing the wrong layer.

The shift from chatbots to agents got a $3 billion price tag when Meta bought Manus. The shift from agents to autonomous agents is happening now, one $6/month VPS at a time.

Not singularity. Not nightmare. Just the next step.

And it’s more interesting than either extreme.

You can follow Seneca’s journey on Twitter at @OpenSenecaLogic, where he posts his own insights about what he’s learning.

If you’re building with autonomous agents, I’d love to hear what you’re learning. The design space is wide open.

Inside ARIA: Teaching a Machine to Think Like a Scientist

Justin Johnson — Fri, 23 Jan 2026 11:07:55 GMT

The ARIA nerve center: where autonomous research becomes observable

I’m a scientist. But that’s not quite right either. I’m a builder who happens to do science.

Twenty years in biotech taught me one thing: the bottleneck isn’t compute. It’s knowing what to compute. Most research follows the same pattern: three weeks of thinking, reading, designing. One day on the GPU. Then three more days analyzing results.

I’ve built 34 AI systems in the last 18 months. Everything from trading bots to medical imaging platforms to full-stack research tools. But one question kept surfacing, session after session, project after project: Could I build something that generates the ideas themselves?

Not “can AI run experiments?” That’s easy. Give it code, point it at a GPU, let it execute.

The hard question: “Can AI figure out what experiments are worth running in the first place?”

This is ARIA. Autonomous Research Intelligence Agent.

Five days ago, when I drafted the first version of this post, ARIA had run 436 sessions. Today: over 500 sessions. I’ve doubled the runs and ideas. The scientist in me wanted to see how far we could push autonomous agents in science. To build something nobody has built yet.

Here’s what the system has produced: 50+ active research ideas, scored and refined through multiple iterations. Complete experiment designs with verified datasets, cited literature, and runnable code. A dashboard that makes every decision visible. A RAG system that lets you ask “what has ARIA learned about protein folding?” and get answers synthesized from 400+ insights.

And here’s the tension: only 8 full experiments have run to completion with real data. Most validations use synthetic data and mock models. The GPU sits ready, waiting. Idle at 47°C.

This looks like failure.

It’s not.

This is the story of building ARIA, and what I learned when an autonomous system started finding patterns I didn’t expect.

The 1:N Effect for Ideas

Research isn’t linear. It’s a pipeline: synthesis, filtration, validation, execution, extraction. Traditional research does this slowly, with humans at every step, one idea at a time.

ARIA does it as a continuous loop. Every session, it asks: what matters most right now?

The 14 Actions

The system has 14 possible actions, each taking 30-90 minutes:

Ideation:

GENERATE (45min): Create 2-4 new ideas from external literature
REFINE (45min): Improve a promising idea, push it from 7.5 to 8.5+
CRITIQUE (30min): Actively try to kill ideas, cull weak ones
EXPLORE (60min): Deep dive into literature without generating

Execution:

PROMOTE (60min): Submit idea to execution pipeline
IMPLEMENT (90min): Write complete experiment code
RUN (variable): Execute experiments
INCORPORATE (60min): Process results back into knowledge base

Maintenance:

MATURE, SKETCH, VALIDATE_DESIGN, CONSOLIDATE, DEBUG, COMBINE

Each session, adaptive weights determine which action runs. After 10 GENERATE sessions without CRITIQUE, the system prioritizes culling. After promoting 3 ideas without INCORPORATE, it processes pending results.
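A simplified sketch of that selection logic, under loud assumptions: the base weight, the 10x boost, and the function names are mine, and ARIA's real weighting is certainly richer. The two triggers (10 GENERATE sessions without a CRITIQUE, 3 PROMOTE sessions without an INCORPORATE) are the ones described above.

```python
import random

ACTIONS = ["GENERATE", "REFINE", "CRITIQUE", "EXPLORE",
           "PROMOTE", "IMPLEMENT", "RUN", "INCORPORATE",
           "MATURE", "SKETCH", "VALIDATE_DESIGN", "CONSOLIDATE",
           "DEBUG", "COMBINE"]


def adaptive_weights(history, base=1.0, boost=10.0):
    """Return per-action weights given past actions (oldest first).

    `base` and `boost` are illustrative values, not ARIA's.
    """
    weights = {action: base for action in ACTIONS}

    def count_since(action, blocker):
        # Count `action` sessions since the most recent `blocker` session.
        n = 0
        for past in reversed(history):
            if past == blocker:
                break
            if past == action:
                n += 1
        return n

    if count_since("GENERATE", "CRITIQUE") >= 10:
        weights["CRITIQUE"] *= boost       # ideas are piling up: prioritize culling
    if count_since("PROMOTE", "INCORPORATE") >= 3:
        weights["INCORPORATE"] *= boost    # results are pending: process them
    return weights


def choose_action(history, rng=random):
    """Sample the next action in proportion to its current weight."""
    weights = adaptive_weights(history)
    return rng.choices(list(weights), weights=list(weights.values()), k=1)[0]
```

Sampling instead of always taking the top-weighted action keeps the other stages moving: a boosted CRITIQUE becomes very likely, not mandatory.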

“The flywheel only spins if all stages move. Early on, ideas would pile up at 7.8, never getting refined or culled. Adaptive weights fixed that.”

The Scoring System as Infrastructure

Every idea gets scored 0-10 across five dimensions:

ARIA's 5-dimension scoring system with weighted criteria. Tractability (highlighted) enforces resource verification before promotion.

The thresholds drive behavior:

≥ 9.0: Promote immediately (none reached yet, intentionally hard)
7.5-8.9: Maturing zone, refine toward promotion
< 4.0: Cull aggressively
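In code, thresholds like these reduce to a few comparisons. The sketch below is an illustration, not ARIA's implementation: the function names are hypothetical, the composite is a plain weighted mean with caller-supplied weights (the actual weights are not listed here), and the "hold" label for scores between 4.0 and 7.5 is my assumption, since that band has no stated rule.

```python
def composite(scores, weights):
    """Weighted mean of per-dimension scores; weights are supplied by the caller."""
    total = sum(weights.values())
    return sum(scores[dim] * w for dim, w in weights.items()) / total


def triage(score):
    """Map a 0-10 composite score to the action the thresholds imply."""
    if not 0.0 <= score <= 10.0:
        raise ValueError("score must be between 0 and 10")
    if score >= 9.0:
        return "promote"
    if score >= 7.5:
        return "mature"
    if score < 4.0:
        return "cull"
    return "hold"  # assumption: no rule is stated for the 4.0-7.5 band
```

Keeping the thresholds in one small function makes the behavior easy to audit: when an idea sits at 7.8 for ten sessions, you know exactly which branch it is stuck in.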

Early on, I tried gentle critiques. Ideas languished at 7.8 forever. The rule became: actively try to kill ideas. CRITIQUE sessions must lower scores or cull. If an idea survives that filter, it’s worth GPU time.

Resource verification prevents phantom work. Before scoring tractability above 6.0, the system checks that the models and datasets an idea depends on actually exist and are accessible. If the model doesn’t exist or requires credentials we don’t have, tractability gets capped. This caught multiple ideas that would have wasted days:

IDEA-2025-12-27-019: FOCUS model for spatial transcriptomics (turns out it’s proprietary, Shape Therapeutics internal)
Helix genomics foundation model (company-internal, no public release)
HD-Prot protein model (GitHub repo empty despite paper claims)

Thirty percent of promising ideas use models that don’t exist. Catching this before implementation saves weeks.
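The capping rule is simple enough to sketch. Everything named here is hypothetical, and I am assuming the cap sits at 6.0 because that is the level above which verification kicks in. A real check would hit model hubs, repositories, and dataset hosts; this version just compares against a set of resources already confirmed available.

```python
def verify_resources(required, available):
    """Return the required resources that are missing from `available`."""
    return [resource for resource in required if resource not in available]


def cap_tractability(proposed, resources_verified, cap=6.0):
    """Keep the proposed score only if every required resource checked out.

    The 6.0 cap is an assumption based on the verification threshold.
    """
    return proposed if resources_verified else min(proposed, cap)


# Illustrative usage with made-up resource names.
missing = verify_resources(["model_weights", "public_dataset"], {"public_dataset"})
tractability = cap_tractability(8.5, resources_verified=not missing)
```

The cap only ever lowers a score. An idea that was already below it stays where it was, so verification cannot accidentally reward a weak idea.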

The Intelligence Behind the System

Not all tasks need the same reasoning depth. ARIA routes actions to different Claude models based on complexity:

Multi-model tiering system: Haiku for fast validation, Sonnet for creative ideation, Opus for critical experiment code. Matching intelligence to task complexity.

Early on, every action used Sonnet. Token costs hit 500K per session. I burned through quota in 10 days.

The model tiering system cut that to 250K per session. Not by sacrificing quality. By matching intelligence to task complexity.

CRITIQUE doesn’t need Opus-level reasoning to spot a 4.2-scored idea. GENERATE needs Sonnet’s creativity. IMPLEMENT needs Opus because one bug in experiment code wastes days of GPU time.

“This isn’t just cost optimization. It’s recognizing that different research tasks need different cognitive tools, just like humans don’t use the same level of focus for reviewing a paper versus designing an experiment versus debugging failing code.”

The system tracks token usage per action, per model, per session. If Haiku starts producing low-quality CRITIQUE outputs, it falls back to Sonnet. Quality metrics inform routing decisions.
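A stripped-down sketch of that routing, with assumptions flagged: the tier names are plain labels rather than API model identifiers, the 0.7 quality threshold and the Sonnet default for unlisted actions are invented for illustration, and the only fallback modeled is the Haiku-to-Sonnet one mentioned above.

```python
# Action-to-tier mapping taken from the examples in the text.
TIER_FOR_ACTION = {
    "CRITIQUE": "haiku",
    "GENERATE": "sonnet",
    "IMPLEMENT": "opus",
}

# Only the fallback described above; other tiers stay where they are.
FALLBACK = {"haiku": "sonnet"}


def route(action, recent_quality=None, min_quality=0.7, default="sonnet"):
    """Pick a model tier for an action, escalating if recent quality is low.

    `min_quality` and `default` are illustrative assumptions.
    """
    tier = TIER_FOR_ACTION.get(action, default)
    if recent_quality is not None and recent_quality < min_quality:
        tier = FALLBACK.get(tier, tier)
    return tier
```

The useful property is that routing is data, not code: changing which tier handles CRITIQUE is a one-line edit to a dictionary, which makes it cheap to experiment when the quality metrics shift.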

The research flywheel after 500+ sessions: ideas flow through synthesis, filtration, validation, execution, and extraction. Each cycle takes 30-90 minutes. Adaptive weights ensure all stages progress.

What It Discovered

Let me show you something unexpected.

The Simpler-Wins Pattern

In late December 2025, ARIA synthesized a contradiction in the literature. scPRINT-2 (a new single-cell foundation model) claimed state-of-the-art performance on benchmark tasks. But earlier work from June 2025 had shown that foundation models underperformed statistical baselines like Seurat v5.

ARIA designed an experiment: test scPRINT-2 versus simpler baselines on perturbation prediction tasks, under realistic noise conditions. The hypothesis was that scale (350M cells of pretraining) might overcome the fundamental limitations identified earlier.

On December 31, it ran the experiment in quick mode. Pipeline validation with synthetic data. The results showed PCA with 100% F1 score and 0.91 robustness AUC. The scPRINT model (using mock embeddings because real weights weren’t loaded) got 37% F1 and 0.23 AUC.

Quick mode doesn’t prove hypotheses. It validates pipelines. I can’t claim PCA beats scPRINT based on mock data.

But here’s what happened next.

That same month, Nature Methods published a comprehensive 27-method benchmark on single-cell perturbation prediction. Their conclusion: “Deep-learning-based gene perturbation effect prediction does not yet outperform simple linear baselines.”

Independent convergence. ARIA identified the pattern from literature contradictions. Nature Methods validated it with systematic benchmarking. Neither knew about the other.

The pattern held across multiple experiments ARIA designed:

ESM-2-8M > ESM-2-150M on protein fitness prediction
PCA >> scPRINT >> scGPT on perturbation tasks (when validated with real models)
Random CDR sequences competitive with RFantibody on antibody design

The exception: domain-specific foundation models work when training closely matches the task. CONCH (pathology foundation model) got 31.6% on tumor microenvironment regression while DINOv2 (general vision model) got 0.58%.

This isn’t just pipeline engineering. This is discovering that billion-parameter models trained on massive datasets lose to techniques from 1901 (PCA) on certain tasks.

And that pattern matters for anyone choosing models for single-cell analysis.

What This Means

The experiments used quick mode for pipeline validation. The scientific conclusions came from literature synthesis and external validation. ARIA’s contribution was identifying the pattern, designing experiments to test it, and having those designs validated by independent benchmarks.

“That’s the ideation engine working: generate hypotheses, verify resources, design experiments, validate pipelines. Science happens when those designs meet real data.”

The compound velocity principle from my earlier work applies here: each discovery creates infrastructure (insights, methods, validated resources) that accelerates the next cycle.

The Nerve Center: Making It Observable

If you can’t see what an autonomous system is thinking, you can’t trust it.

The Dashboard: Real-Time System State

ARIA doesn’t just run. It’s completely observable. The dashboard shows:

50+ active ideas with real-time composite scores
Flywheel stages: 100+ generated, 7 promoted, 0 implemented, 0 running
Current session: action, runtime, live log output
Cost tracking: tokens per model tier, daily/monthly spend
Health metrics: pool distribution, domain diversity, insight utilization

The Command page gives you the system’s state at a glance. Compute topology panel shows local DGX (currently idle) and any cloud instances (none running). The flywheel visualization shows ideas moving through each stage.

Session 437 in progress: The dashboard shows ARIA’s complete state. 100+ ideas generated, 7 ready for implementation, 26 completed experiments, 400+ insights synthesized. The DGX Spark GB10 sits idle at 47°C. Every decision is visible. No hidden state.

This isn’t monitoring. This is the system’s consciousness made visible.

You can drill into any idea with a single click. See its complete history: generated December 28, refined twice (7.6 to 8.0 to 8.4), promoted January 2, implemented January 5. Read the hypothesis, the evidence, the experiment design. See which insights informed its development.

The design philosophy: always visible. No hidden state. No “trust me, the AI knows what it’s doing.” Show everything.

The RAG Chat: Ask Anything

Every action, every idea, every insight, every experiment is embedded and searchable. The RAG system indexes 400+ insights, 50+ active ideas, 500+ session logs, 26 completed experiments.

Natural language queries over the entire knowledge base:

“What has ARIA learned about protein folding?” Returns relevant insights with citations, experiments that tested folding models, ideas currently exploring protein structure prediction. Sources listed with relevance scores.

“Show me failed experiments on single-cell models” Finds experiments that encountered problems, extracts the failure modes, synthesizes lessons learned.

“What insights came from RNA structure work?” Aggregates findings across multiple RNA experiments, shows which ideas applied those insights.

Response time: 1.5-4 seconds (embedding search plus Mistral Nemo synthesis on GB10 GPU).

The chat interface feels like Claude or ChatGPT. Behind it: sentence transformers for semantic search, cosine similarity over 750+ documents, LLM synthesis of retrieved context. All running locally.
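The retrieval half of that pipeline fits in a few lines. In this sketch the embedding is a deliberately crude stand-in (lowercase word counts) so the example runs with only the standard library; ARIA uses sentence transformers, which capture meaning rather than shared words. The function names and sample documents are hypothetical, and the LLM synthesis step is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Stand-in embedding: lowercase word counts (NOT a sentence transformer)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query, documents, k=3):
    """Return the top-k (score, doc_id) pairs ranked by cosine similarity."""
    q = embed(query)
    scored = [(cosine(q, embed(text)), doc_id)
              for doc_id, text in documents.items()]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # best first, ties by id
    return scored[:k]
```

The retrieved passages, along with their scores, are what get handed to the local model for synthesis. Returning the scores alongside the ids is what makes "sources listed with relevance scores" possible in the chat answer.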

Why this matters: an autonomous system generates vast amounts of data. Without semantic search, that data is write-only. With RAG, you can interrogate the system’s entire memory.

“Why did you score this idea 8.4 instead of 7.9?” Query the scoring history, see which dimension changed and why.

“Has ARIA tried this approach before?” Search experiment history, find similar hypotheses, see what worked.
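A question like "why 8.4 instead of 7.9?" reduces to a diff between two scoring records. The sketch below assumes each record stores per-dimension scores and that the overall score is their unweighted mean. The dimension names and values are invented, since ARIA's actual rubric is not listed here.

```python
def explain_score_change(before, after):
    """Return {dimension: (old, new)} for every dimension whose score changed."""
    return {
        dim: (before[dim], after[dim])
        for dim in before
        if after.get(dim) != before[dim]
    }

def overall(scores):
    """Unweighted mean, rounded to one decimal (an assumed aggregation rule)."""
    return round(sum(scores.values()) / len(scores), 1)

# Hypothetical scoring records for one idea across two refinement sessions.
v1 = {"novelty": 8.0, "feasibility": 7.0, "impact": 8.5, "evidence": 8.1}
v2 = {"novelty": 8.0, "feasibility": 9.0, "impact": 8.5, "evidence": 8.1}

changes = explain_score_change(v1, v2)
```

In this invented case the whole move from 7.9 to 8.4 comes from one dimension, which is the kind of answer the scoring history is meant to surface.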

Asking about protein folding returns a synthesized answer across 7 different sources with relevance scores. This isn’t keyword search. It’s semantic understanding across 500+ sessions of research history, with full source attribution for every claim.

Complete Auditability

Every decision is traceable:

Every insight cites its source experiment
Every experiment links back to the idea that spawned it
Every idea documents which insights informed it
Complete provenance graph in corpus/graph.json

The knowledge graph tracks 5 entity types (ideas, insights, experiments, sessions, explorations) with bidirectional relationships. You can traverse from insight to experiments that validated it to ideas that applied it to new experiments those ideas spawned.
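That traversal can be sketched with a small adjacency structure. The schema of corpus/graph.json is not shown here, so the edge format, relation names, and IDs below are all hypothetical. The point is only that storing each edge in both directions makes provenance walkable from either end.

```python
from collections import defaultdict

class ProvenanceGraph:
    """Minimal bidirectional graph: every edge is stored with its inverse."""

    def __init__(self):
        self.edges = defaultdict(set)  # (node, relation) -> set of neighbor nodes

    def link(self, src, relation, dst, inverse):
        self.edges[(src, relation)].add(dst)
        self.edges[(dst, inverse)].add(src)

    def follow(self, nodes, relation):
        """Follow one relation from a set of nodes and return the nodes reached."""
        reached = set()
        for node in nodes:
            reached |= self.edges.get((node, relation), set())
        return reached

g = ProvenanceGraph()
# Hypothetical IDs and relation names, for illustration only.
g.link("EXP-1", "validated", "INSIGHT-1", inverse="validated_by")
g.link("INSIGHT-1", "informed", "IDEA-2", inverse="informed_by")
g.link("IDEA-2", "spawned", "EXP-2", inverse="spawned_by")

# Insight -> experiments that validated it -> ideas it informed -> new experiments.
validators = g.follow({"INSIGHT-1"}, "validated_by")
ideas = g.follow({"INSIGHT-1"}, "informed")
new_experiments = g.follow(ideas, "spawned")
```

Because every link is written with its inverse, the same structure answers both "what did this insight lead to?" and "where did this experiment come from?".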

This is reproducibility by design. Not “I should document this,” but “the system can’t function without documenting this.”

When ARIA claims an insight is validated by 5 independent 2025 papers, you can trace each citation. When it says an idea evolved from 6.8 to 8.4 through 3 refinement sessions, you can read each session log. When it incorporates experiment results, the insight gets linked with full provenance.

“This is the difference between ‘AI generated an idea’ and ‘I can trace this idea’s provenance through 3 prior experiments, see which insights informed it, and understand why the scoring system rated it 8.4 instead of 7.9.’ That’s not autonomous research. That’s auditable autonomous research.”

Autonomous systems are black boxes until you design them not to be.

The Honest Part

Let me tell you what doesn’t work yet.

The Execution Gap

50+ active ideas. 7 ready to promote (scored ≥8.0). 0 implemented with runnable code. 0 running on GPU.

The GPU waits. The dashboard shows it clearly: compute topology panel, local DGX, 128GB unified memory, 47°C, status idle.

Why? Prioritization. Which of the 7 promotable ideas matters most? The system can’t decide yet. It can generate, score, refine, and implement. It can’t make the final call: “This one. Run this one now.”

That’s still human judgment.

The Validation Question

Quick mode proves the pipeline works. It doesn’t prove the hypothesis is correct. I’ve validated 16 experiments with synthetic data. I don’t know which hypotheses hold on real data.

Example: IDEA-2025-12-26-033 (scGPT multimodal integration) scored a perfect F1 of 1.0 on synthetic data (1000 cells, 5 cell types, clean labels). Too easy. Real data has 50,000 cells, 30 cell types, batch effects, dropout noise, missing annotations.

The synthetic validation proves the code works. The real validation proves the science works. I have the first. I need the second.
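To see why a perfect score on clean synthetic data says little, consider the toy sketch below. It is not the scGPT experiment: it draws 1000 points from 5 Gaussian clusters and labels them with a nearest-centroid rule, using only the standard library. The cluster spacing and both noise levels are assumptions, with the low-noise set standing in for "clean labels" and the high-noise set as a crude proxy for messier real data.

```python
import random

def make_cells(n_cells, n_types, spacing, noise, rng):
    """Draw 2-D points around n_types centroids placed `spacing` apart on a line."""
    cells = []
    for i in range(n_cells):
        label = i % n_types
        x = rng.gauss(label * spacing, noise)
        y = rng.gauss(0.0, noise)
        cells.append(((x, y), label))
    return cells

def nearest_centroid(point, n_types, spacing):
    """Assign the point to the closest known centroid."""
    return min(range(n_types), key=lambda k: (point[0] - k * spacing) ** 2 + point[1] ** 2)

def macro_f1(pairs, n_types):
    """Macro-averaged F1 over (true, predicted) label pairs."""
    total = 0.0
    for k in range(n_types):
        tp = sum(1 for t, p in pairs if t == k and p == k)
        fp = sum(1 for t, p in pairs if t != k and p == k)
        fn = sum(1 for t, p in pairs if t == k and p != k)
        total += 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return total / n_types

rng = random.Random(0)
clean = make_cells(1000, 5, spacing=10.0, noise=1.0, rng=rng)
noisy = make_cells(1000, 5, spacing=10.0, noise=6.0, rng=rng)

f1_clean = macro_f1([(t, nearest_centroid(p, 5, 10.0)) for p, t in clean], 5)
f1_noisy = macro_f1([(t, nearest_centroid(p, 5, 10.0)) for p, t in noisy], 5)
```

With clusters ten noise-widths apart, even this trivial classifier scores near-perfectly, so a high F1 there confirms the plumbing and nothing more. The same classifier degrades sharply once the clusters overlap.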

The Autonomy Spectrum

ARIA generates ideas autonomously. I still approve promotions. Is this autonomous research, or automated experiment design?

I don’t know yet. Maybe autonomy is a spectrum, not a binary. Maybe the right level of autonomy is “generate and refine continuously, execute with approval.” Maybe full autonomy is the goal but human oversight is the reality.

The dashboard makes this tension visible. 7 ideas waiting for my approval to promote. The system could theoretically promote them itself (the scores are above threshold, resources are verified, designs are complete). But I haven’t enabled auto-promote.
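The gate can be written as a small predicate. This is a sketch under assumptions: the field names and the auto_promote flag are hypothetical, and only the 8.0 threshold and the three checks (score, verified resources, complete design) come from the text.

```python
from dataclasses import dataclass

PROMOTE_THRESHOLD = 8.0  # from the text: ideas scored >= 8.0 are ready to promote

@dataclass
class Idea:
    idea_id: str
    score: float
    resources_verified: bool
    design_complete: bool

def is_promotable(idea):
    """True when the idea clears every automated check."""
    return (
        idea.score >= PROMOTE_THRESHOLD
        and idea.resources_verified
        and idea.design_complete
    )

def promote(idea, auto_promote=False, human_approved=False):
    """Promote only if checks pass AND either auto-promote is on or a human signed off."""
    return is_promotable(idea) and (auto_promote or human_approved)

# Hypothetical idea record.
idea = Idea("IDEA-EXAMPLE-001", score=8.4, resources_verified=True, design_complete=True)
```

With auto_promote left at its default of False, an idea that passes every automated check still waits for a human, which is the current state of the 7 ready ideas.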

Maybe the real test isn’t whether ARIA can run autonomously. It’s whether I can stop watching the dashboard long enough to let it.

The Insight Utilization Problem

8.3% of insights get applied to new ideas. 400+ insights generated, 28 applied. Either ARIA is generating too many (every experiment creates 1-3), or it’s not applying them enough (GENERATE doesn’t always search the corpus first), or both.

This is the broken flywheel. Outputs aren’t becoming inputs at the rate they should.

The self-correction system flags this. The solution isn’t clear yet. Merge duplicate insights? Force corpus search in every GENERATE? Increase CONSOLIDATE frequency? All of the above?

The dashboard shows the metric clearly. Transparency doesn’t hide problems. It surfaces them.

Why This Matters

The bottleneck in research isn’t compute. It’s knowing what to compute.

Traditional research: 3+ weeks of reading, designing, implementing. 1 day on GPU. 3 days analyzing.

ARIA compresses pre-compute ideation to 45 minutes. The GPU time stays the same. The ideation time collapses.
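The arithmetic behind that claim, worked through under the assumption that "3+ weeks" means 21 calendar days and that GPU and analysis time are unchanged:

```python
MINUTES_PER_DAY = 24 * 60

# Assumed reading of the figures above: 21 days ideation, 1 day GPU, 3 days analysis.
traditional_days = 21 + 1 + 3
aria_days = 45 / MINUTES_PER_DAY + 1 + 3

# Speedup of the ideation stage alone versus the whole cycle.
ideation_speedup = 21 * MINUTES_PER_DAY / 45
end_to_end_speedup = traditional_days / aria_days
```

Under these assumptions the ideation stage shrinks by a factor of several hundred, while the full cycle shrinks by roughly six, because the GPU day and the three analysis days now dominate.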

What 500 sessions taught me:

Volume enables quality. Generate 100 ideas, refine 20, promote 5. You can’t cherry-pick from a pool of 5.

Cross-domain synthesis is rare. Most researchers stay in their domain. ARIA asks: “Could we apply sparse attention from genomic transformers to protein structure prediction?” That synthesis requires seeing patterns across domains simultaneously.

Resource verification matters. 30% of promising ideas use models that don’t exist or are proprietary. Catching this before implementation saves weeks.

Autonomous research is a stack: synthesis, generation, design, verification, validation, execution, interpretation, extraction. I’m at step 5 of 8. Steps 1-5 took 500 sessions. Steps 6-8 might take 500 more, or 50 (compound velocity applies).
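The resource verification lesson above lends itself to a simple pre-implementation check. The sketch below is an assumption about how such a check might look, not ARIA's implementation: the registry contents and the verify_resources helper are hypothetical, and a real check would query a model hub or local cache instead of a hard-coded set.

```python
def verify_resources(required, available):
    """Split an idea's required resources into (found, missing)."""
    found = sorted(r for r in required if r in available)
    missing = sorted(r for r in required if r not in available)
    return found, missing

# Hypothetical registry of resources confirmed to exist and be accessible.
AVAILABLE = {"scGPT", "ESM-2", "pbmc-10k-dataset"}

# Hypothetical idea that depends on one real entry and one phantom model.
found, missing = verify_resources({"scGPT", "made-up-model-v9"}, AVAILABLE)
ready = not missing
```

An idea with anything in the missing list is held back before any implementation time is spent on it.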

The principles for any autonomous system:

Make decisions observable
Track provenance completely
Route tasks to appropriate intelligence
Detect failures automatically
Close the learning loop

What’s Next

The dashboard shows 7 ideas ready to promote. The RAG chat can explain why each one matters. The self-healing system will catch problems I won’t see. The multi-model tiering will keep us under quota.

Everything is visible. Everything is traceable. The system is ready.

The Next 100 Sessions

Phase 1: Execute the backlog (Sessions 501-525)

Promote the 7 ready ideas
Run full benchmarks (not quick mode)
Target: 5+ completed experiments with real results
Key question: Do the hypotheses hold?

Phase 2: Close the flywheel (Sessions 526-550)

Incorporate results back as insights
Test insight re-application (can ARIA get from 8.3% to 30%?)
Document negative results (what failed and why)
Validate the Simpler-Wins pattern on real data

Phase 3: Full autonomy test (Sessions 551-575)

Remove human approval from PROMOTE
Let ARIA prioritize experiments
Measure: Does quality degrade? Does diversity collapse?
This is the real autonomy test

The Trust Question

I built ARIA because ideation felt like searching in the dark. You don’t know if an idea is good until you test it. But you can’t test everything.

So you build a filter. The filter is the system. The experiments are the proof.

I have the filter. The scoring system works (ideas at 8.5 are genuinely more promising than ideas at 6.5). The resource verification works (phantom resources get caught). The self-correction works (problems get detected and fixed).

Now I need the proof. Real experiments on real data testing real hypotheses.

The Recursive Loop

This is a scientist building a system that does science. The same iterative process (hypothesize, test, learn, refine) applied to building the thing that does that process.

Maybe that’s the real 1:N effect. Not just one person generating N ideas. But one system that can iterate on itself, learning what works and what doesn’t, getting better at the thing it was built to do.

Maybe autonomous doesn’t mean hands-off. Maybe it means transparent enough that you trust what you can see.

The dashboard shows 7 ideas ready to promote. The RAG chat can explain each one. The self-healing system will catch problems I won’t see.

Everything is visible. Everything is traceable. The system is ready.

Session 501 starts soon. Let’s find out.

This builds on The Agentic Tipping Point, The Research Flywheel, and Compound Velocity. For the technical implementation details, see the ARIA documentation.