When the AI Picked a Drug Nobody Was Testing
How teaching an AI to read cells as text revealed a context-dependent cancer mechanism
The challenge facing cancer immunotherapy researchers is maddeningly specific: some tumors simply refuse to show up on the immune system’s radar.
These “cold” tumors don’t display enough molecular flags (antigens) on their surface to trigger an immune response. Even when there’s a faint signal, a weak interferon response flickering in the tumor microenvironment, it’s not enough. The immune system looks right past them.
Researchers at Yale and Google had an idea. What if there were a drug that could amplify that weak signal, but only in tumors where that signal already existed? Not a drug that worked everywhere (those tend to be toxic) but one that acted as a conditional amplifier, turning up the volume only where there was already something to amplify.
Finding such a drug through traditional methods would mean testing thousands of compounds in complex experimental setups that recreate the tumor immune environment. It would take years. So they asked an AI.
Teaching Machines to Read Cells
The model they used, Cell2Sentence-Scale 27B, was trained to read cellular biology the way language models read text. Feed it single-cell RNA sequencing data (snapshots of which genes are active in individual cells) and it converts each cell into a sequence of gene names ordered by expression level. The researchers call these “cell sentences.”
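To make that concrete, here’s a minimal sketch (not the team’s actual pipeline) of how a cell’s expression profile becomes a “cell sentence”: rank the genes by measured expression and write their names out in that order. The gene names and counts below are illustrative, not real data.

```python
# Minimal sketch of the "cell sentence" idea: rank a cell's genes by
# expression and emit their names as a space-separated string.
# Gene names and counts are illustrative, not real data.

def cell_to_sentence(expression: dict[str, float], top_k: int = 100) -> str:
    """Order genes by expression (highest first) and keep the top_k names."""
    ranked = sorted(expression.items(), key=lambda kv: kv[1], reverse=True)
    expressed = [gene for gene, count in ranked if count > 0]
    return " ".join(expressed[:top_k])

# One toy cell: gene -> normalized expression count
cell = {"CD74": 812.0, "B2M": 640.0, "HLA-A": 310.0, "ACTB": 95.0, "GAPDH": 0.0}
print(cell_to_sentence(cell))  # "CD74 B2M HLA-A ACTB"
```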
Train a large language model on enough of these sentences (over a billion tokens’ worth, drawn from 57 million human and mouse cells), and something unexpected happens: the model starts to reason about cellular behavior.
But this was a different kind of problem. The researchers needed the AI to understand context: a drug that does nothing in isolation but becomes potent when interferon is present. It’s the difference between asking “what treats cancer?” and “what treats cancer only when condition X is already met?”
Smaller models trained the same way couldn’t handle it. The 27-billion-parameter version could.
A Virtual Screen of 4,000 Drugs
The team designed a two-stage test. In the first stage, the model analyzed real patient samples with tumors showing intact immune interactions and that faint interferon signal. In the second, it looked at isolated cancer cell lines with no immune context at all. Then it virtually screened more than 4,000 compounds.
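The preprint is the place to look for the exact scoring procedure. As a rough sketch of the logic, though, a conditional screen scores each compound’s predicted effect on antigen presentation in both contexts and keeps only the ones that act when interferon is present and do essentially nothing without it. The predict_effect function here is a hypothetical stand-in for querying the model, not a real API.

```python
# Hedged sketch of a two-context virtual screen. `predict_effect` is a
# placeholder for asking the model how much a compound changes antigen
# presentation in a given cellular context; it is not a real API.

def conditional_hits(compounds, predict_effect, min_boost=0.3, max_alone=0.05):
    """Keep compounds that amplify antigen presentation only with interferon."""
    hits = []
    for drug in compounds:
        with_ifn = predict_effect(drug, context="tumor + low-dose interferon")
        alone = predict_effect(drug, context="isolated cancer cells")
        # A conditional amplifier: strong effect in context, ~none without it.
        if with_ifn >= min_boost and abs(alone) <= max_alone:
            hits.append((drug, with_ifn, alone))
    return sorted(hits, key=lambda h: h[1], reverse=True)
```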
The model picked silmitasertib, a kinase inhibitor that had been tested in clinical trials for other cancers but hadn’t shown particularly impressive results. Not an obvious choice. The prediction was specific: silmitasertib would amplify antigen presentation by about 50% when combined with low-dose interferon. Without interferon, it would do essentially nothing.
Then came the part that matters: testing it. The computational prediction is interesting. The wet lab validation is what counts.
The prediction held. In human cells, silmitasertib combined with low-dose interferon increased antigen presentation by approximately 50%. Alone, it had no effect. The context-dependence the AI predicted was real.
What This Doesn’t Mean (Yet)
This doesn’t mean a cancer drug is around the corner. What worked in cell culture will need to work in mouse models, then in human trials, a process that typically takes years and usually fails. The preprint hasn’t been peer-reviewed yet. The mechanism by which silmitasertib amplifies interferon signaling isn’t fully understood. Yale researchers are working on that now, and testing other AI-generated predictions in different immune contexts.
But it does mean something changed in how this hypothesis was generated. No human researcher was studying silmitasertib for this particular application. The AI identified a context-dependent effect that wouldn’t have been obvious from existing literature or standard drug screens.
The Scaling Question
The research builds on earlier work that showed biological models follow scaling laws similar to those seen in natural language processing. Make the model bigger, and it gets better at more than just the tasks it was trained on. It develops new capabilities.
In this case, the capability was contextual reasoning. The 27-billion-parameter model could distinguish between “a drug that increases antigen presentation” and “a drug that increases antigen presentation only when interferon is present.” Smaller models the team tested couldn’t make that distinction.
Whether this scaling pattern continues (whether a 100-billion or 500-billion parameter model trained on cellular data would develop even more sophisticated reasoning) is an open question. The compute required to train C2S-Scale was already substantial, using Google’s TPU v5 infrastructure. Going bigger gets expensive fast.
What’s Actually New Here
Drug discovery AI isn’t new. Plenty of models predict drug-target interactions or screen compounds for specific effects. What’s different is the type of reasoning required. Most screening tools look for simple associations: does compound X affect protein Y?
C2S-Scale was asked to find something more subtle: a compound whose effect depends entirely on what else is happening in the cell.
That kind of conditional logic is harder to encode in traditional screening approaches. It requires understanding the biological context, not just molecular structures. The model had to learn not just “what” but “when.”
The other difference is in how the model was trained. Most biology-specific AI tools are built from scratch for biology. C2S-Scale started with a large language model (Google’s Gemma) and adapted it by training on “cell sentences” (biological data formatted as text). The hypothesis is that language models already good at reasoning about context in text might transfer some of that ability to biological contexts.
Based on this result, the hypothesis seems worth pursuing.
Open Questions
The team released the model weights, code, and training details publicly, which is unusual for work involving major tech companies, where AI models are often kept proprietary. The weights are on Hugging Face under a CC-BY-4.0 license. The code is on GitHub. The preprint is on bioRxiv, not yet peer-reviewed.
This matters for replication. Other labs can now attempt to reproduce the silmitasertib finding, test the model’s other predictions, or train similar models on different datasets. In AI-driven science, where it’s often unclear whether results come from the model, the data, or the specific training procedure, replication is how you figure out what actually works.
It also means researchers without access to massive compute resources can use the trained model. Fine-tuning a 27-billion-parameter model for a specific biological question is expensive but manageable. Training one from scratch is not.
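For anyone who wants to poke at the released model, the workflow is the standard one for causal language models in the Hugging Face ecosystem. The sketch below is hedged: the repository id and the prompt format are placeholders, so check the official model card for the real names and the documented prompt structure.

```python
# Hedged sketch of loading the released weights with Hugging Face
# transformers. The repo id and prompt are placeholders; consult the
# official model card for the exact names and prompt format.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "vandijklab/C2S-Scale-27B"  # placeholder, not the verified repo name

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype=torch.bfloat16,  # 27B parameters: expect to need multiple GPUs
    device_map="auto",
)

prompt = "Predict the effect of silmitasertib on antigen presentation given: CD74 B2M HLA-A ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```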
What This Doesn’t Solve
Context-dependent drug effects are useful, but they’re not the only problem in drug discovery. The model can suggest compounds worth testing. It can’t predict toxicity in humans, off-target effects, pharmacokinetics, or any of the other reasons drugs fail in clinical trials. Most do fail, even ones that look promising in cells.
The model is also limited to the biological contexts it was trained on. If a relevant cell type or condition isn’t well-represented in the training data (57 million cells sounds like a lot, but biological diversity is vast), the model might miss it or reason incorrectly about it.
And there’s the question of mechanism. The model predicted silmitasertib would work. It didn’t explain why. Understanding the mechanism matters both for improving the drug and for generating new hypotheses. That work still requires traditional biology.
The Longer View
If this approach pans out (and that’s still an “if” pending further validation), it suggests a different workflow for early-stage drug discovery. Instead of screening thousands of compounds in the lab, you screen them computationally first, then test the most promising hits. The model doesn’t replace experimental biology; it changes what experiments are worth doing.
The cold tumor problem is still unsolved. But there’s now a candidate mechanism to test, generated faster than traditional methods would have found it. Whether that mechanism becomes a therapy depends on a lot of work that hasn’t happened yet.
The model and data are available now for anyone who wants to try.
Resources: