Foundation Models in Biology Hit the Noise Floor
A new benchmark of 600+ model configurations reveals that biological foundation models do improve perturbation predictions, but experimental variability sets a hard ceiling. What this means for techbio investment.
Foundation models (FMs) have transformed natural language processing and computer vision. The question for biology is whether the same paradigm, pre-train a large model on massive data, then apply it to specific tasks (either directly, or with minimal adaptation), can help predict which drugs will work in humans. In practice, we cannot test that directly (random screens in humans are not an option), so the field works on a proxy: predicting how cells respond to interventions in the lab. In biology, these interventions are called perturbations: you either disable a gene to see what breaks (a gene knockout, typically done with CRISPR), or you expose a cell to a drug (a chemical perturbation) and measure what changes in gene expression. This proxy is useful but has limits, and the gap between “works on cells” and “works in patients” is one of the central tensions of this piece. If FMs can at least predict the outcome of these cell-level experiments without running them, the payoff is already significant: screen perturbations in silico before running expensive wet lab experiments, and compress the design-test-learn cycle from months to days.
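To make the measurement concrete, here is a minimal sketch in Python (toy counts and made-up gene names, not data from any of the papers discussed here) of the basic quantity these experiments produce: the per-gene log fold change between perturbed and control cells.

```python
import numpy as np

# Toy expression counts: rows are cells, columns are genes (illustrative only).
genes = ["GENE_A", "GENE_B", "GENE_C"]
control = np.array([[120, 30, 5],
                    [100, 25, 7],
                    [110, 28, 6]], dtype=float)
perturbed = np.array([[118, 60, 1],
                      [105, 75, 0],
                      [112, 66, 2]], dtype=float)

# Average each gene across cells and add a pseudocount so zeros
# do not blow up the logarithm.
eps = 1.0
ctrl_mean = control.mean(axis=0) + eps
pert_mean = perturbed.mean(axis=0) + eps

# Log2 fold change: positive means the gene goes up after the perturbation,
# negative means it goes down.
log_fc = np.log2(pert_mean / ctrl_mean)
for g, lfc in zip(genes, log_fc):
    print(f"{g}: log2 fold change = {lfc:+.2f}")
```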
Cole, Huizing et al. at GenBio AI just released the most comprehensive benchmark on this question to date (“Foundation Models Improve Perturbation Response Prediction,” bioRxiv, Feb 2026). The setup: take all the leading single-cell FMs (AIDO.Cell, scGPT, scPRINT, Geneformer, scFoundation), add gene-level embeddings, molecular fingerprints, and chemical FMs, and test over 600 configurations of roughly 20 distinct models on two large perturbation datasets (Essential and Tahoe-100M).
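To give a sense of what "600+ configurations" means in practice, here is a hypothetical sketch of how such a grid gets enumerated. The component axes below are drawn from the paper's description, but the exact factors and levels GenBio crosses are my assumption, so the total printed here will not match theirs.

```python
from itertools import product

# Hypothetical component axes; the real benchmark's factors may differ.
cell_embeddings = ["AIDO.Cell", "scGPT", "scPRINT", "Geneformer", "scFoundation", "none"]
perturbation_encodings = ["gene-identity embedding", "molecular fingerprint",
                          "chemical FM", "one-hot"]
prediction_heads = ["linear", "MLP", "kNN"]
datasets = ["Essential", "Tahoe-100M"]
tasks = ["log fold change regression", "DEG classification"]

configurations = list(product(cell_embeddings, perturbation_encodings,
                              prediction_heads, datasets, tasks))
print(f"{len(configurations)} configurations to benchmark")
# Each configuration = one way of representing the cell and the perturbation,
# one predictor on top, one dataset, one task framing.
```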
Some context first. Last year, Ahlmann-Eltze et al. published a damaging result in Nature Methods: none of the deep learning models they tested beat deliberately simple linear baselines. The implication was hard to dodge: maybe biological FMs were not learning anything useful beyond predicting the average training response.
GenBio’s answer is more nuanced, and it comes from two directions. First, the field has moved on: several of the models GenBio benchmarks (notably AIDO.Cell, released late 2024, and scPRINT, published in Nature Communications in early 2025) did not exist when Ahlmann-Eltze ran their tests. Second, and perhaps more importantly, GenBio shows that the answer depends on how you frame the prediction task. If you ask the model to predict the exact magnitude of gene expression change for every gene (log fold change regression), most FM embeddings still do not convincingly beat negative controls. But if you ask a different question, which genes are significantly affected by the perturbation (DEG classification), several FM embeddings clearly do better than baselines. How you define the prediction task changes the answer.
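A minimal sketch of how the same differential-expression results yield two different prediction targets, and why they are scored differently (toy random data, standard scipy/scikit-learn calls; the thresholds of |log fold change| > 1 and adjusted p < 0.05 are illustrative, not the paper's):

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n_genes = 2000

# Toy "ground truth" from a differential expression analysis:
# a measured log fold change and an adjusted p-value per gene.
true_lfc = rng.normal(0.0, 1.0, n_genes)
adj_pval = rng.uniform(0.0, 1.0, n_genes)

# Toy model output: a noisy version of the truth, standing in for an FM-based predictor.
predicted = true_lfc + rng.normal(0.0, 1.5, n_genes)

# Framing 1 -- regression: predict the exact magnitude of change for every gene.
r, _ = pearsonr(true_lfc, predicted)
print(f"LFC regression, Pearson r: {r:.2f}")

# Framing 2 -- classification: which genes are significantly affected (DEGs)?
is_deg = (np.abs(true_lfc) > 1.0) & (adj_pval < 0.05)
auroc = roc_auc_score(is_deg, np.abs(predicted))
print(f"DEG classification, AUROC: {auroc:.2f}")
```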
What kinds of models work best? In plain terms: the ones that know something about the biology of the target being perturbed, not just the chemistry. Models that encode what the perturbation target actually is, what gene is being knocked out, or what receptor a drug binds to, outperformed those that only encode the molecular structure of the drug itself. This makes intuitive sense: knowing that you are targeting gene X in pathway Y gives the model useful prior knowledge about which downstream genes are likely to be affected. GenBio also shows that combining information from multiple specialised FMs (one for cell state, one for gene identity, one for compound structure) into a single prediction framework pushes performance further still, treating perturbation prediction as an integration problem rather than a single-model problem.
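A minimal sketch of that integration idea, with random vectors standing in for frozen FM embeddings and a simple linear head on top; the real systems are considerably more elaborate:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n_perturbations, n_genes = 300, 50

# Stand-ins for frozen embeddings from specialised foundation models.
cell_state_emb = rng.normal(size=(n_perturbations, 64))   # e.g. basal cell state from a single-cell FM
target_gene_emb = rng.normal(size=(n_perturbations, 32))  # e.g. identity of the knocked-out gene
compound_emb = rng.normal(size=(n_perturbations, 16))     # e.g. drug structure from a chemical FM

# Toy ground truth: per-gene log fold changes driven mostly by the target's biology.
W = rng.normal(size=(32, n_genes))
y = target_gene_emb @ W + 0.1 * rng.normal(size=(n_perturbations, n_genes))

# Integration: concatenate the views and train one simple head on top.
X = np.concatenate([cell_state_emb, target_gene_emb, compound_emb], axis=1)
head = Ridge(alpha=1.0).fit(X[:250], y[:250])
print("held-out R^2:", round(head.score(X[250:], y[250:]), 2))
```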
Key Finding #1: No model can outperform the noise in the data it learns from
Here is the finding that should anchor how anyone thinks about biological FMs. The GenBio benchmarks include a simple but revealing control: they measure how much the results of the same experiment vary when you repeat it under identical conditions (the “Experimental Error” baseline). This quantifies the irreducible noise in the system, the variability that comes not from the model but from the biology itself: random fluctuations in gene expression, small differences in how cells are handled, batch-to-batch variation in reagents. The best FM configurations in the paper approach this ceiling but, by definition, do not breach it. No amount of pre-training data or parameter scaling moves this boundary, because the limit is not set by the model: it is set by the measurement. And this is on K-562, a well-characterised immortalised cell line. When you shift to primary patient-derived material (organoids, explants), the noise floor rises further. Gustave Ronteix, CTO of Orakl Oncology, confirms they observe the same pattern with their patient-derived organoid data. This is not something that better engineering will fix. The ground truth itself is noisy, and no model can be more accurate than the data it learns from.
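The logic of that control is simple enough to sketch. Assuming you have replicate runs of the same perturbation (toy numbers below; the paper's exact metric and protocol differ), the agreement between one replicate and another is the yardstick, and a model is scored against a held-out replicate with the same metric:

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(2)
n_genes = 2000

# Toy setup: a "true" perturbation effect plus independent measurement noise
# in each experimental replicate.
true_effect = rng.normal(0.0, 1.0, n_genes)
replicate_1 = true_effect + rng.normal(0.0, 0.8, n_genes)
replicate_2 = true_effect + rng.normal(0.0, 0.8, n_genes)

# "Experimental Error" baseline: how well does one replicate agree with another?
ceiling, _ = pearsonr(replicate_1, replicate_2)

# A model prediction, scored against a held-out replicate with the same metric.
model_prediction = true_effect + rng.normal(0.0, 1.2, n_genes)  # imperfect model
r_model, _ = pearsonr(model_prediction, replicate_2)

print(f"replicate-to-replicate agreement: r = {ceiling:.2f}")
print(f"model vs held-out replicate:      r = {r_model:.2f}")
```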
That said, the field is closing in on this ceiling faster than the GenBio paper suggests. Ihab Bendidi of Valence Labs/Recursion points to models that GenBio did not benchmark but that represent the current state of the art. State, from Arc Institute, learns a cell embedding from 167M observational cells and a perturbation transition model from 100M+ perturbed cells, conditioning on cell state to predict how it shifts under intervention. It is increasingly regarded as the best public FM for this task. TxPert, from Valence Labs/Recursion (on which Ihab is a co-author), takes a different approach: it encodes perturbations through biological knowledge graphs (protein-protein interactions, pathway structure) and combines them with a basal cell-state embedding via a simple MLP. On the same Essential dataset that GenBio benchmarks, TxPert already matches experimental error. Xaira’s X-Cell goes bigger: a 4.9B parameter diffusion model trained on 25.6M perturbed cells across 16 biological contexts, incorporating priors from protein language models, gene interaction networks, and morphological profiles. It claims to outperform even State.

What these three models share is telling: they all combine biological prior knowledge with cell-state conditioning and purpose-built perturbation modelling. In plain terms, they do not just take a pre-trained cell model and ask it to predict perturbation effects as an afterthought. Instead, they are built from the ground up for the perturbation task: they take into account what the cell looks like before the intervention (cell-state conditioning), they encode what we already know about how genes relate to each other (biological priors from interaction networks, pathways, protein structure), and they model the perturbation as a transition from one state to another rather than a static prediction. This is a fundamentally different approach from the GenBio benchmark, which mostly takes frozen embeddings from general-purpose FMs and feeds them into simple classifiers. The field, catalysed by Arc Institute’s Virtual Cell Challenge last autumn, is in a surprisingly healthy place.
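To make the architectural contrast concrete, here is a deliberately simplified sketch of the state-conditioned, transition-style formulation these newer models share. It is not the actual State, TxPert, or X-Cell code, just the shape of the idea: take the basal cell state plus a perturbation embedding built from prior knowledge, and predict the shift rather than a static profile.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
n_cells, n_genes, emb_dim = 500, 40, 24

# Basal (unperturbed) cell states: what each cell looks like before intervention.
basal_state = rng.normal(size=(n_cells, n_genes))

# Perturbation embedding built from prior knowledge (e.g. a knowledge-graph or
# pathway-based representation of the targeted gene); a random stand-in here.
perturbation_emb = rng.normal(size=(n_cells, emb_dim))

# Toy ground truth: the shift in expression depends on both the perturbation
# and the cell's starting state (a context-dependent response).
A = rng.normal(size=(emb_dim, n_genes))
delta = perturbation_emb @ A + 0.3 * basal_state
perturbed_state = basal_state + delta + 0.1 * rng.normal(size=(n_cells, n_genes))

# Transition model: predict the shift from (basal state, perturbation embedding),
# then add it back to the basal state, rather than predicting a static profile.
X = np.concatenate([basal_state, perturbation_emb], axis=1)
model = Ridge(alpha=1.0).fit(X[:400], (perturbed_state - basal_state)[:400])

predicted_state = basal_state[400:] + model.predict(X[400:])
rmse = np.sqrt(np.mean((predicted_state - perturbed_state[400:]) ** 2))
print(f"held-out RMSE on the predicted perturbed state: {rmse:.2f}")
```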
Ihab also flagged an important caveat about the GenBio benchmark itself: its experimental setup does not actually perform perturbation prediction in the standard sense. Rather than conditioning on cell state and predicting the outcome of a specific intervention, the benchmark tests whether gene embeddings extracted from unperturbed cells correlate with perturbation effects. That is an interesting but distinct question from what the field usually calls perturbation prediction. It shows that FM representations carry perturbation-relevant signal, but not that they can predict what happens to a given cell when you intervene on it.
Key Finding #2: Knowing biology still beats scaling compute
This second finding is worth spelling out because it cuts against the dominant narrative in AI. In language and vision, bigger models trained on more data have reliably outperformed smaller, more curated approaches. Biology does not seem to work that way, at least not yet. The best-performing embeddings in the GenBio benchmark are not the largest or most expensively pre-trained. They are the ones that encode what we already know about biology: which proteins interact with each other (gene-gene interaction networks), what the perturbation target is, and how cell states are organised in curated reference atlases. Models trained on raw transcriptomic data without these priors pick up statistical regularities but miss the causal structure (the regulatory logic, the pathway architecture, the physical chemistry) that actually determines how a cell responds to an intervention. In short, biological data is not like text. It has structure that comes from physics and evolution, and FMs that bake in that structure generalise better. More parameters and more tokens are not a substitute for better biological inductive biases.
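As a toy illustration of what such a prior looks like in practice (a tiny made-up network, not any model's actual encoding): derive gene representations from a gene-gene interaction graph, so that genes in the same pathway get similar embeddings before the model has seen a single expression measurement.

```python
import numpy as np

# Tiny made-up gene-gene interaction network: two "pathways" of three genes each,
# with one weak link between them. Real priors come from resources like
# protein-protein interaction databases or curated pathway maps.
genes = ["G1", "G2", "G3", "G4", "G5", "G6"]
adj = np.array([
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 1, 0, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
], dtype=float)

# Spectral embedding: eigenvectors of the graph Laplacian place connected genes
# close together, encoding pathway structure with no expression data at all.
degree = np.diag(adj.sum(axis=1))
laplacian = degree - adj
eigvals, eigvecs = np.linalg.eigh(laplacian)
embedding = eigvecs[:, 1:3]  # skip the trivial constant eigenvector

for g, vec in zip(genes, embedding):
    print(g, np.round(vec, 2))
# G1-G3 cluster together, as do G4-G6: the network structure is now a feature
# a downstream perturbation model can exploit.
```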
From noise floor to translatability: insights from the field
These findings echo what keeps coming up in our conversations with practitioners. In our recent TTY Sessions on bio & techbio, organised by my cofounder Kevin Kuipers (with Amine Saboni (Pruna.ai), Ashley van Heteren (Radiogenesis), Chouaib Meziadi (Epics Biotechnology), Cyril Veran (Living Models), Felix Raimundo (TychoBio), Guillaume de Luca (ZebraMed), Ihab Bendidi (Valence Labs/Recursion), Jean du Terrail (Living Models), Jeremie Kalfon (ENS/Pasteur), Joana Duarte (Biolevate), Julien Dusquesne (ScientaLab), Maziyar Panahi (OpenMed), Pierre Manceron (Raidium), and Sophie Monnier (InstaDeep)), the consensus was clear: in computational biology, progress is constrained far more by data than by models. Biological datasets are small, inconsistent, poorly labelled, and riddled with batch effects that cluster by lab or protocol rather than by biology. The real advantage lies not in model architecture but in strong data pipelines, experimental systems, and tight feedback loops between models and wet lab validation. There is also a talent problem: the bioinformaticians and computational biologists who understand both the biology and the modelling are rare, and increasingly get pulled into pure AI roles where the pay is higher but the domain expertise goes to waste.
Felix Raimundo, CEO of TychoBio, pushed the argument one step further in our Discord, and it cuts to the core of the techbio value question. The pharmaceutical industry already generates in-vitro and in-vivo data before committing to clinical trials. Even with that ground truth, the industry-wide clinical trial success rate is only around 9%. Most techbio AI companies train on the same data (or worse). Even with perfect predictions, all they achieve is predicting the result of experiments that already fail to translate to humans 91% of the time. Being able to predict a broken proxy faster does not fix the proxy. The real question is whether you are making more drug candidates (pipeline size) or better drug candidates (pipeline quality). If the bottleneck in your therapeutic area is throughput (how many candidates you can test per unit of time and money), faster screening helps. But for most areas, the bottleneck is that drugs fail in humans, and no amount of in silico screening on non-translatable data changes that. Two things actually move the needle on quality: generating data with higher translatability (patient-derived organoids, complex co-culture systems), and working on therapeutic modalities where the mechanism is so direct that preclinical results reliably predict clinical outcomes.
For those wanting to go deeper on this, Felix points to two resources worth the time. The clinical trial abundance movement is making the case that AI will not automatically accelerate drug development unless the underlying experimental models improve. And Jack Scannell, the researcher who coined Eroom’s law (the observation that the cost of developing a successful drug has been doubling roughly every nine years), holds a bearish position on AIxBio and argues that what will actually help is more and better human data, not better models trained on the same animal and cell-line proxies.
The Four Criteria of Our Current Techbio Thesis
1. Problem specificity. Horizontal perturbation prediction platforms that promise to predict anything in any cell type are fighting the noise floor across all fronts simultaneously. From what we can tell, the companies with a credible path to clinical utility are those that constrain the prediction task to a defined modality, a defined cell context, and a defined readout. This is how the best FM configurations in the paper succeed: they narrow the prediction target to a regime where signal exceeds noise. We back companies that solve a specific biological problem deeply, not those that build general-purpose prediction engines. The early techbio wave raised massive rounds to build generalised models of biology without clear therapeutic paths, and most of those companies have struggled. The strategies that work are more incremental, focused on narrow therapeutic areas with clear constraints and specific clinical problems.
2. Wet-dry loop velocity, on translatable data. If model performance saturates near experimental noise, the binding constraint on progress is not compute but data quality. But not all data is equal. A fast loop on standard cell-line assays just predicts a broken proxy faster. The competitive moat shifts to whoever can generate the next batch of clean, task-specific, and translationally relevant biological data fastest, and feed it back into the computational layer. The FM becomes a leverage tool inside a rapid experimental loop, not a standalone prediction oracle. The companies we favour own both the data generation infrastructure (patient-derived organoid platforms, automated screening rigs, complex co-culture systems) and the modelling layer, so that each wet lab cycle produces training data for the next dry lab iteration. The moat is the speed of this loop, on data that actually predicts what happens in patients.
3. Mechanistic legibility. This is the second lever from the translatability argument: work on therapeutic modalities where the mechanism of action is physically deterministic, meaning the mapping from design to outcome follows well-understood laws, and where strong preclinical results therefore have a higher chance of translating to the clinic. Antisense oligonucleotides (ASOs) block gene expression through direct RNA base pairing. Radioligand therapies (RLTs) deliver radiation to tumour cells via a targeting vector governed by binding kinetics and decay physics. In both cases, the noise floor is structurally lower because the mechanism admits fewer degrees of freedom, and the gap between preclinical and clinical results is narrower. Our conviction is that companies working on these mechanistically direct modalities will, over time, demonstrate clinical success rates well above the 9% industry average, precisely because the physics constraining the outcome is well enough understood for preclinical results to translate. Contrast this with small molecule polypharmacology or immuno-oncology combination therapies, where cascading stochastic interactions make the prediction target high-dimensional, the noise floor high, and the translational gap wide. The GenBio paper does not test these regimes directly, but the logic extends: more mechanistic degrees of freedom mean more noise and lower translatability.
4. Rich data for constrained mechanisms. Does mechanistic legibility mean you should limit yourself to simple readouts (readouts being what you actually measure after the perturbation: a single number like knockdown percentage, an image, or a full profile of gene expression changes across the genome)? Not necessarily. Standard drug development pipelines generate sparse data, often just a few hundred data points per programme. For mechanistically constrained problems with a narrow prediction target, that can already be enough to build useful models. But for learning complex mappings, like predicting how a given sequence design affects expression across multiple cell types, hundreds of data points are far too few. The question is always whether the dataset is rich enough relative to the complexity of what you are trying to learn. When you invest in generating rich readouts (full transcriptomic sequencing across multiple cell types) for thousands of candidate molecules, you can build datasets large enough for the model to learn the rules governing efficacy, off-target effects, and cell-type specificity. The mechanism must still be lawful enough for those rules to exist, and that is the key insight: the value of rich data depends entirely on whether the underlying mechanism is constrained enough for the model to learn from it. Rich data on a well-understood mechanism (like RNA hybridisation) compounds into real predictive power, because there are genuine patterns to find. Rich data on an unconstrained mechanism (like multi-target polypharmacology) is a different story. When a small molecule binds to multiple proteins with different affinities, each triggering its own signalling cascade, and those cascades interact non-linearly depending on cell type, what you measure at the end (gene expression changes) is the aggregate of many overlapping processes, and you cannot tell from the readout which target caused which effect. The key causal variables (actual binding events, pathway activation states) are not in the data. The model is not learning from noise because the experiment was sloppy. It is learning from noise because the important variables are hidden and the observable output conflates them. This is why we pay close attention to the data generation strategy as much as to the model architecture.
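One way to picture the scale argument behind this criterion (purely illustrative numbers, not any company's actual screen): compare how many labelled measurements a sparse pipeline and a rich-readout screen produce for the same set of candidate molecules.

```python
# Purely illustrative arithmetic: how dataset richness scales with readout choice.
n_candidates = 2000          # candidate molecules screened

# Sparse pipeline: one scalar readout (e.g. % knockdown) in one cell type.
sparse_labels = n_candidates * 1 * 1

# Rich pipeline: full transcriptomic readout across several cell types.
n_genes_measured = 20000
n_cell_types = 4
rich_labels = n_candidates * n_genes_measured * n_cell_types

print(f"sparse pipeline: {sparse_labels:,} labelled measurements")
print(f"rich pipeline:   {rich_labels:,} labelled measurements")
# 2,000 vs 160,000,000 measurements from the same screen. Whether the extra
# signal is learnable depends on the mechanism being constrained enough for
# consistent rules to exist, which is the point of this criterion.
```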
What ties these four criteria together: at this stage of the field, we think the best techbio investments are in companies where the biology is lawful enough for foundation models to add real value, and where the team controls the experimental infrastructure to keep pushing the noise floor down. That view may evolve as the models and the data improve. But for now, the FM is necessary but not sufficient. The moat is the loop.
