When will we have the GPT moment in biology?

Oded Kalev | Jan 6, 2026

Many people are asking me, "When will we have the GPT moment in biology?"

I’ve heard this question so many times recently that I've decided to write about it. The standard answer is "we just need more data."

That is factually wrong.
In terms of raw volume, biology wins. The Sequence Read Archive (SRA) holds over 50 Petabytes of data. We are drowning in data. The problem isn't the Volume of Data.
It is the Scarcity of Ground Truth.

In NLP, the internet gives us free labels. In Biology, we have 50PB of raw sequences but only ~560,000 labeled entries in Swiss-Prot. That is 0.000001% of the raw data. We are trying to learn physics from a dataset where 99.999999% of the answers are missing.

This leads to the Topological Trap.
Text forms a dense, continuous manifold where you can smoothly interpolate between meanings. Biology forms isolated, hyper-dense clusters floating in a vast, empty void.

A standard 100-amino-acid protein has 20^100 (roughly 10^130) possible sequences, yet evolution has explored less than one part in 10^80 of this space. As Nobel Laureate Frances Arnold argues, the universe of possible proteins is "vast and mostly empty."
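The back-of-envelope arithmetic behind that claim can be checked in a few lines. This is an illustrative sketch: the ~10^50 bound on sequences evolution could have sampled is a deliberately generous assumption, not a measured figure.

```python
import math

# Sequence space for a 100-residue protein: 20 choices per position.
log10_space = 100 * math.log10(20)  # ~130, i.e. ~10^130 sequences

# Generous upper bound on sequences evolution could ever have sampled
# (assumption for illustration; most published estimates are far lower).
log10_explored = 50

# Fraction of the space that evolution has touched, in log10 terms.
log10_fraction = log10_explored - log10_space  # ~ -80
print(f"sequence space  ~ 10^{log10_space:.0f}")
print(f"explored fraction ~ 10^{log10_fraction:.0f}")
```

Even with an implausibly high estimate of what evolution has sampled, the explored fraction comes out around 10^-80.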

This sparsity, combined with tokenization, creates the Interpolation Fallacy.
Text uses semantic tokens (subwords). Biology uses chemical tokens (amino acids). A single "Alanine" has no meaning in isolation. When models attempt to interpolate between evolutionary clusters using these tokens, the physics breaks down. You don't get a "hybrid" protein. You get aggregated "gibberish."

To make matters worse, we train on "Positive-Only" Data (survivors) and Noisy Labels (stochastic assays). Our models don't know what not to design, and the ground truth itself has massive error bars.

Not only is the math more challenging, but the economics are also brutal.
If a language model hallucinates, a human spots it instantly for pennies. If a bio-model hallucinates a structure that looks valid but sits in the "void," the only way to catch it is wet-lab synthesis. We are navigating a dark ocean where the "spellcheck" costs $15,000 and takes two weeks.

The breakthrough won't come from bigger GPUs. It will come when we stop forcing biology into text-based architectures and adapt our math to the sparse, discontinuous nature of biological space.

This is exactly what we do at Converge Bio to bring the "Converge moment" to biology. 😉