20M · Cells in training · From public single-cell atlases
4,479 · Donor samples · Across 350+ diseases
SOTA · Performance · Outperforms PaSCient
The scientific challenge


Figure 1 · The ConvergeCELL architecture, from the bioRxiv preprint.
The Model
ConvergeCELL is a patient-level foundation model trained on 20 million cells across 4,479 donor samples spanning more than 350 diseases.
Rather than learning from cells in isolation, the model represents entire patient samples in a single embedding space. Through supervised contrastive learning organized at the disease-family level — oncological, immune-inflammatory, metabolic-vascular, and others — ConvergeCELL learns shared pathophysiological axes that transfer to held-out conditions. The result is a model that arrives at a new patient cohort with biologically informed priors, not a blank slate.
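As a rough illustration of supervised contrastive learning organized by disease family, the sketch below computes a SupCon-style loss over patient-sample embeddings, treating samples that share a family label as positives. The function name, the temperature value, and the toy embeddings are assumptions made for illustration; this is not ConvergeCELL's training code.

```python
import math

def supcon_loss(embeddings, families, temperature=0.1):
    """SupCon-style loss: samples sharing a disease family act as positives.

    Hypothetical helper for illustration only.
    """
    # L2-normalize so dot products become cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    total, n_anchors = 0.0, 0
    for i, anchor in enumerate(normed):
        positives = [j for j, fam in enumerate(families)
                     if j != i and fam == families[i]]
        if not positives:
            continue  # an anchor with no same-family partner is skipped

        # Temperature-scaled similarity to every other sample.
        logits = {
            j: sum(a * b for a, b in zip(anchor, other)) / temperature
            for j, other in enumerate(normed) if j != i
        }
        # Numerically stable log-sum-exp over all non-anchor samples.
        max_logit = max(logits.values())
        log_denom = max_logit + math.log(
            sum(math.exp(v - max_logit) for v in logits.values())
        )
        # Average negative log-probability of the positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0

# Toy 2-D embeddings: samples 0 and 1 lie close together, as do 2 and 3.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
clustered = supcon_loss(
    toy_embeddings,
    ["oncological", "oncological", "immune-inflammatory", "immune-inflammatory"],
)
mixed = supcon_loss(
    toy_embeddings,
    ["oncological", "immune-inflammatory", "oncological", "immune-inflammatory"],
)
```

In this toy setup the loss is lower when the family labels agree with the geometry of the embedding space (`clustered`) than when they cut across it (`mixed`), which is the pressure that shapes a shared embedding space during training.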
A knowledge distillation module extends the same representation to bulk RNA-seq, the format most clinical cohorts and retrospective studies actually live in. One unified representation, two transcriptomic data types, validated on both.
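One common way to set up this kind of distillation, sketched here under stated assumptions, is to aggregate a sample's single-cell counts into a bulk-like profile and train a student encoder so that its embedding of that profile matches the single-cell teacher's embedding. The page does not describe how the module is actually built, so the pseudo-bulk step, the mean-squared-error objective, and every name below are hypothetical.

```python
def pseudobulk(cell_counts):
    """Sum per-cell gene counts into one bulk-like profile (assumed step)."""
    return [sum(gene_column) for gene_column in zip(*cell_counts)]

def distillation_loss(student_embedding, teacher_embedding):
    """Mean squared error between student and frozen-teacher embeddings."""
    if len(student_embedding) != len(teacher_embedding):
        raise ValueError("embeddings must have the same dimension")
    squared = [(s - t) ** 2 for s, t in zip(student_embedding, teacher_embedding)]
    return sum(squared) / len(squared)

# Three cells x three genes of toy counts for one donor sample.
cells = [[3, 0, 1],
         [2, 1, 0],
         [0, 4, 1]]
bulk_profile = pseudobulk(cells)

# Placeholder embeddings standing in for the teacher (single-cell) and
# student (bulk) encoder outputs; real encoders are out of scope here.
teacher = [0.5, -0.2]
student = [0.4, 0.0]
loss = distillation_loss(student, teacher)
```

Minimizing a loss of this shape would pull the bulk encoder toward the representation the single-cell model already learned, so both data types land in one embedding space.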

NOW OPEN ON HUGGING FACE
The patient representation engine behind ConvergeCELL is available as open source on Hugging Face, as a resource for the scientific community. The full methods and results are described in the bioRxiv preprint.
We validated ConvergeCELL across three independent disease cohorts the model had never seen during training. Each was chosen to stress-test a different dimension of the platform: tissue, modality, and biology. In each case, ConvergeCELL was applied zero-shot, with no fine-tuning, no retraining, and no disease-specific adaptation.
0.87 (vs. PaSCient: 0.67)
Belimumab target rank · 26 / 13,000 · TNFSF13B · top 0.2% of genes
0.72 (vs. PaSCient: 0.50 · Standard ML: 0.41)
Belimumab target rank · 3 / 11,000 · TNFRSF17 (BCMA) · PaSCient ranked 6,315
Top-10 attributed · 6/10 genes are known sepsis biomarkers
Modality transfer · scRNA → bulk · Trained where science is rich, deployed where data is routine
Across all three cohorts
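The rank figures above can be sanity-checked with simple arithmetic. The helper below is a hypothetical convenience function that converts a rank among the scored genes into a percentage of the gene list; the gene totals are the rounded values quoted on this page.

```python
def rank_percent(rank, n_genes):
    """Return a gene's rank as a percentage of all ranked genes."""
    if not 1 <= rank <= n_genes:
        raise ValueError("rank must lie between 1 and n_genes")
    return 100.0 * rank / n_genes

tnfsf13b_pct = rank_percent(26, 13_000)      # rank 26 of 13,000
tnfrsf17_pct = rank_percent(3, 11_000)       # rank 3 of 11,000
pascient_pct = rank_percent(6_315, 11_000)   # PaSCient's rank for TNFRSF17

print(f"TNFSF13B: top {tnfsf13b_pct:.2f}% of genes")
print(f"TNFRSF17: top {tnfrsf17_pct:.3f}% of genes")
print(f"TNFRSF17 under PaSCient: top {pascient_pct:.1f}% of genes")
```

Rank 26 of 13,000 works out to 0.2%, matching the figure quoted above. Rank 3 of 11,000 is roughly 0.03%, while a rank of 6,315 sits just past the midpoint of the list.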

From genes to hypotheses.
261 donors · PBMCs · Single-cell RNA-seq
Identifying disease-associated genes is only one part of the workflow, and a readout of what the model has learned. ConvergeCELL also includes a hypothesis generation agent that connects a large language model to biomedical knowledge bases: PubMed, Open Targets, and ClinicalTrials.gov.
For each candidate gene, the agent classifies the strength of existing evidence, then generates a structured mechanism-of-action hypothesis covering the gene's pathway context, mechanistic role in disease biology, and potential therapeutic implications.
Direct · Clinical evidence linking gene to disease.
Indirect · Pathway-level or analogous evidence.
No evidence · Novel candidate; flag for in-vitro validation.
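The three tiers amount to a simple precedence rule, sketched below. In the real agent the judgment is made by a language model reading the knowledge bases; the counts used as inputs here, and all names in the snippet, are hypothetical stand-ins for that step.

```python
from enum import Enum

class EvidenceTier(Enum):
    DIRECT = "Direct"
    INDIRECT = "Indirect"
    NO_EVIDENCE = "No evidence"

def classify_evidence(clinical_links, pathway_links):
    """Pick the strongest tier supported by the retrieved evidence.

    clinical_links: number of sources linking the gene to the disease clinically.
    pathway_links: number of sources with pathway-level or analogous support.
    Both inputs are illustrative assumptions, not the agent's real signals.
    """
    if clinical_links > 0:
        return EvidenceTier.DIRECT
    if pathway_links > 0:
        return EvidenceTier.INDIRECT
    return EvidenceTier.NO_EVIDENCE

def next_step(tier):
    """Suggested follow-up; only the 'No evidence' action comes from the page."""
    if tier is EvidenceTier.NO_EVIDENCE:
        return "flag for in-vitro validation"
    return "review cited sources"

tier = classify_evidence(clinical_links=0, pathway_links=0)
action = next_step(tier)
```

Direct evidence takes precedence over indirect evidence, so a gene with both kinds of support is still reported as Direct.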
The end output is not a gene list. It is a hypothesis card that translational scientists can act on, ready to inform target prioritization, in-vitro experiments, or partnership conversations.
Hypothesis card
SLE · Rank 26
Direct evidence
Candidate gene · TNFSF13B (aka BAFF, BLyS)
Pathway context
Mechanistic role
Therapeutic implications
Sources
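A hypothesis card like the one above maps naturally onto a small record type. The sketch below is one possible shape, with field names taken from the card's section labels; the class and its methods are assumptions, not a published ConvergeCELL schema. The example pairs the rank-26 SLE result with TNFSF13B, the gene reported at that rank earlier on this page, and leaves the free-text sections empty rather than inventing content for them.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class HypothesisCard:
    """Hypothetical container mirroring the sections of a hypothesis card."""
    disease: str
    rank: int
    evidence: str
    candidate_gene: str
    aliases: List[str] = field(default_factory=list)
    pathway_context: str = ""
    mechanistic_role: str = ""
    therapeutic_implications: str = ""
    sources: List[str] = field(default_factory=list)

    def header(self) -> str:
        """One-line summary in the card's own 'A · B · C' style."""
        return f"{self.disease} · Rank {self.rank} · {self.evidence}"

    def missing_sections(self) -> List[str]:
        """Names of sections the agent has not filled in yet."""
        sections = {
            "pathway_context": self.pathway_context,
            "mechanistic_role": self.mechanistic_role,
            "therapeutic_implications": self.therapeutic_implications,
            "sources": self.sources,
        }
        return [name for name, value in sections.items() if not value]

card = HypothesisCard(
    disease="SLE",
    rank=26,
    evidence="Direct evidence",
    candidate_gene="TNFSF13B",  # assumed from the rank-26 result on this page
    aliases=["BAFF", "BLyS"],
)
summary = card.header()
todo = card.missing_sections()
```

Tracking unfilled sections explicitly makes it easy to tell a finished card from one that still needs the agent's mechanism-of-action text and citations.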


