Hi everyone,
I wrote a blog post about what I learned while working on AlphaGenome: "AlphaGenome: A journey through the biology and the data." I thought this community might enjoy it!
When I joined the Science team at Google DeepMind to work on AlphaGenome, I came from a vision and language modeling background. The learning curve for the biology was incredibly steep, but the engineering challenges were absolutely fascinating. I wrote this post as the guide I wish I had when I first started.
In the post, I break down the core biology (genes, phenotypes, and why we care about tracking mutations) and dive deep into what makes AlphaGenome such a unique ML challenge. A few highlights:
- The Scale: Dealing with a model that has a 1-million token context window and outputs almost 6,000 prediction channels per position, resulting in about 6 billion predictions per 1M-token slice.
- The Weird Training Data: How we train a model on effectively just two examples (a single human genome and a single mouse genome), using 50 TB of bulk sequencing data across ~300 conditions.
- The Impact: How foundational models like AlphaGenome can digitize expensive wet-lab experiments, act as a queryable retrieval engine for biology, and dramatically increase research velocity for applications like cancer treatment and generative genome design.
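For a feel of the scale in the first bullet, here's a quick back-of-envelope sketch (the numbers are the approximate figures from the post, not exact model specs):

```python
# Rough output-scale estimate, using the approximate figures above:
# a 1-million-token context window and ~6,000 prediction channels
# per position.
context_length = 1_000_000       # tokens (positions) per input slice
channels_per_position = 6_000    # prediction channels per position

total_predictions = context_length * channels_per_position
print(f"{total_predictions:,} predictions per slice")
# -> 6,000,000,000 predictions per slice
```

That's roughly 6 billion output values for a single 1M-token input, which is what makes the inference and storage engineering non-trivial.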
If you're transitioning into computational biology from traditional ML, or if you're just curious about the data engine powering these massive models, I hope you'll give it a read.
I’d love to hear your thoughts! How do you see foundational genomics models shaping the future of wet-lab research and genetic engineering?
Cheers,
Adam