Is it possible to add more data for a particular cell type?

Ilya_Flyamer · June 26, 2025, 3:49pm

Thanks a lot for this great work, and for making the model so easy to access! I am working with mouse embryonic stem cells (‘EFO:0004038’). It seems like AlphaGenome can output only contact maps for it, which to me is very unexpected, since it’s one of the most popular models in epigenetics research, with lots of available ChIP, ATAC, RNA-seq, etc. datasets. I am guessing it’s not trivial, but is it possible to add more tracks somehow? Are you planning to train the model further to include more data?

Dhavi · June 26, 2025, 4:51pm

Hi! It’s not currently possible to add more tracks, but if you’re familiar with any particular datasets please let us know.

Ziga_Avsec · June 26, 2025, 7:44pm

Hey! We don’t have a mechanism for simply extending the tracks yet. We could consider training another model in the future with more data or people could fine-tune the model on the data of interest once we release the weights.

Do you have any pointers to resources with more mESC tracks?

The main reason for not having them in the model is that for ChIP, ATAC, and RNA-seq datasets, we only used data from ENCODE. I couldn’t find any datasets there when searching for mESC.

Ilya_Flyamer · June 27, 2025, 7:26am

I see, thank you for the explanation! Indeed, it makes sense that you can’t just collect all different data from GEO and need a centralized resource with uniform processing etc… Well, there is some data on 4DN, although for mESCs there is also not too much, and the sample type annotation sometimes is more detailed (i.e. specific cell line).

Then there are also databases that collect all epigenomic data from GEO, then reanalyse and QC it. Off the top of my head I remember https://db3.cistrome.org/browser/. There is a ton of data there, like ~100 ATAC-seq tracks just for mESCs, including different perturbations. But since the metadata are automatically extracted from a variety of styles, there are some errors that are hard to avoid…

Ilya_Flyamer · July 17, 2025, 1:36pm

For future reference, there are (at least) three ontology terms that correspond to mouse ES cells, and using the other two allows predicting a lot of ChIP-seq and RNA-seq tracks: [‘EFO:0007751’, ‘EFO:0004038’, ‘EFO:0005483’]. This makes much more sense to me.
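
To check which output types each of the three terms covers, one option is to group a track metadata table by ontology term. The snippet below is only a minimal sketch of that idea in plain Python. The `TRACKS` rows are made-up placeholders rather than real AlphaGenome metadata, and the field names `ontology_curie` and `output_type` as well as the helper `output_types_by_term` are assumptions for illustration.

```python
from collections import defaultdict

# The three mouse ES cell ontology terms listed in the post above.
MESC_TERMS = ["EFO:0007751", "EFO:0004038", "EFO:0005483"]

# Hypothetical metadata rows: placeholder values for illustration only,
# NOT the actual track list of any model.
TRACKS = [
    {"name": "track_a", "ontology_curie": "EFO:0004038", "output_type": "CONTACT_MAPS"},
    {"name": "track_b", "ontology_curie": "EFO:0007751", "output_type": "CHIP_HISTONE"},
    {"name": "track_c", "ontology_curie": "EFO:0007751", "output_type": "RNA_SEQ"},
    {"name": "track_d", "ontology_curie": "EFO:0005483", "output_type": "CHIP_TF"},
    {"name": "track_e", "ontology_curie": "EXAMPLE:0000001", "output_type": "RNA_SEQ"},
]


def output_types_by_term(tracks, terms):
    """Map each requested ontology term to the sorted output types found for it."""
    wanted = set(terms)
    found = defaultdict(set)
    for row in tracks:
        if row["ontology_curie"] in wanted:
            found[row["ontology_curie"]].add(row["output_type"])
    # Terms without any matching track map to an empty list.
    return {term: sorted(found[term]) for term in terms}


if __name__ == "__main__":
    for term, types in output_types_by_term(TRACKS, MESC_TERMS).items():
        print(term, types)
```

Running the same grouping on the real metadata table, instead of the placeholder rows, would show which of the three terms to request for a given assay.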
