Is it possible to add more data for a particular cell type?

Thanks a lot for this great work, and making the model so easy to access! I am working with mouse embryonic stem cells (‘EFO:0004038’). It seems like AlphaGenome can output only contact maps for it… Which to me is very unexpected, since it’s one of the most popular models in epigenetics research with lots of available ChIP, ATAC, RNA-seq, etc datasets. I am guessing it’s not trivial, but is it possible to add more tracks somehow? Are you planning to train the model further to include more data?

Hi! It’s not currently possible to add more tracks, but if you’re familiar with any particular datasets please let us know.

Hey! We don’t have a mechanism for simply extending the tracks yet. We could consider training another model in the future with more data or people could fine-tune the model on the data of interest once we release the weights.

Do you have any pointers to resources with more mESC tracks?

The main reason for not having them in the model is that for ChIP, ATAC, and RNA-seq datasets, we only used data from ENCODE. I couldn’t find any datasets there when searching for mESC.

I see, thank you for the explanation! Indeed, it makes sense that you can’t just collect all different data from GEO and need a centralized resource with uniform processing etc… Well, there is some data on 4DN, although for mESCs there is also not too much, and the sample type annotation sometimes is more detailed (i.e. specific cell line).

Then there are also databases that collect all epigenomic data from GEO, then reanalyse and QC it. Off the top of my head I remember https://db3.cistrome.org/browser/. There is a ton of data there, like ~100 ATAC-seq tracks just for mESCs, including different perturbations. But since the metadata are automatically extracted from a variety of styles, there are some errors that are hard to avoid…

For future reference, there are (at least) three ontology terms that correspond to mouse ES cells, and using the other two allows predicting a lot of ChIP-seq and RNA-seq tracks: ['EFO:0007751', 'EFO:0004038', 'EFO:0005483']. This makes much more sense to me :)
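
In case it helps anyone landing here later, here is a rough sketch of how one could check which output types each of these three terms covers, given a table of track metadata. It is plain Python and does not call the AlphaGenome API. The helper `count_tracks_by_term` and the key names `ontology_curie` and `output_type` are my own assumptions, not a confirmed schema, and the example rows are made up purely to show the call shape.

```python
from collections import Counter

# The three mouse ES cell ontology terms listed in this thread.
MESC_TERMS = ['EFO:0007751', 'EFO:0004038', 'EFO:0005483']


def count_tracks_by_term(rows, terms=MESC_TERMS):
    """Count tracks per (ontology term, output type) for the given terms.

    `rows` is any iterable of dicts. The keys 'ontology_curie' and
    'output_type' are assumed names, not a confirmed metadata schema.
    """
    wanted = set(terms)
    counts = Counter()
    for row in rows:
        term = row.get('ontology_curie')
        if term in wanted:
            counts[(term, row.get('output_type'))] += 1
    return counts


# Made-up placeholder rows, only to illustrate usage (NOT real metadata).
toy_rows = [
    {'ontology_curie': 'EFO:0004038', 'output_type': 'CONTACT_MAPS'},
    {'ontology_curie': 'EFO:0005483', 'output_type': 'RNA_SEQ'},
    {'ontology_curie': 'EFO:0005483', 'output_type': 'RNA_SEQ'},
    {'ontology_curie': 'OTHER:0000000', 'output_type': 'ATAC'},
]
toy_counts = count_tracks_by_term(toy_rows)

for (term, output_type), n in sorted(toy_counts.items()):
    print(term, output_type, n)
```

With real metadata, the same tally would show at a glance which of the three terms to request for a given assay, rather than assuming a single term covers everything.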