Availability of preprocessed training data

Hi,

I would like to evaluate AlphaGenome's track-prediction performance against other models, specifically on GTEx coverage and splice junctions.

To make the comparison fair, I correlate predictions against training data.
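
For reference, here is a minimal sketch of this kind of comparison, assuming the predicted and observed coverage for one track are available as equal-length NumPy arrays over the same interval. The function name `track_pearson` and the optional log1p transform are illustrative assumptions, not the evaluation procedure used in the paper.

```python
import numpy as np

def track_pearson(pred, obs, log_transform=False):
    """Pearson correlation between a predicted and an observed coverage track.

    Hypothetical helper for illustration; not part of any AlphaGenome API.
    """
    pred = np.asarray(pred, dtype=np.float64)
    obs = np.asarray(obs, dtype=np.float64)
    if pred.shape != obs.shape:
        raise ValueError("pred and obs must have the same shape")
    if log_transform:
        # Assumption: compress the dynamic range of coverage before correlating.
        pred, obs = np.log1p(pred), np.log1p(obs)
    # A constant track has zero variance, so the correlation is undefined.
    if pred.std() == 0 or obs.std() == 0:
        return float("nan")
    return float(np.corrcoef(pred, obs)[0, 1])
```

Running the same function, with the same transform, on every model's predictions keeps the comparison consistent across models.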

From the paper, I understand that you averaged coverage across all bigWig files for each tissue and combined the splice junctions. Where can we download these preprocessed data?

Thank you very much in advance for your help,

Miquel

Hi there!

Thanks for reaching out.

Currently, the preprocessed training data, including averaged GTEx coverage and combined splice junctions, are not available for download. While the team is considering adding precomputed static data to our longer-term roadmap, we do not currently have other publicly shareable material for the curated dataset.

To recreate the training data, you can download the primary experimental datasets using the comprehensive list of file accessions provided in Supplementary Table 2 of our publication. You can then apply the data processing and filtering steps outlined in the Methods section.
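
As a rough illustration of the per-tissue averaging step only, the sketch below assumes the coverage of each sample has already been read into an equal-length NumPy array for the same genomic interval. The function name `average_coverage` and the choice to count missing values as zero coverage are assumptions made for this example. It does not reproduce the normalization and filtering described in the Methods.

```python
import numpy as np

def average_coverage(tracks):
    """Average per-base coverage across the samples of one tissue.

    Hypothetical helper for illustration; `tracks` is a list of
    equal-length 1-D arrays, one per sample.
    """
    if len(tracks) == 0:
        raise ValueError("need at least one track")
    # np.stack raises an error if the tracks differ in length.
    stacked = np.stack([np.asarray(t, dtype=np.float64) for t in tracks])
    # Assumption: NaN means "no signal" and is counted as zero coverage.
    stacked = np.nan_to_num(stacked, nan=0.0)
    return stacked.mean(axis=0)
```

Each input array could come from a bigWig reader of your choice. Comparing the averaged track of a single tissue against the published figures is one way to check a reimplementation step by step.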

For more details on the training data, please refer to the following resource here

Kind regards,
Tumi

Thank you @Tumi_Makgatho for the quick reply. I followed these steps, but my results differ from Figure 2c of your paper and show worse performance. Because the preprocessing described in the Methods is quite lengthy, I cannot tell whether I made a mistake in the data preprocessing or whether the performance is actually not as good as reported.

Alternatively, could you share the scripts used to preprocess the data?

Thanks again! Best,

Miquel

Hi Miquel,

The training data is available, except for the GTEx tracks, which cannot be shared due to licensing constraints. To load the training datasets, you can use the following resource: https://github.com/google-deepmind/alphagenome_research/blob/main/src/alphagenome_research/io/dataset.py

Kind regards,
Tumi