Did AlphaGenome Use Reference Genome DNA or Sequencing Data to Predict Functional Genomic Tracks During Training?

Hello, I am a postgraduate student interested in genome foundation models. I have been carefully reading the AlphaGenome study and greatly appreciate the impressive work presented.

I would be very grateful if you could kindly clarify one methodological detail. In the AlphaGenome study, were functional genomic tracks (e.g., ChIP-seq, ATAC-seq, RNA-seq signals) predicted directly from the reference genome DNA sequence (hg38/mm10), meaning that the model learns to fit experimental signal tracks from sequence alone? Or were aligned sequencing reads (i.e., reads mapped to the reference genome) used as model inputs to generate or reconstruct these tracks?

Hi there!

Thanks for reaching out.

In the AlphaGenome study, functional genomic tracks (such as ChIP-seq, ATAC-seq, and RNA-seq signals) are predicted directly from the DNA sequence alone. The model does not use aligned sequencing reads as inputs to generate or reconstruct these tracks.

Specifically, the model takes a 1-megabase DNA sequence and a species identifier (denoting whether the sequence is from the human or mouse genome) as input and generates its predictions from those alone. Reference genomes (hg38/mm10) were used during training. Aligned sequencing reads were used only in the initial data-processing phase, to create the ground-truth experimental signal tracks (such as bigWig files) that the model learns to fit.
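To make the split between inputs and targets concrete, here is a minimal Python sketch. It is only an illustration, not the AlphaGenome code: the function names, the species encoding, and the toy per-base coverage calculation are all assumptions made for this example. It shows that reads contribute only to the target track, while the model input is just the encoded sequence plus a species identifier.

```python
# Illustrative sketch only: all names and encodings here are hypothetical,
# not taken from the AlphaGenome implementation.
from typing import Dict, List, Tuple

BASES = "ACGT"
SPECIES_IDS = {"human": 0, "mouse": 1}  # assumed encoding for illustration


def one_hot_encode(sequence: str) -> List[List[int]]:
    """Encode DNA as one row per base; unknown bases (e.g. N) become all zeros."""
    return [[1 if base == b else 0 for b in BASES] for base in sequence.upper()]


def coverage_from_reads(reads: List[Tuple[int, int]], length: int) -> List[int]:
    """Preprocessing step: convert aligned read intervals [start, end)
    into a per-base coverage track (a toy stand-in for a bigWig signal)."""
    track = [0] * length
    for start, end in reads:
        for pos in range(max(start, 0), min(end, length)):
            track[pos] += 1
    return track


def make_training_example(
    sequence: str, species: str, reads: List[Tuple[int, int]]
) -> Tuple[Dict[str, object], List[int]]:
    """Build (model_input, target). Reads appear only on the target side."""
    model_input = {
        "sequence": one_hot_encode(sequence),
        "species_id": SPECIES_IDS[species],
    }
    target = coverage_from_reads(reads, len(sequence))
    return model_input, target


# Toy example: a 5-base window with two aligned reads.
example_input, example_target = make_training_example(
    "ACGTN", "human", [(0, 3), (2, 5)]
)
print(example_target)  # [1, 1, 2, 1, 1]
```

At prediction time only the `model_input` part would exist; the coverage track is what the model is trained to reproduce from sequence.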

Kind regards,
Tumi

Thank you for the clarification.