Proposal for Analyzing Epigenetic Age Markers With AlphaGenome

AlphaGenome does not give exactly the same output when the same input is fed through the model repeatedly. I propose measuring these slight differences to test whether the distribution of outputs can be mapped to the demographic distribution of the datasets used to train AlphaGenome. If this holds for humans and mice, it would mean AlphaGenome has inadvertently (or purposely?) characterized the epigenetic changes that occur in people and mice as they age, at base-pair resolution.

Proper verification would require running AlphaGenome enough times, across enough of each species’ genome, and in enough tissues to match the size of the training datasets, while accounting for how likely each age group is to contribute each type of tissue sample (everyone gives blood, but few teens get biopsies during colonoscopies). GTEx and FANTOM5 are already formatted for accessing age and sex; age data for ENCODE and 4D Nucleome could in theory be obtained by tracing tissue samples back to their sources. Initial verification would be done on blood, because it has the most data across the widest range of ages, to see whether AlphaGenome’s most common outputs match the known epigenomes of the most common blood donors (males over thirty). If they do, larger-scale studies would have grounds to seek funding.

A comprehensive map of the human epigenome as it ages would let work begin on identifying the protein-coding and non-coding genes that cause the predicted histone modifications, or on how to “open” and “close” chromatin. Additionally, AlphaGenome’s strength in predicting differences across tissue types means age-related treatments could start as tissue-specific modifications, for example, practicing on rejuvenating skin before editing cardiac muscle. This could be done with AlphaGenome’s core functionality of predicting disease-causing mutations, by finding which mutations produce the same epigenetic changes found in the elderly.

7/21/25 Edit: *FANTOMS → FANTOM5

AlphaGenome scores for a given DNA sequence would need to be compared base pair by base pair, rather than as whole outputs. Similarity would be quantified by either the score value at each base pair or the difference between values at the same base pair. This allows bins or quantiles of values to be compared to bins or quantiles of tissue sample donor demographics.
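
As a rough illustration of that comparison, here is a minimal Python sketch using NumPy. It does not call AlphaGenome: the two score tracks and the donor ages are synthetic placeholders, and the helper names (bin_fractions, total_variation) are hypothetical. Matching equal-width score bins one-to-one against equal-width age bins is my own simplifying assumption, not something established by the model or the datasets.

```python
import numpy as np

def bin_fractions(values, n_bins):
    """Fraction of values falling in each of n_bins equal-width bins."""
    values = np.asarray(values, dtype=float).ravel()
    counts, _ = np.histogram(values, bins=n_bins)
    return counts / counts.sum()

def total_variation(p, q):
    """Total variation distance between two discrete distributions (0 = identical, 1 = disjoint)."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return 0.5 * float(np.abs(p - q).sum())

# Synthetic stand-ins (assumptions): two per-base-pair score tracks for the
# same sequence, and a list of donor ages for the matching tissue.
rng = np.random.default_rng(0)
run_a = rng.normal(1.0, 0.1, size=1000)
run_b = run_a + rng.normal(0.0, 0.01, size=1000)
donor_ages = rng.integers(20, 80, size=500)

# Compare position by position, then bin the differences and the ages.
per_bp_difference = run_a - run_b
score_fracs = bin_fractions(per_bp_difference, n_bins=5)
age_fracs = bin_fractions(donor_ages, n_bins=5)

# A small distance would mean the two binned distributions have a similar shape.
distance = total_variation(score_fracs, age_fracs)
print(score_fracs, age_fracs, distance)
```

A similar shape alone would not show that scores track age; it would only flag a region as worth a closer look.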

‘UBERON:0000178’, the ontology term for blood, is not defined for all output types. The colon samples used in the demos and tutorials are good candidates for preliminary testing, for the same reason they were used as examples.

After further reading, the proposal is not feasible. Even if AlphaGenome’s deterministic structure were disrupted through slight variations in non-essential regions, all demographic data was lost during data processing. I see now that the genomic and epigenomic data AlphaGenome was trained on were averaged specifically to prevent factors such as age from confounding the consistency of in silico mutagenesis.

However, a genomic atlas of a young person and one of an old person could theoretically be assembled from the same data that was used to train AlphaGenome. Using these atlases, existing methylation data, and known age-related genes, attention can be focused on hotspots where epigenetic profiles diverge most. Then, through random mutations, variants can be scored on whether they result in genomes resembling those of old or young people. If different DNA mutations in the same region can produce either a young-resembling or an old-resembling genome, then, by a dose-response argument, that region’s effect on keeping people healthy may be causal rather than merely correlated.
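
One way the scoring step could look, as a minimal Python sketch under stated assumptions: the young and old reference profiles are made-up arrays standing in for the atlases at one hotspot, the variant profile stands in for a model prediction, and age_resemblance is a hypothetical helper. Euclidean distance is an arbitrary choice of similarity measure, and no AlphaGenome API is used.

```python
import numpy as np

def age_resemblance(profile, young_ref, old_ref):
    """Score in [-1, 1]: -1 matches the young reference, +1 matches the old one."""
    profile = np.asarray(profile, dtype=float)
    young_ref = np.asarray(young_ref, dtype=float)
    old_ref = np.asarray(old_ref, dtype=float)
    d_young = np.linalg.norm(profile - young_ref)
    d_old = np.linalg.norm(profile - old_ref)
    total = d_young + d_old
    if total == 0:
        # The two references are identical at this region, so there is no signal.
        return 0.0
    return float((d_young - d_old) / total)

# Placeholder epigenetic profiles over one hotspot (assumed values).
young_ref = np.array([0.2, 0.4, 0.6])
old_ref = np.array([0.6, 0.4, 0.2])

# Placeholder predicted profiles for two hypothetical variants in that region.
variant_profiles = {
    "variant_1": np.array([0.25, 0.4, 0.55]),  # sits near the young reference
    "variant_2": np.array([0.55, 0.4, 0.25]),  # sits near the old reference
}

scores = {name: age_resemblance(p, young_ref, old_ref)
          for name, p in variant_profiles.items()}
print(scores)
```

A region where different variants land on both sides of zero is the kind of region the dose-response argument above would single out.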