Proposal for Analyzing Epigenetic Age Markers With AlphaGenome

AlphaGenome does not return exactly the same output when the same input is fed through the model repeatedly. I propose measuring these slight differences to test whether the distribution of outputs can be mapped onto the demographic distribution of the datasets used to train AlphaGenome. If this holds for humans and mice, it would mean AlphaGenome has inadvertently (or purposely?) characterized the exact epigenetic changes that occur in people and mice as they age, down to base-pair resolution.

Proper verification would require running AlphaGenome enough times, across enough of each species’ genome, in enough tissues to match the size of the training datasets, while accounting for how likely each age group is to submit each type of tissue sample (everyone sends blood, but not many teens get biopsies during colonoscopies). While GTEx and FANTOM5 are already formatted for accessing age and sex, age data could in principle be recovered for ENCODE and 4D Nucleome by tracing tissue samples back to their sources. Initial verification would be done on blood, because it has the most data across all ages, to see whether AlphaGenome’s most common outputs match the known epigenomes of the most common blood donors (males over thirty). If they do, larger-scale studies would have grounds to seek funding.

Creating a comprehensive map of the human epigenome as it ages means work can begin on identifying the protein-coding and non-coding genes that cause predicted histone modifications, and on learning how to “open” and “close” chromatin. Additionally, AlphaGenome’s strength in predicting differences across tissue types means age-related treatments can start as tissue-specific modifications, for example, practicing rejuvenating skin before editing cardiac muscle. This could build on AlphaGenome’s core functionality of predicting disease-causing mutations by finding which mutations result in the same epigenetic changes found in the elderly.

7/21/25 Edit: *FANTOMS → FANTOM5

AlphaGenome scores for a given DNA sequence would need to be compared base pair by base pair, rather than as whole output tracks. Similarity would be quantified either by the score value at each base pair or by the difference between values across runs. This allows bins/quantiles of score values to be compared against bins/quantiles of tissue-sample donor demographics.
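The binning comparison above can be sketched numerically. This is a toy illustration with synthetic data, not real AlphaGenome output: `run_a`/`run_b` stand in for two model runs on the same sequence, and `donor_ages` stands in for donor metadata.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-base-pair AlphaGenome scores from two runs on the same sequence
run_a = rng.normal(loc=0.5, scale=0.05, size=1000)
run_b = run_a + rng.normal(scale=0.01, size=1000)  # slight run-to-run variation

# Per-base differences, bucketed into quartile bins
diffs = run_b - run_a
quantile_edges = np.quantile(diffs, [0.0, 0.25, 0.5, 0.75, 1.0])
score_bins = np.clip(np.searchsorted(quantile_edges, diffs, side="right") - 1, 0, 3)

# Hypothetical donor-age quartile bins (same number of bins) for comparison
donor_ages = rng.integers(18, 80, size=1000)
age_edges = np.quantile(donor_ages, [0.0, 0.25, 0.5, 0.75, 1.0])
age_bins = np.clip(np.searchsorted(age_edges, donor_ages, side="right") - 1, 0, 3)

# Compare the two distributions bin by bin
score_counts = np.bincount(score_bins, minlength=4)
age_counts = np.bincount(age_bins, minlength=4)
print(score_counts, age_counts)
```

A real test would replace the synthetic arrays with AlphaGenome track values and donor demographics from GTEx/FANTOM5 metadata.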

‘UBERON:0000178’ (blood) is not defined for all output types. The colon samples used in the demos and tutorials are good candidates for preliminary testing, for the same reason they were used as examples.

After further reading, the proposal is not possible. Even if AlphaGenome’s deterministic structure were disrupted through slight variations in non-essential regions, all demographic data was lost during data processing. I see now that the genomic and epigenomic data AlphaGenome was trained on was averaged specifically to prevent factors such as age from confounding the consistency of In Silico Mutagenesis.

However, genomic atlases of a young person and an old person can theoretically be assembled from the same data that was used to train AlphaGenome. From these atlases, existing methylation data, and known age-related genes, hotspots of high divergence between epigenetic profiles can be identified and focused on. Then, through random mutations, variants can be scored on whether they produce genomes resembling those of old or young people. If different DNA mutations in the same region can yield either a young-resembling or an old-resembling genome, then, by dose-response logic, there is a possibility of causation rather than mere correlation in whether that region of DNA keeps people healthy.

I’ve put together an inverse design model/module I call EEPM3 (Expandable Epigenetic Profile Mimicry Module by Mutation) to complement AlphaGenome’s forward design of epigenetics.

EEPM3 learns how to mutate a given sequence to achieve any desired epigenetic profile. During training and inference, AlphaGenome is used to determine the result of EEPM3’s mutations. EEPM3 is rewarded more when AlphaGenome’s output better matches the desired epigenetic profile.
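The loop described above can be sketched in miniature. Everything here is a stand-in: `alphagenome_predict` mimics the oracle, `propose_mutation` mimics EEPM3's policy, and a greedy search replaces the actual GFlowNet sampler; none of these names come from the real codebase.

```python
import numpy as np

rng = np.random.default_rng(0)
TARGET = rng.random(128)  # desired epigenetic profile (stand-in tensor)

def alphagenome_predict(seq):
    """Stand-in for the AlphaGenome oracle: sequence -> predicted profile."""
    h = np.frombuffer(seq.encode(), dtype=np.uint8).astype(float)
    return np.resize(np.sin(h), 128)  # deterministic toy prediction

def propose_mutation(seq):
    """Stand-in for EEPM3's policy: mutate one random base."""
    i = rng.integers(len(seq))
    return seq[:i] + rng.choice(list("ACGT")) + seq[i + 1:]

def reward(profile):
    """Higher reward when the oracle's output is closer to the target profile."""
    return -float(np.mean((profile - TARGET) ** 2))

seq = "ACGT" * 32
best_seq, best_r = seq, reward(alphagenome_predict(seq))
for _ in range(50):  # greedy stand-in for the GFlowNet sampler
    cand = propose_mutation(best_seq)
    r = reward(alphagenome_predict(cand))
    if r > best_r:
        best_seq, best_r = cand, r
print(round(best_r, 4))
```

The real system differs in two key ways: the sampler is a GFlowNet (diverse solutions, not one greedy path), and the oracle is a rate-limited API call rather than a local function.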

Please note that I don’t have access to professional-grade resources, so everything was done with free resources using T4s in Colab, Kaggle, and my Google AI Pro Student Plan for Gemini and Antigravity. Thus, EEPM3 is currently a proof that inverse design can be applied to epigenetics and an invitation for everyone to help scale up for research and clinical applications.

Justification:
The genomic AI revolution has largely focused on the Forward Problem: predicting biological function from a fixed DNA sequence (X → Y). While profound for diagnostics, this leaves a gap in therapeutics. If we identify a disease-state epigenetic profile, knowing what it looks like isn’t enough—we need to know exactly which sequence mutations are required to force the cell into a healthy target state (Y_{target} → X_{mutated}).
EEPM3 was built to solve this Inverse Problem, acting as a universal epigenetic state-matcher. Whether the goal is to safely induce pluripotency, reverse aging markers, or reproduce a demographic-specific profile, the underlying mathematical challenge remains the same: navigating an astronomically large, discrete sequence space to find the optimal edits.

Methodology: Why GFlowNets?
In sequence design, traditional Reinforcement Learning or Markov Chain Monte Carlo methods often find one good mutation path and relentlessly exploit it. In biology, not only are there potentially millions of different solutions, but one proposed mutation may be lethal only to embryos or difficult to synthesize. We need hundreds of valid alternatives.
To achieve this, EEPM3 uses Generative Flow Networks (GFlowNets). By modeling the mutation process as a fluid network where “flow” is proportional to reward, EEPM3 learns to sample a highly diverse set of potential mutations. Because sampling probability is proportional to reward, GFlowNets also implicitly grade each mutation set’s utility.

Because I was constrained by Kaggle’s free 16GB VRAM limits and Colab’s T4 GPUs, I optimized the architecture in four ways:

Sub-Trajectory Evaluation Balance (Sub-EB): Standard models only grade the final 100kb mutated sequence, making it impossible for the AI to figure out which specific edit actually helped. EEPM3 fixes this by grading partial sequences step-by-step. It provides the network with dense, immediate feedback on every single mutation without exploding memory requirements.
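A minimal sketch of the idea behind Sub-EB, based on the sub-trajectory balance condition from the GFlowNet literature (this is my reconstruction, not the repository's actual loss code): every partial trajectory from state m to state n must satisfy F(s_m)·∏P_F = F(s_n)·∏P_B, so each individual edit gets its own learning signal.

```python
import numpy as np

def subtrajectory_balance_loss(log_F, log_pf, log_pb):
    """
    Sub-trajectory balance in log space. Averages the squared balance
    violation over ALL partial trajectories s_m -> ... -> s_n, giving
    dense per-step feedback instead of one terminal-only signal.

    log_F  : per-state log flow estimates, shape (T+1,)
    log_pf : per-step forward log-probs,   shape (T,)
    log_pb : per-step backward log-probs,  shape (T,)
    """
    T = len(log_pf)
    total, count = 0.0, 0
    for m in range(T):
        for n in range(m + 1, T + 1):
            lhs = log_F[m] + log_pf[m:n].sum()
            rhs = log_F[n] + log_pb[m:n].sum()
            total += (lhs - rhs) ** 2
            count += 1
    return total / count

# Toy trajectory of 3 mutation steps (all numbers illustrative)
log_F = np.array([0.0, -0.1, -0.3, -0.2])
log_pf = np.array([-1.0, -1.2, -0.9])
log_pb = np.array([-1.1, -1.0, -1.0])
print(round(subtrajectory_balance_loss(log_F, log_pf, log_pb), 4))
```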

Retrospective Backward Synthesis (RBS): Hitting the AlphaGenome API is our biggest time bottleneck. RBS is a zero-cost data multiplier. When the model finds a successful mutated sequence, RBS mathematically hallucinates alternative mutation orders to reach that exact same genetic state. This gives us massively expanded training data from a single successful API call.
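The core trick in RBS can be illustrated in a few lines. Assuming point mutations at distinct positions commute (reordering them reaches the same final sequence), alternative orderings of one discovered trajectory are free training data. The function name and trajectory format below are illustrative, not taken from `4_rbs_augmenter.py`.

```python
import random

def rbs_augment(mutations, k=3, seed=0):
    """
    Given one successful ordered list of point mutations at distinct
    positions, generate k alternative orderings that reach the same
    final sequence -- no extra oracle/API calls needed.
    """
    rng = random.Random(seed)
    perms = set()
    while len(perms) < k:
        p = tuple(rng.sample(mutations, len(mutations)))  # random reordering
        perms.add(p)
    return [list(p) for p in perms]

# One discovered trajectory: (position, new_base) pairs
trajectory = [(12, "A"), (873, "G"), (45, "T")]
alternates = rbs_augment(trajectory, k=3)
print(len(alternates))
```

Note the assumption: this only holds when no two mutations touch the same position; overlapping edits would not commute and would need to be filtered out.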

α-GFN Objective: A tuning mechanism that explicitly balances exploration (trying wild, novel mutations) with exploitation (refining known good edits). This stabilizes the training loop and prevents the model from getting lost in the astronomically huge—mostly empty—mutation space.
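One simple way to picture the exploration/exploitation dial (a stand-in for the α-GFN idea, not the repository's actual objective): temper the reward before sampling, so the sampling distribution is proportional to R(x)^α.

```python
import numpy as np

def tempered_sampling_probs(rewards, alpha):
    """
    alpha -> 0 flattens toward uniform sampling (pure exploration);
    large alpha concentrates probability on the best-known edits
    (pure exploitation).
    """
    r = np.asarray(rewards, dtype=float) ** alpha
    return r / r.sum()

rewards = [1.0, 2.0, 4.0]
print(tempered_sampling_probs(rewards, 0.0))  # uniform: explore
print(tempered_sampling_probs(rewards, 2.0))  # peaked: exploit
```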

Biological Priors & Regularization: Neural networks take shortcuts if left unchecked. They will propose lethal DNA mutations just to trick AlphaGenome into giving a high epigenetic score. EEPM3 uses Evo-2 (a foundational DNA language model) as a biological guardrail. It penalizes lethal mutations, forcing the AI to stay strictly within the boundaries of evolutionarily viable DNA.
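The guardrail amounts to reward shaping. A sketch, with all numbers and the function name purely illustrative: combine the epigenetic-match reward with a viability penalty derived from a DNA language model's likelihood (Evo-2 in EEPM3's case).

```python
def shaped_reward(epi_reward, lm_log_likelihood, threshold=-3.0, penalty_weight=5.0):
    """
    lm_log_likelihood: the DNA language model's average per-base
    log-likelihood of the mutated sequence. Sequences scoring far below
    `threshold` look evolutionarily implausible and get penalized, so the
    policy can't "trick" the epigenetic oracle with lethal DNA.
    """
    viability_gap = max(0.0, threshold - lm_log_likelihood)
    return epi_reward - penalty_weight * viability_gap

# A plausible sequence keeps its reward; an implausible one is pushed down
print(shaped_reward(0.9, -2.5))  # above threshold: no penalty
print(shaped_reward(0.9, -6.0))  # far below threshold: heavily penalized
```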

Discussion and Next Steps
The first caveat I want to point out is that AlphaGenome (and therefore EEPM3) predicts the steady-state equilibrium of an epigenome, not the chronological and physical pathway a cell takes to get there. EEPM3 asks: “If a cell were born with this exact DNA, what would its stable epigenetic landscape look like [in this specific cellular context from AlphaGenome’s settings]?”
The implication of EEPM3 is that we no longer have to rely exclusively on observing nature to find regulatory rules. We can define our desired epigenetic endpoint and let the flow network mathematically derive the mutational blueprint required to get there.
EEPM3 is not meant to determine what any given epigenome looks like (we can simply sequence a sample for that). EEPM3 is an enhancing tool for understanding epigenetic regulation. Its mutations still need to be interpreted and experimented on in silico and in vitro before any true medication or clinical practice can be developed.
Despite operating on free-tier cloud instances, the current pipeline successfully executes a robust, fault-tolerant async loop capable of vectorizing sequence generation, navigating API rate limits via exponential backoff, and reaching statistical convergence in under 100 epochs. In the latest benchmark, the model navigated the AlphaGenome API to map DNASE accessibility on a 100,000 base-pair sequence (N-padded to 131,072 bp to meet API constraints), achieving a 14.3% EMA loss drop on the offline replay buffer.

My Experiment

1. Setup (Initialization)

  • Sequence Context: A base sequence of 100,000 base pairs representing a target genetic locus.
  • API Preparation: The sequence was N-padded to exactly 131,072 base pairs (2^17) to strictly comply with the AlphaGenome API tensor requirements.
  • Target Modality: The specific epigenetic target profile was set entirely to DNASE (chromatin accessibility).
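The padding step above is mechanical. A minimal sketch, assuming center-padding (the padding side is an implementation detail I'm choosing here, not necessarily what the pipeline does):

```python
def n_pad(seq, target_len=131072):
    """
    Pad a sequence with 'N' to the fixed length 2**17 = 131,072 bp that
    the AlphaGenome API expects for this context size.
    """
    if len(seq) > target_len:
        raise ValueError("sequence longer than target context")
    pad = target_len - len(seq)
    left = pad // 2
    return "N" * left + seq + "N" * (pad - left)

padded = n_pad("ACGT" * 25000)  # a 100,000 bp locus
print(len(padded))
```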

2. Generation (Async Sampling & Scoring)

  • Vectorized Sampling: EEPM3’s GeneratorPolicyV2 module explored the mutation sequence space to construct candidate trajectories.
  • Fault-Tolerant Oracle Scoring: 2_api_worker.py queried the AlphaGenome API. It successfully intercepted 429 Too Many Requests (rate limits) and 503 Service Unavailable errors, utilizing an exponential backoff algorithm to ensure zero data loss across hours of querying.
  • Reward Computation: The API output was evaluated using a custom Masked Modality Loss (\mathcal{L}_{mask}) to compute the R(x) reward, strictly masking out unpredicted or missing sparse biological tracks to prevent gradient corruption (NaN leakage).
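The masking logic in the reward computation can be sketched as follows (a toy reconstruction with a hypothetical function name, not the pipeline's actual loss code):

```python
import numpy as np

def masked_modality_loss(pred, target, mask):
    """
    Mean squared error computed only over tracks the oracle actually
    returned. Missing/sparse tracks are masked out so they contribute
    neither loss nor NaN gradients.
    """
    mask = mask.astype(bool)
    if not mask.any():
        return 0.0
    diff = (pred - target)[mask]  # NaNs in masked-out slots are dropped here
    return float(np.mean(diff ** 2))

pred = np.array([0.2, 0.8, np.nan, 0.5])   # NaN = track the API did not predict
target = np.array([0.0, 1.0, 0.3, 0.5])
mask = np.array([1, 1, 0, 1])              # drop the missing track
print(masked_modality_loss(pred, target, mask))
```

Without the mask, the single NaN track would poison the mean and every gradient flowing through it.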

3. Multiplication (Retrospective Backward Synthesis)

  • Zero-Cost Augmentation: For the high-reward trajectories discovered in Step 2, 4_rbs_augmenter.py hallucinated valid alternative mutation pathways using synthetic backward permutations.
  • Yield: This synthetically amplified the high-fidelity training data by 1.5x without making a single extra, costly API call, dumping all data into a robust SQLite WAL mode database.

4. Optimization (Offline Convergence)

  • Architecture Constraint: The GFlowNet dual-head offline trainer was constrained to just 34,136 parameters (avoiding O(N * V) explosion).
  • Loss Mechanics: Utilizing the Sub-Trajectory Evaluation Balance (Sub-EB) loss function via the α-GFN objective.
  • The Result: The trainer optimized against the augmented replay buffer, reaching statistical convergence (a 14.30% drop in the exponential moving average of the loss) at Epoch 82.
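For readers unfamiliar with the convergence metric: the EMA drop is the relative decrease of an exponentially smoothed loss curve. A toy sketch (the decay constant and loss curve are illustrative, not the pipeline's values):

```python
def ema_series(losses, decay=0.9):
    """Exponential moving average of a loss curve."""
    out, ema = [], losses[0]
    for x in losses:
        ema = decay * ema + (1 - decay) * x
        out.append(ema)
    return out

# Toy decaying loss curve; the relative EMA drop is the quantity
# analogous to the 14.30% figure reported above
losses = [1.0 * (0.98 ** i) for i in range(100)]
ema = ema_series(losses)
drop = (ema[0] - ema[-1]) / ema[0]
print(round(drop * 100, 2))  # percent drop in the EMA of the loss
```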

More About the Architecture
If you are interested in the mathematical formulation (L_{TB} equations), the JAX/Flax implementation details, or if you represent a lab with processed clinical multi-omic tensors interested in biological validation, I invite you to explore the repository.

The README contains a comprehensive breakdown of the SOTA pipeline methodology and the complete architecture.

GitHub Repository: https://github.com/tienhdsn-000001/EEPM3
(Let me know your thoughts or open an issue on the repo—I’d love more than anything for collaborators to take this from in silico to in vitro!)