How to properly predict gene expression and histone changes for a large deletion using predict_sequence function

Hello,

Thank you for providing a wonderful tool to predict effects of variants on gene expression and regulation.

I am working with a large structural variant (100kb deletion) and would like to predict the effects of this variant on gene expression and histone marks. I tried both predict_variant and predict_sequence as below:

  • predict_variant: I specified the variant with the reference allele being the full deleted sequence plus 100 bp upstream and downstream, and the alt allele containing only the 100 bp upstream and downstream of the deletion, as below:

variant_id\tCHROM\tPOS\tREF\tALT

where REF = 100 bp upstream + deletion sequence (100 kb) + 100 bp downstream

and ALT = 100 bp upstream + 100 bp downstream

This gave me conflicting results in the RNA-seq predictions for cell types that typically express the genes upstream of the deletion.

  • predict_sequence: I tried two approaches:
    • Approach 1: I ran predict_sequence with a ref sequence (the full deleted sequence plus 100 bp upstream and downstream of the deletion) and an alt sequence (only the 100 bp upstream and downstream of the deletion), padding both with Ns to 1 Mb. This approach resulted in RNA-seq predictions of 0 for genes in the region for both the ref and alt sequences, and it only predicted loss of histone marks/CTCF binding within the deletion region, not in nearby regions.
    • Approach 2: I ran predict_sequence on a 1 Mb sequence for the ref; for the alt sequence I removed the 100 kb deletion and padded the sequence back to 1 Mb. This approach resulted in consistent RNA-seq predictions (reduced expression) across all cell types/tissues tested.

My questions:

  • What is the right approach to run AlphaGenome for a large deletion like this? In my opinion, approach 2 with predict_sequence seems to resemble reality most closely, but I would expect the other approaches to give somewhat similar results.
  • Why do the two predict_sequence approaches give different results? Does padding the sequence with Ns interfere with the prediction?

Thank you.

Hi There!

Thanks for reaching out.

Approach 2 is the correct method for this analysis. AlphaGenome relies on a full 1-Mb genomic context to capture essential long-range regulatory elements. Approach 1 failed because replacing ~900-kb of natural DNA with "N"s removes this regulatory landscape, resulting in near-zero predicted expression. Padding with "N"s works, provided you preserve the surrounding genomic sequence.
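For concreteness, the sequence construction of approach 2 can be sketched in plain Python (the predict_sequence call itself is omitted; the deletion coordinates and the placeholder reference sequence below are hypothetical, not real genomic data):

```python
# A minimal sketch of approach 2: keep the full 1-Mb reference window,
# splice out the deletion, and pad the remainder back to 1 Mb with Ns.
SEQ_LEN = 1_000_000   # model input window
DEL_START = 450_000   # deletion start within the window (illustrative)
DEL_LEN = 100_000     # 100-kb deletion

def build_alt_sequence(ref_seq: str, del_start: int, del_len: int,
                       window: int = SEQ_LEN) -> str:
    """Splice out the deleted segment, then right-pad with Ns to `window`."""
    alt = ref_seq[:del_start] + ref_seq[del_start + del_len:]
    return alt + "N" * (window - len(alt))

ref = "A" * SEQ_LEN          # stands in for the real 1-Mb reference window
alt = build_alt_sequence(ref, DEL_START, DEL_LEN)
assert len(ref) == len(alt) == SEQ_LEN
```

The key point is that only the 100 kb of N padding at the end is "fake" sequence; the surrounding regulatory landscape is preserved, unlike in approach 1 where ~900 kb of Ns dominate the input.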
While predict_variant accepts deletions of this size, the model was only trained on variants of up to 20 base pairs. Predictions for a 100-kb deletion should therefore be evaluated carefully and may be unreliable given the large loss of sequence context.

Kind regards.
Tumi


Hey,

I created an alt sequence with multiple 20 bp deletions (120 bp in total) and ran variant effect prediction and scoring on it. I was wondering whether deletions of this size also affect the accuracy of my results, and whether the second approach mentioned above would be better in my case. However, I don't understand what padding with "N"s means.

Thank you in advance

Hi There,

Thanks for reaching out.

We suggest checking out the open-source implementation for some inspiration. You can see how we handle extracting extended sequences to account for deletions in this section of the codebase: https://github.com/google-deepmind/alphagenome_research/blob/main/src/alphagenome_research/io/genome.py#L101-L139
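As a rough illustration of the idea in that section (this is not the actual AlphaGenome implementation; the function name and toy coordinates are made up), one can fetch an interval extended by the deletion length so that, after splicing out the deletion, the ALT window consists entirely of real flanking sequence with no N padding at all:

```python
# Illustrative only: build an ALT window of pure flanking sequence by
# extracting `window + del_len` bases around the deletion, then splicing
# the deleted segment out.
def extract_deletion_window(chrom_seq: str, del_start: int, del_len: int,
                            window: int) -> str:
    """Return a `window`-length ALT sequence roughly centred on the deletion."""
    half = (window + del_len) // 2
    centre = del_start + del_len // 2
    start = max(0, centre - half)
    extended = chrom_seq[start:start + window + del_len]
    rel = del_start - start                    # deletion start, local coords
    alt = extended[:rel] + extended[rel + del_len:]
    return alt[:window]

# Toy example: a 2-kb "chromosome" with a 100-bp run of Gs to delete.
chrom = "A" * 900 + "G" * 100 + "C" * 1000
alt = extract_deletion_window(chrom, del_start=900, del_len=100, window=1000)
assert len(alt) == 1000 and "G" not in alt
```

For small multi-deletion cases like yours (120 bp total), the amount of padding or extension involved is tiny relative to the 1-Mb window, so either strategy should behave similarly.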

Kind regards,
Tumi