How to properly predict gene expression and histone changes for a large deletion using predict_sequence function

Hello,

Thank you for providing a wonderful tool to predict effects of variants on gene expression and regulation.

I am working with a large structural variant (100kb deletion) and would like to predict the effects of this variant on gene expression and histone marks. I tried both predict_variant and predict_sequence as below:

  • predict_variant: I specified the variant with the reference allele being the full deleted sequence plus 100 bp upstream and downstream, and the alt allele containing only the 100 bp upstream and downstream of the deletion, as below:

variant_id\tCHROM\tPOS\tREF\tALT

where REF = 100 bp upstream + deletion sequence (100 kb) + 100 bp downstream

and ALT = 100 bp upstream + 100 bp downstream

This gave me conflicting results in the RNA-seq predictions for cell types that typically express the genes upstream of the deletion.

  • predict_sequence: I tried two approaches:
    • Approach 1: I ran predict_sequence with a ref sequence (the full deleted sequence plus 100 bp upstream and downstream of the deletion) and an alt sequence (only the 100 bp upstream and downstream of the deletion), padding both with Ns to 1 Mb. This approach resulted in RNA-seq predictions of 0 for genes in the region for both the ref and alt sequences, and it only predicted loss of histone marks/CTCF binding within the deletion region, not in nearby regions.
    • Approach 2: I ran predict_sequence on a 1 Mb sequence for the ref; for the alt sequence I removed the 100 kb deletion and padded the sequence back to 1 Mb. This approach resulted in consistent RNA-seq predictions (reduced expression) across all cell types/tissues tested.

My questions:

  • What is the right approach to run AlphaGenome for a large deletion like this? In my opinion, approach 2 with predict_sequence seems to resemble reality most closely, but I would expect the other approaches to give somewhat similar results.
  • Why do the two predict_sequence approaches give different results? Does padding the sequence with Ns interfere with the prediction?

Thank you.

Hi There!

Thanks for reaching out.

Approach 2 is the correct method for this analysis. AlphaGenome relies on a full 1-Mb genomic context to capture essential long-range regulatory elements. Approach 1 failed because replacing ~900-kb of natural DNA with "N"s removes this regulatory landscape, resulting in near-zero predicted expression. Padding with "N"s works, provided you preserve the surrounding genomic sequence.
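For concreteness, the sequence construction of approach 2 can be sketched in plain Python (the predict_sequence call itself is omitted; the deletion coordinates and the placeholder reference sequence below are hypothetical, not real genomic data):

```python
# A minimal sketch of approach 2: keep the full 1-Mb reference window,
# splice out the deletion, and pad the remainder back to 1 Mb with Ns.
SEQ_LEN = 1_000_000   # model input window
DEL_START = 450_000   # deletion start within the window (illustrative)
DEL_LEN = 100_000     # 100-kb deletion

def build_alt_sequence(ref_seq: str, del_start: int, del_len: int,
                       window: int = SEQ_LEN) -> str:
    """Splice out the deleted segment, then right-pad with Ns to `window`."""
    alt = ref_seq[:del_start] + ref_seq[del_start + del_len:]
    return alt + "N" * (window - len(alt))

ref = "A" * SEQ_LEN          # stands in for the real 1-Mb reference window
alt = build_alt_sequence(ref, DEL_START, DEL_LEN)
assert len(ref) == len(alt) == SEQ_LEN
```

The key point is that only the 100 kb of N padding at the end is "fake" sequence; the surrounding regulatory landscape is preserved, unlike in approach 1 where ~900 kb of Ns dominate the input.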
While predict_variant accepts deletions of this size, the model was only trained on variants of up to 20 base pairs. Predictions for a 100-kb deletion should therefore be evaluated carefully and may be unreliable given the large loss of sequence context.

Kind regards.
Tumi


Hey,

I created an alt sequence with multiple 20 bp deletions (120 bp in total) and ran variant effect prediction and scoring on it. I was wondering whether deletions of this size also affect the accuracy of my results, and whether the second approach mentioned above would be better in my case. However, I don't understand what padding with "N"s means.

Thank you in advance

Hi There,

Thanks for reaching out.

We suggest checking out the open-source implementation for some inspiration. You can see how we handle extracting extended sequences to account for deletions in this section of the codebase: https://github.com/google-deepmind/alphagenome_research/blob/main/src/alphagenome_research/io/genome.py#L101-L139
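As a rough illustration of the idea in that section (this is not the actual AlphaGenome implementation; the function name and toy coordinates are made up), one can fetch an interval extended by the deletion length so that, after splicing out the deletion, the ALT window consists entirely of real flanking sequence with no N padding at all:

```python
# Illustrative only: build an ALT window of pure flanking sequence by
# extracting `window + del_len` bases around the deletion, then splicing
# the deleted segment out.
def extract_deletion_window(chrom_seq: str, del_start: int, del_len: int,
                            window: int) -> str:
    """Return a `window`-length ALT sequence roughly centred on the deletion."""
    half = (window + del_len) // 2
    centre = del_start + del_len // 2
    start = max(0, centre - half)
    extended = chrom_seq[start:start + window + del_len]
    rel = del_start - start                    # deletion start, local coords
    alt = extended[:rel] + extended[rel + del_len:]
    return alt[:window]

# Toy example: a 2-kb "chromosome" with a 100-bp run of Gs to delete.
chrom = "A" * 900 + "G" * 100 + "C" * 1000
alt = extract_deletion_window(chrom, del_start=900, del_len=100, window=1000)
assert len(alt) == 1000 and "G" not in alt
```

For small multi-deletion cases like yours (120 bp total), the amount of padding or extension involved is tiny relative to the 1-Mb window, so either strategy should behave similarly.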

Kind regards,
Tumi