How to properly predict gene expression and histone mark changes for a large deletion using the predict_sequence function

Hello,

Thank you for providing a wonderful tool to predict effects of variants on gene expression and regulation.

I am working with a large structural variant (a 100kb deletion) and would like to predict its effects on gene expression and histone marks. I tried both predict_variant and predict_sequence, as described below:

  • predict_variant: I specified the variant so that the reference allele is the full deleted sequence plus 100bp upstream and downstream, and the alt allele is only the 100bp upstream and 100bp downstream of the deletion, in this format:

variant_id\tCHROM\tPOS\tREF\tALT

with REF: 100bp upstream + deletion sequence (100kb) + 100bp downstream

ALT: 100bp upstream + 100bp downstream

This gave me some conflicting results in the RNAseq predictions for cell types that typically express the genes upstream of the deletion.
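To make the allele construction above concrete, here is a minimal sketch in plain Python. The helper name `deletion_variant_row`, the 0-based half-open deletion coordinates, and the choice of a 1-based POS pointing at the first base of REF are my assumptions for illustration, not something taken from the AlphaGenome documentation. The toy sequence stands in for the real chromosome.

```python
def deletion_variant_row(variant_id, chrom, chrom_seq, del_start, del_end, flank=100):
    """Build a [variant_id, CHROM, POS, REF, ALT] row for a flanked deletion.

    del_start/del_end are 0-based, half-open coordinates into chrom_seq
    (an assumption of this sketch).
    """
    if del_start - flank < 0 or del_end + flank > len(chrom_seq):
        raise ValueError("flanks extend past the ends of the sequence")
    left = chrom_seq[del_start - flank:del_start]   # upstream flank
    right = chrom_seq[del_end:del_end + flank]      # downstream flank
    ref = left + chrom_seq[del_start:del_end] + right
    alt = left + right
    pos = del_start - flank + 1  # assumed: 1-based position of the first REF base
    return [variant_id, chrom, str(pos), ref, alt]


# Toy example: a 400bp "chromosome" with a 100bp deletion and 20bp flanks.
toy_chrom = "ACGT" * 100
row = deletion_variant_row("toy_del", "chrToy", toy_chrom, 150, 250, flank=20)
print("\t".join(["variant_id", "CHROM", "POS", "REF", "ALT"]))
print("\t".join(row))
```

For the real variant, `flank` would be 100 and the deleted span 100kb, so REF is 100,200bp long and ALT is 200bp long.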

  • predict_sequence: I tried two approaches:
    • Approach 1: I ran predict_sequence on a ref sequence (the full deleted sequence plus 100bp upstream and downstream of the deletion) and an alt sequence (only the 100bp upstream and 100bp downstream of the deletion). I padded both with Ns to bring them to 1Mb. With this approach, the RNAseq prediction was 0 for genes in the region for both the ref and alt sequences, and the only predicted change was a loss of histone marks/CTCF binding within the deletion itself, not in nearby regions.
    • Approach 2: I ran predict_sequence on a 1Mb sequence for the ref. For the alt, I removed the 100kb deletion from that sequence and padded the result back to 1Mb. This approach gave consistent RNAseq predictions (reduced expression) in all cell types/tissues tested.
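For clarity, the two constructions can be sketched as follows. This is plain Python with no AlphaGenome calls. The helper names (`pad_with_n`, `approach_1`, `approach_2`) are hypothetical, the 1Mb input length is assumed to be 2**20 = 1,048,576bp, and splitting the N padding evenly on both sides is also an assumption. A random sequence stands in for the real genomic window.

```python
import random

SEQ_LEN = 2**20  # assumed 1Mb model input length (1,048,576bp)


def pad_with_n(seq, length=SEQ_LEN):
    """Pad seq with Ns on both sides (any odd extra N goes on the right)."""
    if len(seq) > length:
        raise ValueError("sequence is longer than the target length")
    total = length - len(seq)
    left = total // 2
    return "N" * left + seq + "N" * (total - left)


def approach_1(window, del_start, del_end, flank=100):
    """REF = flank + deletion + flank, ALT = flank + flank; both N-padded."""
    ref = window[del_start - flank:del_end + flank]
    alt = window[del_start - flank:del_start] + window[del_end:del_end + flank]
    return pad_with_n(ref), pad_with_n(alt)


def approach_2(window, del_start, del_end):
    """REF = full window, ALT = window minus the deletion, N-padded."""
    alt = window[:del_start] + window[del_end:]
    return window, pad_with_n(alt)


# Stand-in for the real 1Mb reference window, deletion placed mid-window.
random.seed(0)
window = "".join(random.choices("ACGT", k=SEQ_LEN))
del_start = (SEQ_LEN - 100_000) // 2
del_end = del_start + 100_000

ref1, alt1 = approach_1(window, del_start, del_end)
ref2, alt2 = approach_2(window, del_start, del_end)

# Report how much of each model input is N padding.
for name, seq in [("ref1", ref1), ("alt1", alt1), ("ref2", ref2), ("alt2", alt2)]:
    print(f"{name}: length={len(seq)}, fraction N={seq.count('N') / len(seq):.4f}")
```

Under these assumptions, the Approach 1 inputs are mostly padding (about 90% N for the ref and over 99.9% N for the alt), while in Approach 2 the ref has no padding and the alt is about 9.5% N. The two approaches therefore give the model very different amounts of real genomic context.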

My question:

  • What is the right way to run AlphaGenome for a large deletion like this? In my opinion, approach 2 for predict_sequence most closely resembles reality, but I suspect the other approaches should give somewhat similar results.
  • Why do the two predict_sequence approaches give different results? Does padding the sequence with Ns interfere with the prediction?

Thank you.