Hello,
Thank you for providing a wonderful tool for predicting the effects of variants on gene expression and regulation.
I am working with a large structural variant (a 100kb deletion) and would like to predict its effects on gene expression and histone marks. I tried both predict_variant and predict_sequence, as described below:
- predict_variant: I specified the variant with the REF allele being the full deleted sequence plus 100bp upstream and downstream, and the ALT allele being only the 100bp upstream and 100bp downstream of the deletion, as below:
variant_id\tCHROM\tPOS\tREF\tALT
with REF: 100bp upstream + deleted sequence (100kb) + 100bp downstream
ALT: 100bp upstream + 100bp downstream
This gave conflicting RNAseq predictions for cell types that typically express the genes upstream of the deletion.
- predict_sequence: I tried 2 approaches:
- Approach 1: I ran predict_sequence on a ref sequence (the full deleted sequence plus 100bp upstream and downstream of the deletion) and an alt sequence (only the 100bp upstream and 100bp downstream of the deletion). I padded both the ref and alt sequences with Ns to bring them to 1Mb. This approach gave RNAseq predictions of 0 for genes in the region for both the ref and alt sequences, and it only predicted loss of histone marks/CTCF binding within the deleted region, not in nearby regions.
- Approach 2: I ran predict_sequence on a 1Mb sequence for the ref; for the alt, I removed the 100kb deleted region from that sequence and padded the result back to 1Mb. This approach gave consistent RNAseq predictions (reduced expression) in all cell types/tissues tested.
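For concreteness, the two predict_sequence inputs described above could be constructed roughly as in the sketch below. This is plain string handling only and makes no AlphaGenome calls. The helper names (`pad_center`, `approach1`, `approach2`), the choice to pad symmetrically on both sides, and the 1,048,576 bp value used for "1Mb" are all assumptions for illustration, not taken from the original runs.

```python
# Hypothetical helpers for illustration; none of these names come from AlphaGenome.
SEQ_LEN = 1_048_576  # assumed length of a "1Mb" input (2**20 bp)


def pad_center(seq: str, length: int = SEQ_LEN) -> str:
    """Pad seq with Ns on both sides so it sits in the middle of a window of `length`."""
    if len(seq) > length:
        raise ValueError("sequence is longer than the target length")
    total = length - len(seq)
    left = total // 2
    return "N" * left + seq + "N" * (total - left)


def approach1(upstream: str, deleted: str, downstream: str, length: int = SEQ_LEN):
    """Approach 1: short flanks only, everything else is N padding."""
    ref = pad_center(upstream + deleted + downstream, length)
    alt = pad_center(upstream + downstream, length)
    return ref, alt


def approach2(window: str, del_start: int, del_end: int):
    """Approach 2: full genomic window as ref; alt has the deletion removed, then is re-padded.

    del_start/del_end are 0-based, half-open offsets within `window` (an assumption).
    """
    ref = window
    alt = pad_center(window[:del_start] + window[del_end:], len(window))
    return ref, alt


if __name__ == "__main__":
    # Toy-sized demo so the construction is easy to inspect by eye.
    print(approach1("AC", "GGGG", "TA", length=12))
    print(approach2("AAAACCCCGGGG", 4, 8))
```

In approach 2 the alt sequence keeps real genomic context on both sides of the breakpoint, whereas in approach 1 almost all of the window is padding. Where the Ns are placed also shifts the coordinates of the real bases, which matters when comparing ref and alt tracks position by position.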
My questions:
- What is the right approach to run AlphaGenome for a large deletion like this? In my opinion, approach 2 for predict_sequence most closely resembles reality, but I would expect the other approaches to give somewhat similar results.
- Why do the two predict_sequence approaches give different results? Does padding the sequence with Ns interfere with prediction?
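To make the padding question concrete, the short calculation below estimates how much of each input is N under the two predict_sequence approaches. It assumes a "1Mb" input of 1,048,576 bp, a deletion of exactly 100,000 bp, and 100 bp flanks; the real numbers will differ slightly. It does not show whether Ns interfere with prediction; it only quantifies how much real sequence context each input carries.

```python
# Back-of-the-envelope arithmetic; all three constants are assumptions.
SEQ_LEN = 1_048_576   # assumed "1Mb" input length (2**20 bp)
DELETION = 100_000    # assumed exact size of the "100kb" deletion
FLANK = 100           # flank length on each side


def n_fraction(real_bases: int, length: int = SEQ_LEN) -> float:
    """Fraction of the input window that is N padding rather than real sequence."""
    return (length - real_bases) / length


# Number of real (non-N) bases in each input.
inputs = {
    "approach 1 ref": FLANK + DELETION + FLANK,
    "approach 1 alt": 2 * FLANK,
    "approach 2 ref": SEQ_LEN,
    "approach 2 alt": SEQ_LEN - DELETION,
}

for name, real in inputs.items():
    print(f"{name}: {real:,} real bp, {n_fraction(real):.2%} N")
```

Under these assumptions, the approach 1 ref is about 90% N and the approach 1 alt is more than 99.9% N, while the approach 2 alt is under 10% N.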
Thank you.