Estimate the functional impact of deletions (in kb) using AlphaGenome

Hi,
I am working with a ~1400 bp deletion identified from ONT long-read sequencing (hg38). I want to evaluate its functional impact on local genomic features such as regulatory elements and gene activity.

Given that score_variant is designed for single variants, would it be appropriate to model this deletion as a single contiguous structural variant by defining the full reference span as the reference allele and the sequence with the deletion applied as alternate_bases?

Alternatively, would you recommend generating a custom sequence with the deletion applied and using predict_sequence() followed by manual comparison with the reference predictions?

How reliable are these approaches for capturing the functional consequences of deletions of this size?

Hi There!

Thanks for reaching out.

The main distinction is whether you want to predict biological activity for a sequence or run a comparative analysis that measures the impact of a variant. Both approaches work natively with your deletion.

You can define a genome.Variant that encodes the 1400 bp deletion (the deleted span on the reference side, the shortened allele in alternate_bases) and pass it to score_variant. Please note that while score_variant only considers a single genome.Variant, a genome.Variant can be composed of multiple mutations/SNPs/indels. Alternatively, you can construct a custom sequence with the deletion applied and run predict_sequence() on it.
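To make the two options above concrete, here is a minimal pure-Python sketch of how the deletion could be expressed in both forms. It does not call AlphaGenome itself. The helper names (deletion_as_variant_fields, apply_deletion) are hypothetical, and it assumes VCF-style conventions: 1-based positions, an anchor base placed before the deleted span, and a genome.Variant that takes reference_bases alongside alternate_bases. Right-padding the edited sequence with N to keep the original length is also an assumption; extending with real flanking sequence may be the better choice for your analysis.

```python
def deletion_as_variant_fields(ref_seq, seq_start, del_start, del_len):
    """Return VCF-style (position, reference_bases, alternate_bases).

    ref_seq   : reference sequence of the window (string)
    seq_start : 1-based genomic coordinate of ref_seq[0]
    del_start : 1-based genomic coordinate of the first deleted base
    del_len   : number of deleted bases
    """
    offset = del_start - seq_start  # 0-based index of the first deleted base
    if offset < 1 or offset + del_len > len(ref_seq):
        raise ValueError("deletion (plus anchor base) must lie inside ref_seq")
    anchor = ref_seq[offset - 1]  # base immediately upstream of the deletion
    reference_bases = anchor + ref_seq[offset:offset + del_len]
    alternate_bases = anchor  # only the anchor base remains after deletion
    position = del_start - 1  # the variant is reported at the anchor base
    return position, reference_bases, alternate_bases


def apply_deletion(ref_seq, seq_start, del_start, del_len, pad="N"):
    """Return ref_seq with the deletion applied, right-padded to its original length."""
    offset = del_start - seq_start
    if offset < 0 or offset + del_len > len(ref_seq):
        raise ValueError("deletion must lie inside ref_seq")
    alt_seq = ref_seq[:offset] + ref_seq[offset + del_len:]
    return alt_seq + pad * del_len  # assumption: the model needs a fixed input length


# Toy example: a 10 bp window starting at position 101, deleting 3 bp from 105.
position, reference_bases, alternate_bases = deletion_as_variant_fields(
    "ACGTACGTAC", seq_start=101, del_start=105, del_len=3
)
alt_sequence = apply_deletion("ACGTACGTAC", seq_start=101, del_start=105, del_len=3)
```

The first helper gives the fields you would put into a single variant for score_variant; the second gives the edited sequence for predict_sequence(). When comparing the edited prediction against the reference one, keep in mind that everything downstream of the deletion is shifted by del_len positions.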

AlphaGenome fully supports analyzing indels: insertions, deletions, and inversions of random lengths between 1 and 20 base pairs were used during distillation. There is no limit on the size of SVs/indels that can be analyzed. However, because your 1400 bp deletion falls well outside this training distribution, the model’s predictions may be less reliable.

Kind regards,
Tumi

Hi Tumi,

Thank you for your answer re. estimating the functional impact of large deletions (in Kb) using AlphaGenome.

I am also interested in applying AlphaGenome to predict the effects of a very large deletion (100 kb) on the expression of nearby genes. I’ve tried score_variant and got conflicting RNA-seq results across different cell lines. I understand that the model’s predictions may be less reliable due to the size of the deletion, but I’m curious whether differences in RNA-seq data quality could also affect the prediction. For example, in the supplementary methods, you mention “normalized RNA-seq tracks from ENCODE and GTEx were grouped by their ontology CURIE and assay type… Within each group, the normalized signals were averaged across all included experiments or individuals”.
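To illustrate what I mean by the quoted grouping-and-averaging step, here is a toy sketch of how I understand it. This is not the actual AlphaGenome preprocessing code: the function name, the dictionary keys, the group labels, and the signal values are all made up for illustration.

```python
from collections import defaultdict
from statistics import mean


def average_tracks_by_group(experiments):
    """Average signal tracks position-wise within each (CURIE, assay) group."""
    groups = defaultdict(list)
    for exp in experiments:
        groups[(exp["curie"], exp["assay"])].append(exp["signal"])
    # zip(*signals) walks the tracks position by position within a group.
    return {
        key: [mean(values) for values in zip(*signals)]
        for key, signals in groups.items()
    }


# Hypothetical input: group A has two experiments, group B has only one.
toy_experiments = [
    {"curie": "GROUP_A", "assay": "RNA-seq", "signal": [1.0, 2.0]},
    {"curie": "GROUP_A", "assay": "RNA-seq", "signal": [3.0, 4.0]},
    {"curie": "GROUP_B", "assay": "RNA-seq", "signal": [5.0, 6.0]},
]
averaged = average_tracks_by_group(toy_experiments)
```

In this sketch each group ends up with exactly one track regardless of how many experiments went into it, so a group with a single experiment passes its signal through unchanged, noise included.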

Could this create some bias? For example, if a cell type (such as fibroblasts) is represented by multiple high-quality experiments, would the predictions for it be more accurate than those for cell types with limited data in GTEx/ENCODE?