Hi everyone,
We’ve identified an issue with how the model currently handles indels, and we’re working on a longer-term fix (similar to the approach in Korsakova et al. (2025), shift augmentation improves…). The fix will be applied to both the API and the model. Below is a description of the issue and our suggested interim workaround.
The issue: When an indel is applied to a sequence, we anchor the sequence at the start of the indel. For deletions, the bases after the anchor shift to the left, and additional bases are added to fill in the deleted positions. For insertions, the bases after the anchor are pushed to the right and the overflow is removed at the end of the sequence.
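To make the anchoring concrete, here is a minimal sketch in plain Python. It is only an illustration of the behaviour described above, not the actual API code: the function name `apply_variant_fixed_window` and the `right_flank` argument are hypothetical, and it assumes the model input is a fixed-length window.

```python
def apply_variant_fixed_window(window, offset, ref, alt, right_flank=""):
    """Build the ALT window, anchored at the variant start, with the same length as `window`.

    window:      reference sequence fed to the model (fixed length)
    offset:      0-based index of the variant start within `window`
    ref, alt:    REF and ALT alleles
    right_flank: reference bases immediately to the right of `window`
                 (hypothetical argument, used to refill after a deletion)
    """
    if window[offset:offset + len(ref)] != ref:
        raise ValueError("REF allele does not match the window at this offset")

    # Everything left of the anchor is untouched; everything right of it shifts.
    edited = window[:offset] + alt + window[offset + len(ref):]

    # Deletion: the sequence got shorter, so refill from the right flank.
    if len(edited) < len(window):
        missing = len(window) - len(edited)
        if len(right_flank) < missing:
            raise ValueError("right_flank is too short to refill the window")
        edited += right_flank[:missing]

    # Insertion: the sequence got longer, so drop the overflow at the end.
    return edited[:len(window)]


window = "ACGTACGTAC"
deletion = apply_variant_fixed_window(window, 3, "TA", "T", right_flank="GG")
insertion = apply_variant_fixed_window(window, 3, "T", "TGG")
```

In both cases every base to the right of the anchor sits at a different index in the ALT window than in the REF window, which is the offset discussed next.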
This positional offset between the reference and alternate sequences after the anchor causes numerical discrepancies (from pooling, convolutions, etc.) in the model’s predictions downstream (to the right) of the variant. When running score variants, the differences between REF and ALT are skewed by this numerical effect even if the variant has no real effect. The picture below depicts the frameshift that occurs.
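A toy example of why a shift alone changes pooled values, assuming nothing about the real architecture: the per-base signal below is made up, and `mean_pool` is a hypothetical stand-in for a strided pooling layer. The same feature is present in both tracks, only moved by one position, yet the binned outputs differ.

```python
def mean_pool(values, width):
    """Non-overlapping mean pooling with stride equal to `width`."""
    return [
        sum(values[i:i + width]) / width
        for i in range(0, len(values) - width + 1, width)
    ]


# Made-up per-base signal with a single feature at positions 6..9.
ref_signal = [0, 0, 0, 0, 0, 0, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0]

# Simulate a 1 bp deletion at index 1: downstream values shift left by one,
# and one value is appended at the end to keep the length fixed.
alt_signal = ref_signal[:1] + ref_signal[2:] + [0]

ref_bins = mean_pool(ref_signal, 4)  # [0.0, 2.0, 2.0, 0.0]
alt_bins = mean_pool(alt_signal, 4)  # [0.0, 3.0, 1.0, 0.0]

# Per-bin REF vs. ALT difference is non-zero although the feature is unchanged.
bin_diff = [a - r for r, a in zip(ref_bins, alt_bins)]
```

The total signal is the same in both tracks; only its split across bin boundaries has changed, which is enough to produce a non-zero REF vs. ALT difference.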
Suggested workaround: In the interim, to prevent skewed REF vs. ALT scores, we recommend that you use a separate calibration process for indels.
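One possible way to keep the calibration separate, sketched here as an assumption rather than a prescribed procedure: rank each variant's score against a background of scores from the same variant class, so indels are only compared with indels. The function names and the background lists are hypothetical, and the numbers are placeholders, not real model output.

```python
from bisect import bisect_right


def empirical_percentile(score, background):
    """Fraction of background scores that are less than or equal to `score`."""
    ranked = sorted(background)
    return bisect_right(ranked, score) / len(ranked)


def calibrate(score, is_indel, snv_background, indel_background):
    """Rank a raw score against the background of its own variant class."""
    background = indel_background if is_indel else snv_background
    return empirical_percentile(score, background)


# Placeholder backgrounds: indel scores are assumed to be inflated by the shift.
snv_background = [0.0, 0.1, 0.2, 0.3]
indel_background = [0.5, 1.0, 1.5, 2.0]

as_indel = calibrate(0.5, True, snv_background, indel_background)
as_snv = calibrate(0.5, False, snv_background, indel_background)
```

With these placeholder values the same raw score of 0.5 ranks at the bottom of the indel background but at the top of the SNV background, which is why mixing the two classes in one calibration would be misleading.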
As always, let us know if you have any questions.
Kyle
AlphaGenome Team
Dear Dr. Taylor,
Do you have an update on the indel inference? I am very interested in using this in my own research.
I understand from the Korsakova et al. paper that the distribution of indel predictions is very different from that of SNP predictions, and that combining the two is not trivial:
Despite a substantial reduction in the prediction variance introduced by indels with the stitching strategy, the distribution of predicted indel scores still diverges from the SNP score distribution. For applications that require fully matched distributions, additional techniques like quantile normalization, could be used. However, this may be counterproductive for larger indels, which likely have effect distributions reflecting larger influence. Fully alleviating the technical variance introduced by indels may require redesigning convolutional network architectures to remove all boundaries, e.g. by avoiding pooling and striding blocks and predicting at nucleotide resolution. The evaluations described here may be used to aid this architecture design evolution.
Out of curiosity, is this something that your team is also tackling in this work, now that AlphaGenome has nucleotide resolution? Or will this still require quantile normalization by the end user?
Best regards,
Tijs van Lieshout