When predicting expression for longer variants (long SVs such as deletions or insertions), I found that the coordinates of the alternative-allele prediction shift with the insertion/deletion, which creates a mismatch between the reference and alternative expression patterns.
For an insertion:
(I reached the picture limit as a new user. The plot would show a red line with exactly the same pattern as the blue line, but shifted to a later position.)
This is kind of annoying because the coordinate change makes it hard to compare expression levels. I assumed there would be some auto-correction that adjusts for the difference. I made some adjustments to the data (cutting the extra sequence or adding extra NAs), but that created some missing positions. Is there any way to avoid this?
Thanks for the post! Yes, we’re aware of this frameshift issue and will try to fix it in the future. In the meantime, it should be somewhat possible to manually pad/trim the predictions to align them (though we can definitely make this easier!). Something like this should work:
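A minimal sketch of that kind of manual pad/trim, for anyone landing on this thread. This is not code from the AlphaGenome library: `align_alt_to_ref` and its parameters are hypothetical names, and it assumes the ALT track is a 1D NumPy array that starts at the same genomic coordinate as the REF track, so only positions after the variant are shifted. If your window is anchored differently (for example centered on the variant), the positions before the variant would need their own correction.

```python
import numpy as np


def align_alt_to_ref(alt_track, variant_index, ref_len, alt_len):
    """Map a 1D ALT track back onto REF coordinates (hypothetical helper).

    variant_index is the array index of the first base of the variant.
    Assumes both tracks start at the same genomic position.
    """
    alt_track = np.asarray(alt_track, dtype=float)
    n = alt_track.shape[0]
    shift = alt_len - ref_len

    if shift > 0:
        # Insertion: drop the inserted positions, which have no REF
        # counterpart, then pad the end with NaN to keep the length.
        start = variant_index + ref_len
        stop = min(start + shift, n)
        aligned = np.delete(alt_track, np.arange(start, stop))
        pad = n - aligned.shape[0]
        aligned = np.concatenate([aligned, np.full(pad, np.nan)])
    elif shift < 0:
        # Deletion: insert NaN where REF bases are missing from ALT,
        # then trim the overflow at the end.
        start = variant_index + alt_len
        aligned = np.insert(alt_track, start, np.full(-shift, np.nan))[:n]
    else:
        aligned = alt_track.copy()
    return aligned


# Toy example: a 2 bp insertion after index 3 in a 10-position track.
alt = np.arange(10, dtype=float)
ins_aligned = align_alt_to_ref(alt, variant_index=3, ref_len=1, alt_len=3)
del_aligned = align_alt_to_ref(alt, variant_index=3, ref_len=3, alt_len=1)
print(ins_aligned)
print(del_aligned)
```

The NaN entries mark positions with no counterpart in the other allele, which is why some missing positions are unavoidable with this approach; functions such as `np.nanmax` can skip them when comparing tracks.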
Thanks @tward for replying to this thread, and thank you @Jingyu_Zeng for bringing up this question.
I am facing the same issue, so I’d like to follow up on the progress of the internal fix on the AlphaGenome side.
My use case is to use dna_model.score_variant to get variant scores. When my ref and alt alleles differ in length, the tracks are shifted, so the variant always scores high for splicing changes, and most of these are false positives. For instance, for the variant chr2:178637146:GATATATAT>GATATATATATATAT, score_variant gives me a value of 0.9960933 for SPLICE_SITES, but looking at the track, this is only because of the coordinate shift (top panel in the figure below).
I am able to change the plotting as tward suggested, but I don’t know how to apply the same correction before scoring for splicing (usage, junctions, sites) with the score_variant function. Another problem I noticed is that if I pad the array, the track after the variant location is fixed, but the track before the variant location is shifted backward and is now off, which would also cause false positive calls if we scored the variant for splicing on this padded data:
Thanks for this report, and sorry for the slow response.
Re: variant scoring, we currently don’t do any variant normalization, which we think might be tripping up our sequence alignment. We’ll look into fixing this, but for the time being you should be able to get more accurate predictions by providing normalized, left-aligned variants, using something like bcftools to do the normalization.
For example, after normalizing your chr2:178637146:GATATATAT>GATATATATATATAT to chr2:178637146:G>GATATAT, we get a more reasonable raw splice site score of 0.566406.
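To make the normalization step concrete, here is a rough sketch of the allele-trimming part only. `trim_variant` is a hypothetical helper, not part of AlphaGenome or bcftools. It removes bases shared by both alleles but does not shift the variant leftward through a repeat, which requires the reference sequence, so a tool such as bcftools with a reference FASTA is still the safer route for real data.

```python
def trim_variant(pos, ref, alt):
    """Trim bases shared by REF and ALT (hypothetical helper).

    Keeps at least one base in each allele. This is only the trimming
    part of normalization; it does not left-align against a reference.
    """
    # Trim shared trailing bases; the position does not change.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Trim shared leading bases; each trimmed base moves the position right.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


trimmed = trim_variant(178637146, 'GATATATAT', 'GATATATATATATAT')
print(trimmed)  # (178637146, 'G', 'GATATAT')
```

For this variant, trimming alone reproduces the normalized form given above, because the shared bases are all at the right-hand end of the alleles.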
@Yue_Zhou apologies for this very slow reply, but we have since fixed this issue in our variant sequence alignment and you should now get more reasonable scores (though we still recommend normalizing any variants to be absolutely sure).
As of October, scoring both the normalized and un-normalized variants should yield the same result:
from alphagenome import colab_utils
from alphagenome.data import genome
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers
import numpy as np

model = dna_client.create(colab_utils.get_api_key())

# Score the original, un-normalized variant.
scores = model.score_variant(
    genome.Interval.from_str('chr2:178112862-179161438:.'),
    genome.Variant.from_str('chr2:178637146:GATATATAT>GATATATATATATAT'),
    variant_scorers=[
        variant_scorers.GeneMaskSplicingScorer(
            requested_output=dna_client.OutputType.SPLICE_SITES,
            width=None,
        )
    ],
)
x = np.nanmax(scores[0].X).item()

# Score the normalized form of the same variant.
scores = model.score_variant(
    genome.Interval.from_str('chr2:178112862-179161438:.'),
    genome.Variant.from_str('chr2:178637146:G>GATATAT'),
    variant_scorers=[
        variant_scorers.GeneMaskSplicingScorer(
            requested_output=dna_client.OutputType.SPLICE_SITES,
            width=None,
        )
    ],
)
y = np.nanmax(scores[0].X).item()

print(f'{x=}, {y=}')