Coordinate change when predicting long variants

Jingyu_Zeng · July 15, 2025, 11:14am

Hi everyone,

When predicting expression for longer variants (long SVs such as deletions or insertions), I found that the coordinates of the alternate-allele prediction shift with the insertion/deletion, which creates a mismatch between the reference and alternate expression patterns.

In deletion:

In insertion:
(I reached the picture limit as a new user. The plot would show a red line with exactly the same pattern as the blue line, but shifted to a later position.)

This is kind of annoying because the coordinate change makes it hard to compare expression levels. I assume there should be some auto-correction that adjusts for the difference. I tried adjusting the data myself (trimming the extra sequence or padding with NA), but that created missing positions. Is there any way to avoid this?

tward · July 17, 2025, 9:57am

Thanks for the post! Yes, we’re aware of this frameshift issue and will try to fix it in the future. In the meantime, it should be somewhat possible to manually pad/trim the predictions to align them (though we can definitely make this easier!). Something like this should work:

from alphagenome.data import genome
from alphagenome.data import track_data
from alphagenome.models import dna_client
from alphagenome.visualization import plot
from alphagenome.visualization import plot_components
import numpy as np

model = dna_client.create('MY_KEY')

variant = genome.Variant.from_str(
    'chr1:227278338:GATTACAGATTACAGATTACAGATTACAGATTACA>C'
)
interval = variant.reference_interval.resize(2048)

predictions = model.predict_variant(
    interval,
    variant,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
    ontology_terms=['UBERON:992'],
)

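# Shift the ALT track right by the reference allele length: zero-pad the start
# of the position axis, then trim the same number of rows from the end so the
# array keeps its original length.
align_alternate = track_data.TrackData(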
    np.pad(
        predictions.alternate.rna_seq.values,
        ((len(variant.reference_bases), 0), (0, 0)),
    )[:-len(variant.reference_bases), :],
    predictions.alternate.rna_seq.metadata,
    interval=predictions.alternate.rna_seq.interval,
)

_ = plot_components.plot(
    [
        plot_components.OverlaidTracks(
            {
                'REF': predictions.reference.rna_seq,
                'ALT': align_alternate,
            },
            colors={'REF': 'dimgrey', 'ALT': 'red'},
        )
    ],
    interval,
    annotations=[plot_components.VariantAnnotation([variant], alpha=0.8)],
)

Hope that helps!

Yue_Zhou · August 12, 2025, 9:57pm

Thanks @tward for replying to this thread, and thank you @Jingyu_Zeng for bringing up this question.

I am facing the same issue, so I’d like to follow up on the progress of the internal fix on the AlphaGenome side.

My use case is to use dna_model.score_variant to get variant scores. When my ref and alt alleles differ in length, the tracks are shifted, so splicing changes always score high, and most of them are false positives. For instance, for the variant chr2:178637146:GATATATAT>GATATATATATATAT, score_variant gives me a value of 0.9960933 for SPLICE_SITES, but looking at the track, this is just because of the coordinate shift (top panel in the figure below).
I am able to change the plotting as tward suggested, but I don’t know how to apply it before scoring for splicing (usage, junction, sites) with the score_variant function. Another problem I noticed: if I pad the array, the track after the variant location is fixed, but the track before the variant location is shifted and is now off, which will also cause false positive calls if we score the variant for splicing on this padded data.
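The upstream shift described above comes from padding the whole array at its start. A minimal sketch of a piecewise alternative is below: it leaves rows before the variant untouched and only re-indexes rows after it. This is an illustration, not part of the AlphaGenome API. The function name `align_alt_to_ref` and its parameters are hypothetical, and it assumes the ALT track starts at the same interval start as REF and follows the alternate sequence, so only rows downstream of the variant are displaced. It only affects arrays you plot or compare yourself; it does not change what score_variant does internally.

```python
import numpy as np


def align_alt_to_ref(alt_values, variant_offset, ref_len, alt_len, fill=np.nan):
    """Re-index ALT rows onto REF coordinates (hypothetical helper).

    alt_values:     array of shape (positions, tracks), ALT-sequence order.
    variant_offset: 0-based row of the variant's first base in the interval.
    ref_len/alt_len: lengths of the reference and alternate alleles.
    """
    n, n_tracks = alt_values.shape
    shared = min(ref_len, alt_len)

    # Rows before the variant already line up with REF.
    upstream = alt_values[:variant_offset]
    # Rows for the bases both alleles cover.
    anchor = alt_values[variant_offset:variant_offset + shared]
    # Rows after the ALT allele map to rows after the REF allele.
    downstream = alt_values[variant_offset + alt_len:]

    parts = [upstream, anchor]
    if alt_len < ref_len:
        # Deletion: deleted REF bases have no ALT prediction, so mark a gap.
        parts.append(np.full((ref_len - alt_len, n_tracks), fill, dtype=float))
    # Insertion: rows for the inserted bases are simply dropped.
    parts.append(downstream)
    aligned = np.concatenate(parts)

    # Restore the original length: trim the tail or pad it with `fill`.
    if aligned.shape[0] >= n:
        return aligned[:n]
    tail = np.full((n - aligned.shape[0], n_tracks), fill, dtype=float)
    return np.concatenate([aligned, tail])


# Toy example: 10 positions, 1 track, a 4 bp -> 1 bp deletion at row 3.
toy = np.arange(10, dtype=float).reshape(-1, 1)
print(align_alt_to_ref(toy, variant_offset=3, ref_len=4, alt_len=1).ravel())
```

Filling with NaN keeps the gap visible in plots. Passing `fill=0.0` instead gives an array that can be compared numerically, at the cost of treating deleted or truncated positions as zero signal.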

tward · August 26, 2025, 8:35pm

Thanks for this report, and sorry for the slow response.

Re: variant scoring, we currently don’t do any variant normalization, which we think might be tripping up our sequence alignment. We’ll look into fixing this, but for the time being you should be able to get more accurate predictions by providing normalized, left-aligned variants, using something like bcftools to do the normalization.

For example, normalizing your chr2:178637146:GATATATAT>GATATATATATATAT to chr2:178637146:G>GATATAT, we get a more reasonable raw splice site score of 0.566406.
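For a quick check without external tools, the allele-trimming part of normalization can be sketched in a few lines. This is only an illustration: `trim_alleles` is a hypothetical helper, not an AlphaGenome or bcftools function. It removes bases shared by both alleles but does not consult the reference genome, so it cannot left-align a variant through a repeat the way a full normalizer such as bcftools does.

```python
def trim_alleles(pos, ref, alt):
    """Trim bases shared by REF and ALT, keeping at least one base in each.

    pos is the 1-based position of the first REF base. This is a sketch of
    allele trimming only; it does not left-align against a reference genome.
    """
    # Drop shared trailing bases; the position is unaffected.
    while len(ref) > 1 and len(alt) > 1 and ref[-1] == alt[-1]:
        ref, alt = ref[:-1], alt[:-1]
    # Drop shared leading bases; each one moves the position right by one.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


pos, ref, alt = trim_alleles(178637146, 'GATATATAT', 'GATATATATATATAT')
print(f'chr2:{pos}:{ref}>{alt}')  # chr2:178637146:G>GATATAT
```

For this variant, trimming alone reproduces the normalized form quoted above. Variants that also need shifting left through repeated sequence still require a reference-aware tool.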

Tom
