Uploading sequences to AlphaGenome

Hi everyone! I am wondering whether it is possible to upload a WT sequence and a modified sequence (e.g. with a mutation or a 500-1000 bp deletion) and compare the gene expression and open chromatin predictions?
I have been reading the tutorials, quick guide, etc., but I get the impression that I cannot upload the sequence itself?
Thank you!

Hello,

If you can express the modification as a Variant, you can make a prediction via predict_variant.
Otherwise, you should be able to use predict_sequence to make two predictions (one for the WT and one for the modified sequence) and then compare any changes in expression. For example:

from alphagenome.models import dna_client

model = dna_client.create('my_key')

reference = 'A' * dna_client.SEQUENCE_LENGTH_2KB
alternate = 'G' * dna_client.SEQUENCE_LENGTH_2KB

ref_predictions = model.predict_sequence(
    reference,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
    ontology_terms=['UBERON:0001496'],
)

alt_predictions = model.predict_sequence(
    alternate,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
    ontology_terms=['UBERON:0001496'],
)

You can then do e.g. ref_predictions.rna_seq - alt_predictions.rna_seq to compute the diff.
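To make the comparison step concrete, here is a minimal sketch in plain Python. It does not call AlphaGenome: the two lists are made-up stand-ins for the per-position values of one predicted track, and track_difference and largest_change are hypothetical helper names, not part of the library.

```python
def track_difference(ref_track, alt_track):
    """Per-position difference (ref - alt) between two equal-length tracks."""
    if len(ref_track) != len(alt_track):
        raise ValueError("tracks must have the same length to be compared")
    return [r - a for r, a in zip(ref_track, alt_track)]


def largest_change(diff):
    """Return (index, value) of the entry with the largest absolute difference."""
    index = max(range(len(diff)), key=lambda i: abs(diff[i]))
    return index, diff[index]


# Made-up stand-in values for one track (not real model output).
ref_track = [1.0, 2.0, 5.0, 1.0]
alt_track = [1.0, 2.5, 1.0, 1.0]

diff = track_difference(ref_track, alt_track)
peak_index, peak_value = largest_change(diff)
```

A position-by-position difference like this is only meaningful when both sequences have the same length and the same coordinates, which holds for substitutions but not for insertions or deletions.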

Hope that helps!


Hi! Thanks a lot for the response.
I basically have a 500,000 bp sequence as WT and an altered one of 499,600 bp. Should I just upload them as follows?
WT_reference = "copy paste the 500 000 bp sequence" * dna_client.SEQUENCE_LENGTH_0.57KB
P_altered = "copy paste the 499 600 bp sequence" * dna_client.SEQUENCE_LENGTH_0.49MB

Thanks a lot again!

Not quite: you’ll need to pad the sequence strings to be a supported length, which you can do using Python string methods. So for your example:

WT_reference = "GATTACA"
WT_reference = WT_reference.center(dna_client.SEQUENCE_LENGTH_500KB, 'N')

P_altered = 'GATTACA'
P_altered = P_altered.center(dna_client.SEQUENCE_LENGTH_500KB, 'N')

This pads the WT and P_altered sequences with N's to the supported 500KB length. You can then make the two predictions and compare them using e.g. OverlaidTracks (see the "Visualizing predictions" page of the AlphaGenome documentation for examples).
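For the 500,000 bp and 499,600 bp sequences mentioned earlier in the thread, the padding step could be wrapped in a small helper like the sketch below. It is plain Python and does not import alphagenome: SEQUENCE_LENGTH_500KB is hard-coded to 2**19 as an assumption about the value of dna_client.SEQUENCE_LENGTH_500KB (use the library constant in real code), pad_to_length is a hypothetical helper name, and the sequences are synthetic placeholders.

```python
# Assumption: mirrors dna_client.SEQUENCE_LENGTH_500KB (believed to be 2**19 bp).
# In real code, use the constant from the library instead of this literal.
SEQUENCE_LENGTH_500KB = 2**19


def pad_to_length(sequence: str, target_length: int, pad_char: str = "N") -> str:
    """Center `sequence` inside a string of `target_length`, padding with `pad_char`."""
    if len(sequence) > target_length:
        raise ValueError(
            f"sequence is {len(sequence)} bp, longer than target {target_length} bp"
        )
    return sequence.center(target_length, pad_char)


# Synthetic placeholders with the lengths from the question (not real DNA).
wt_sequence = "ACGT" * 125_000    # 500,000 bp
alt_sequence = "ACGT" * 124_900   # 499,600 bp

wt_padded = pad_to_length(wt_sequence, SEQUENCE_LENGTH_500KB)
alt_padded = pad_to_length(alt_sequence, SEQUENCE_LENGTH_500KB)

# Number of N's added on the left of each sequence.
wt_left_pad = (SEQUENCE_LENGTH_500KB - len(wt_sequence)) // 2
alt_left_pad = (SEQUENCE_LENGTH_500KB - len(alt_sequence)) // 2
```

One thing to keep in mind: because the two sequences differ in length, centering gives them different left offsets (12,144 vs 12,344 N's under the assumed length), so the same array index does not refer to the same base in the two predictions.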

Thanks again for your great work! Just a quick clarification:

If I format modifications as Variant, is it possible to combine multiple variants together for a single prediction (i.e., one forward pass within same context window)? Or in that case, is it better to use predict_sequence instead?

Also, when using Variant, does the model still reconstruct the full sequence internally? I assume yes — but if not, is there any known difference in either speed or output performance compared to predict_sequence?

Thank you again!

If I format modifications as Variant, is it possible to combine multiple variants together for a single prediction (i.e., one forward pass within same context window)?

No, our variant pipeline only supports single-position edits. For more complicated variants, as you said, it's best to use predict_sequence, which gives you full control of the DNA passed to the model.
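If you go the predict_sequence route with several edits, you have to build the modified sequence string yourself. The sketch below is one way to do that in plain Python; Edit and apply_edits are hypothetical names (not part of alphagenome), positions are 0-based offsets into your own reference string, and edits are assumed not to overlap.

```python
from typing import NamedTuple


class Edit(NamedTuple):
    start: int  # 0-based offset into the reference string
    ref: str    # bases expected in the reference at `start`
    alt: str    # replacement bases ("" for a deletion)


def apply_edits(reference: str, edits: list[Edit]) -> str:
    """Apply non-overlapping edits to `reference` and return the new sequence."""
    result = reference
    upper_bound = len(reference)
    # Work from right to left so earlier coordinates stay valid after each edit.
    for edit in sorted(edits, key=lambda e: e.start, reverse=True):
        end = edit.start + len(edit.ref)
        if edit.start < 0 or end > upper_bound:
            raise ValueError(f"edit at {edit.start} is out of range or overlaps another edit")
        if reference[edit.start:end] != edit.ref:
            raise ValueError(f"reference bases at {edit.start} do not match {edit.ref!r}")
        result = result[:edit.start] + edit.alt + result[end:]
        upper_bound = edit.start
    return result


# Toy example: one substitution (A -> C at offset 1) and one 2 bp deletion.
modified = apply_edits("GATTACA", [Edit(1, "A", "C"), Edit(3, "TA", "")])
```

The resulting string can then be padded to a supported length and passed to predict_sequence alongside the unmodified one.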

Also, when using Variant, does the model still reconstruct the full sequence internally?

Yes, internally we read the sequence from the organism's FASTA file, apply the variant bases, and then run the forward pass twice (once for REF and once for ALT).

There is a subtle difference for splicing: there we use the union of predicted splice site positions and RNA-seq data for the specific interval to generate splice junction predictions, which we can't do for predict_sequence because we don't know the genomic region. In practice this doesn't make a huge difference to splicing performance, but it is worth being aware of.
