Hi everyone! I am wondering whether it is possible to upload a WT vs a modified sequence (e.g. mutation or missing 500-1000bp) to analyze the gene expression and open chromatin predictions?
I have been reading the tutorials, quick guide, etc., but I get the impression that I cannot upload the sequence itself?
Thank you!
Hello,
If you can express the modification as a Variant, you can make a prediction via predict_variant.
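For a deletion like the one described in the question, one possible way to express it as a single variant is the VCF-style convention: the reference allele is the base just before the deletion plus the deleted bases, and the alternate allele is that anchor base alone. The sketch below is plain Python and does not call AlphaGenome. The helper name deletion_as_ref_alt and its arguments are hypothetical, and whether predict_variant accepts an indel of 500-1000 bp is not confirmed in this thread, so treat it as an illustration of the allele bookkeeping only.

```python
def deletion_as_ref_alt(sequence, del_start, del_end, seq_start_1based=1):
    """Express the deletion of sequence[del_start:del_end] as VCF-style alleles.

    del_start/del_end are 0-based, end-exclusive offsets into `sequence`.
    seq_start_1based is the 1-based genomic coordinate of sequence[0]
    (hypothetical argument, used only to report a genomic position).
    Returns (position, reference_bases, alternate_bases), where position is
    the 1-based coordinate of the anchor base preceding the deletion.
    """
    if not 0 < del_start < del_end <= len(sequence):
        raise ValueError("deletion must lie inside the sequence and leave an anchor base")
    anchor = sequence[del_start - 1]           # base immediately before the deletion
    reference_bases = anchor + sequence[del_start:del_end]
    alternate_bases = anchor                   # only the anchor base remains
    position = seq_start_1based + del_start - 1
    return position, reference_bases, alternate_bases


# Toy example: delete "TA" (offsets 3 and 4) from an 8 bp sequence.
position, ref, alt = deletion_as_ref_alt("ACGTACGT", 3, 5)
```

The returned values would then be the candidates for the position, reference and alternate fields of a variant, assuming the library uses 1-based positions as VCF does; that is worth checking against the API documentation.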
Otherwise you should be able to use predict_sequence to make two predictions (one for the WT and one for the modified sequence) and then compare any changes in expression. For example:
from alphagenome.models import dna_client

model = dna_client.create('my_key')

# Placeholder sequences; replace with your own WT and modified DNA.
reference = 'A' * dna_client.SEQUENCE_LENGTH_2KB
alternate = 'G' * dna_client.SEQUENCE_LENGTH_2KB

# Prediction for the WT sequence.
ref_predictions = model.predict_sequence(
    reference,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
    ontology_terms=['UBERON:0001496'],
)

# Prediction for the modified sequence (note: pass alternate, not reference).
alt_predictions = model.predict_sequence(
    alternate,
    requested_outputs=[dna_client.OutputType.RNA_SEQ],
    ontology_terms=['UBERON:0001496'],
)
You can then do e.g. ref_predictions.rna_seq - alt_predictions.rna_seq to compute the diff.
Hope that helps!
Hi! Thanks a lot for the response
I basically have a 500 000 bp sequence as WT and an altered one of 499 600 bp. Shall I just upload them as follows?:
WT_reference = “copy paste the 500 000 bp sequence” * dna_client.SEQUENCE_LENGTH_0.57KB
P_altered = “copy paste the 499 600 bp sequence” * dna_cliente.SEQUENCE_LENGTH_0.49MB
Thanks a lot again!
Not quite: you’ll need to pad the sequence strings to be a supported length, which you can do using Python string methods. So for your example:
WT_reference = "GATTACA"
WT_reference = WT_reference.center(dna_client.SEQUENCE_LENGTH_500KB, 'N')
P_altered = 'GATTACA'
P_altered = P_altered.center(dna_client.SEQUENCE_LENGTH_500KB, 'N')
This pads the WT_reference and P_altered sequences with N's to a supported 500KB length. You can then make the two predictions and compare them using e.g. OverlaidTracks (see the "Visualizing predictions" page of the AlphaGenome documentation for examples).
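One detail to watch when centering sequences of different lengths: a 500,000 bp WT and a 499,600 bp altered sequence get different amounts of left padding, so the same base ends up at a different index in each padded string. The sketch below is plain Python and makes the padding explicit so the left offset can be kept for aligning tracks later. The function name pad_to_supported_length is hypothetical, and the list of lengths is an assumption about the values behind the dna_client.SEQUENCE_LENGTH_* constants; in real use, read them from the library instead.

```python
# Assumed values of dna_client.SEQUENCE_LENGTH_2KB ... SEQUENCE_LENGTH_1MB.
# These numbers are an assumption; prefer the constants from the library.
ASSUMED_SUPPORTED_LENGTHS = (2_048, 16_384, 131_072, 524_288, 1_048_576)


def pad_to_supported_length(sequence, supported_lengths=ASSUMED_SUPPORTED_LENGTHS,
                            pad_char="N"):
    """Center `sequence` in the smallest supported length that fits it.

    Returns (padded_sequence, left_offset), where left_offset is the index of
    the first real base inside the padded string.
    """
    candidates = [n for n in sorted(supported_lengths) if n >= len(sequence)]
    if not candidates:
        raise ValueError("sequence is longer than every supported length")
    target = candidates[0]
    total_pad = target - len(sequence)
    left = total_pad // 2              # any odd leftover base goes on the right
    right = total_pad - left
    return pad_char * left + sequence + pad_char * right, left


# Toy stand-ins for the real sequences: a 500,000 bp WT and a 400 bp deletion.
wt = "ACGT" * 125_000
altered = wt[:250_000] + wt[250_400:]

wt_padded, wt_offset = pad_to_supported_length(wt)
alt_padded, alt_offset = pad_to_supported_length(altered)
shift = alt_offset - wt_offset         # bases upstream of the deletion move by this much
```

With these toy inputs, bases upstream of the deletion sit 200 positions further right in the altered padded string than in the WT one, and bases downstream sit 200 positions further left, which matters if you subtract or overlay the two prediction tracks position by position.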
Thanks again for your great work! Just a quick clarification:
If I format modifications as Variant, is it possible to combine multiple variants together for a single prediction (i.e., one forward pass within the same context window)? Or in that case, is it better to use predict_sequence instead?
Also, when using Variant, does the model still reconstruct the full sequence internally? I assume yes, but if not, is there any known difference in either speed or output performance compared to predict_sequence?
Thank you again
If I format modifications as Variant, is it possible to combine multiple variants together for a single prediction (i.e., one forward pass within same context window)?
No, our variant pipeline can only support single-position edits. For more complicated variants, as you said, it's best to use predict_sequence, which gives you full control of the DNA passed to the model.
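As a rough illustration of building such a sequence yourself, the sketch below applies several non-overlapping edits to a DNA string before it would be passed to predict_sequence. It is plain Python and does not call AlphaGenome. The helper name apply_edits and the (offset, ref, alt) tuple format with 0-based offsets are hypothetical conventions chosen for this example.

```python
def apply_edits(sequence, edits):
    """Apply non-overlapping (offset, ref, alt) edits to `sequence`.

    offset is a 0-based index into `sequence`; ref is the expected original
    bases at that offset; alt is what replaces them. Substitutions, deletions
    and insertions are all expressed this way.
    """
    ordered = sorted(edits, key=lambda edit: edit[0])
    previous_end = 0
    for offset, ref, _alt in ordered:
        if offset < previous_end:
            raise ValueError(f"edit at offset {offset} overlaps an earlier edit")
        if sequence[offset:offset + len(ref)] != ref:
            raise ValueError(f"reference bases do not match at offset {offset}")
        previous_end = offset + len(ref)
    # Apply from right to left so earlier offsets stay valid after length changes.
    result = sequence
    for offset, ref, alt in reversed(ordered):
        result = result[:offset] + alt + result[offset + len(ref):]
    return result


# Toy example: one substitution (C>T) and one 2 bp deletion (ACG>A).
modified = apply_edits("ACGTACGT", [(1, "C", "T"), (4, "ACG", "A")])
```

Checking the reference bases before editing catches off-by-one mistakes in the offsets early. The modified string would still need padding to a supported length before prediction.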
Also, when using Variant, does the model still reconstruct the full sequence internally?
Yes, internally we read the sequence from the organism's FASTA file, apply the variant bases and then run the forward pass twice (once for REF and once for ALT).
There is a subtle difference for splicing: with predict_variant we use the union of predicted splice site positions and RNA-seq data for the specific interval to generate splice junction predictions, which we can't do for predict_sequence, where we don't know the genomic region. In practice this doesn't make a huge difference to splicing performance, but it is something to be aware of.