Using AlphaGenome to predict on the wheat genome

In my recent exploration of genomic prediction tools, I utilized AlphaGenome, an advanced deep learning framework designed for modeling DNA sequences and predicting functional genomic elements such as RNA-seq profiles, splice sites, and chromatin accessibility. My goal was to apply this tool to wheat genome sequences, particularly focusing on gene regions annotated in chromosome 5A (e.g., chr5A:587411454-587423416), which contains key genes involved in developmental or stress-response pathways.

The workflow began with interval normalization: since AlphaGenome only accepts specific sequence lengths (such as 2KB, 16KB, 100KB, etc.), I used the .resize() method to extend my region of interest to exactly 16KB, ensuring that the center point remained unchanged to preserve biological context. I then retrieved the corresponding DNA sequence and passed it into the model via the predict_sequence() API, requesting outputs like RNA_SEQ, SPLICE_SITES, and SPLICE_SITE_USAGE.
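The centered resize described above can be illustrated without the library. This is a minimal sketch of the arithmetic only, assuming the midpoint is floor-divided and the window is split evenly around it. `resize_centered` is a hypothetical helper, not AlphaGenome's implementation, and `Interval.resize()` may round differently in edge cases.

```python
SEQUENCE_LENGTH_16KB = 16_384  # assumed value: 2**14 bases


def resize_centered(start: int, end: int, width: int) -> tuple[int, int]:
    """Return a (start, end) window of `width` bases centered on the input."""
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width


# Gene region from the post: chr5A:587411454-587423416.
region = resize_centered(587_411_454, 587_423_416, SEQUENCE_LENGTH_16KB)
print(region)
```

For this region the sketch gives 587409243-587425627, which is the interval quoted in the Colab code comment further down.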

One of the most powerful features of AlphaGenome is its ability to return high-resolution track data, including predictions for multiple tissues or conditions. This allowed me to compare predicted expression patterns across different assays and identify potential alternative splicing events through splice junction predictions. I visualized these results using built-in plotting components like plot_components.Tracks() and plot_components.TranscriptAnnotation(), which helped overlay transcript structures with predicted RNA expression levels.

However, there were some challenges. The requirement for fixed-length intervals sometimes led to the loss of biological context if the original region was much smaller than the target length. Also, while the API is well-documented, the lack of native support for plant genomes (especially wheat) meant that I had to manually map gene annotations and ensure correct strand handling. Additionally, the need to manually assign .interval fields to TrackData objects was cumbersome and could be error-prone.
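The manual annotation mapping mentioned above can be kept in one place. A minimal sketch, assuming the wheat coordinates are moved onto a placeholder chromosome by a constant offset, as the Colab code further down appears to do (chr5A:587411454 becomes chr22:17411454). `Region`, `OFFSET` and `to_placeholder` are hypothetical names, not part of AlphaGenome.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chromosome: str
    start: int
    end: int
    strand: str


# Assumption: offset inferred from the Colab code (587,411,454 -> 17,411,454).
OFFSET = 570_000_000


def to_placeholder(
    region: Region, chromosome: str = 'chr22', offset: int = OFFSET
) -> Region:
    """Shift a region onto placeholder coordinates, keeping its strand."""
    if region.start - offset < 0:
        raise ValueError('offset is larger than the region start')
    return Region(
        chromosome, region.start - offset, region.end - offset, region.strand
    )


first_exon = to_placeholder(Region('chr5A', 587_411_454, 587_411_932, '-'))
print(first_exon)
```

Applying the same function to every exon, CDS segment and codon keeps the strand attached to each shifted interval, so there is one place to check when a track looks mirrored.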


:light_bulb: Suggestions

  1. Support for Plant Genomes Out-of-the-Box
    Currently, AlphaGenome seems optimized for mammalian genomes (e.g., human, mouse). Adding pre-trained models or annotation mappings for common crop species like wheat, rice, and maize would greatly enhance usability in agricultural genomics.

  2. Flexible Interval Handling
    Allow dynamic padding or trimming strategies (e.g., “pad_left”, “center”, “trim_start”) to better preserve biological relevance when resizing intervals.

  3. Mutable TrackData Objects
    Consider making TrackData mutable or providing helper functions to easily update attributes like .interval, rather than requiring manual use of dataclasses.replace().

  4. Batch Processing Improvements
    While batch prediction is possible using ThreadPoolExecutor, adding higher-level wrappers or integration with Dask/Biopython pipelines could streamline large-scale analyses.

  5. Visualization Enhancements
    Improve visualization components to allow easier customization of color schemes, labels, and multi-track comparison, which is especially useful for comparing variants or splice isoforms.
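For suggestion 4, the ThreadPoolExecutor pattern can be wrapped in a few lines. A minimal sketch, assuming the prediction call is thread-safe and mostly waits on the network. `predict_many` and `fake_predict` are hypothetical; `fake_predict` only stands in for a real call such as `dna_model.predict_sequence`.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Sequence, TypeVar

T = TypeVar('T')


def predict_many(
    sequences: Sequence[str],
    predict_fn: Callable[[str], T],
    max_workers: int = 4,
) -> list[T]:
    """Run predict_fn on every sequence in parallel, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the inputs.
        return list(pool.map(predict_fn, sequences))


def fake_predict(sequence: str) -> int:
    """Stand-in for a model call; returns the sequence length."""
    return len(sequence)


results = predict_many(['ACGT', 'GATTACA'], fake_predict)
print(results)
```

Because results come back in input order, they can be zipped with the list of intervals or gene IDs that produced the sequences.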


:white_check_mark: Conclusion

Overall, AlphaGenome provides a robust and flexible platform for genomic prediction tasks. Its integration of deep learning with genomic interval manipulation makes it ideal for studying gene regulation, splice variation, and expression dynamics. With minor enhancements to support non-model organisms like wheat, it could become a go-to tool for both basic research and applied crop breeding programs.

Code I used in Colab:

from IPython.display import clear_output
! pip install alphagenome
clear_output()

from alphagenome.data import gene_annotation, genome, track_data, transcript
from alphagenome.models import dna_client
from alphagenome.visualization import plot_components
from google.colab import userdata
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

dna_model = dna_client.create(userdata.get('ALPHA_GENOME_API_KEY'))

sequence = 'ACATCATAACTATGCGCTCTTCTATTAATTGTTCAACAGTAATTTGTTTACCCACAGTTACTACGCTCATGAGAGAGATGCCTCTAGTGAAAGCTATGGCCCCCGCGTCCATTCATAGTATATTACTAAAATCCTAAATACCTTGCTGCAATTTATTTAATTGTTTTGTTTTACAATTTATCTATCTATCATTACCAGAATTAATCTTGCAATTAATGAGTACAAGGGGATTATTAACCCTCTTGCCTGCATTGGGTGCAAGTATTTGTTTTGTGTGTGTGCAATTTATCTTTGTTTGCGTGAATCTTCTATTGGTTCGATAAACATTGGTTCTTAACCGAGGGAAATACTATCTGCTACTATACTACATCACCCTTCCTCTTTGGGGAAATCCCAACGCCTGTCACAAGTAGTAGAAGAATTTCCGACGCCGTTGACGAGGAGGTTTCACCAAAAAAATTAGGTACCTGCACACACACACACACACCTTATTTTCTTGCTTTATTTATGCTTTCTTTCATGATGACTCAATAAAATAAAAATAAAGACATATGGATCCTCATCCACTTGCTAATCTTTTCAAACTTGCTGCTTATTGGGATACATTTATTGAATGAAAAACATGATTGCAATGTTGTTAATATTAATTCCATGAAAGTTAATTGTGCTAATGACAATGACTGGGGTGACGATCATAATGCTATGAATATTAAAAGTGGGTTTGGAATCTCACTATTTTGGAGAACAATCAATATTACGAAATTTCTGATAAAAGTTGGTTTGGAGAGGTCATGACTTTAGTTGATGTTAATATAATCCCACTATCTTGGAAGATGATAAAATTTGCATGCATGTGAATCATGGAGAGAATATTTTATATGATAGCTATATTGTTGAATTTGATTGCGATTCTACATGTAATTATTTTGAGAAAGGAAAGTATAGCTATAGAAATCTTCATGTTACTCAATTACCTCTCTTGATGTTCAAATTAGTAATGTCTCATCCTTCTTTCTTGCATTTGCATATGCTAAATATTGCTTAGTTTGATAATTAGTTTCATTATGAAATGCATTTACATAGGAAGTGGATTAGACTTAAATGTGATCATTACATGTTATATGACGCTCTCTTTGTGTTTAAATTCTTGTATTTTATGTGAGTATCATTGAAATCTAAAGCCTATCTTAATGACTATAAAGAGCGAGCTTGTTGGGAGACAACCCGATAGTTATCTTTATTTTTCTTTTCTGCCTTGTGAGTCAACATGGTTATTGCTACTGTAATGATTGTGTTTTATCTTTTATTTTGAGTCTGTGCTAAGTAAAGTCTTTGTGATGATTTAGATGATAGTTGAATTGATTCTGTGCAAAAACAGAAAGTTTCGCGCCCAGTATTTGAATTTCACAATAAATACTTGAGTTCTTATGACTAATGTGAATTTGTGGACTTTAAGTACTTCCACCCTTCCATATTTTGCTAGCCTCTTTGGTACCATGCATTGCTCGTTCTTACCTTGAGACTTGGTGCAAACTTCACCGGTGCATCCAAATCCTGTGGTATGACACACTTTATCACACATAAACTTTATTGCACCCTTCGTCAAAACAAACACCATACCTACCTATCATGACATTTTCATAGTCATTCAGAGATATATTGTCATGCAACTTCCACGATTACTATTCACATGACTCGAGTGTTCATTGTCATTTTACTTTGCATGATCATATAGAGCTGACATGATATTTGTGGCAAAGCCACCGTTCGACATTGTTATACATGTTACGCTAGATCATTGCACATCCTGACACACTGCCAGAGGCATTCATATGTAGTGATATCATTCGGTTTATCGAGTTGTAAGTAAAAAGAAGTGTGATCATCATTATAAGAGCGTTGTCCCAGTGAGGGAAAGAATGATGGAGACTAATGAGTCCTCGAAAAAGTGGGAATGAGGTTCATGATATTGTTCTAAAAAAGGCCAA
AATATCGTCAGCGGGCTCAAACTGCATAGTTCATTGGTGTTGATTAAAAAGTGTGCTGTTTAGAACTAGTGCAGAGTATCCATATCAATCATTCCTTTTCTTTTATGGCATGTTTCTCTCGGTTTATGATGGCAGGAAATGTATGGGCGGCAGAACCTTTTGCGTTGGTGACTCTGATGTCATCCCTATATACTCTCTTTCAAGTCCAGATGGGAAGACGACGAAGACATAAAGGTCCTGATTTATTTATTTTCAAGAATCACAATATACACAATGCTCATATTTGATAGGTTAATTCACATAAACAACACCCAGTAGAGACGGGTATCATGGAAACTTGAAAATTCATCACTCAAGCTACATATAGATTACATGGTAAATTACTCGTACAGCCATCTCAGCCCAGCTGGGAGGGAAACTGAGGTGGACAAAGTGAAATATACACATCGCTGCAGCTTGCTACTTTACTCTGATTTCTTTTCCTTTCCCCTCCAAAGGGGTCAGGCGTGCTAGCAACCGCAACATACACCAGGCTGGCCGGTGCAACTTGTTACCCTCTACTGAATAGTACGCCTGTATGGGCTGGATGCCCTTCACCCGTTGATGTGGCTCACCATCCACGGTGGAAGCCCCGTCCGGGGTGGGGCCTGCGGCTGCACTGCCGCATCCTCTGCCCTCTCGCCTGTTGCCGCTGGATGAATGCTGCACAACCAAAGCGAACACAAGCATATTAGTATTATGTTACTTAAATGTGGCCCGACAGAACTGCATAGAGACCAAAAGTGGGCTGCAAGGAACAACCCGACCAATTCAAAAGATGGTTACTTGAACTCTGAATAAGATTAGACTTGTGATGTACATCATCACCTGGTATTTGCGGCAGGGGGAGCATCCCTCAGCATGAAGGAAGAAGATGAAGAGCTGGTTTGAGGCTGAGTTTGATCTTGCTGCGCCGCATGGGCCTTCTGCTTCTCCACGAGCTGTGGGGAAGGGACACGGACCTCTGTCAATAAAATTTGCTATACGGAACAATTTAGACCGGTTGGACCATAAGGTGATATAACGGGCTTACTTCCTTCTGGAGAACTTTATTCTCCTCCTGCAGTGACCTCTCCTGCAGCAAGAACGATGTAATGAGGTTACGTGCATGTAGACCAGTTACTTGCATACATGTTATAATGTCACAAATCTTGAAACAAGCTAAGGCTTCATGACAAGTTGACCAGTTCGAATACCGAATAAGGTATGCAAGGTTGACAGCTTACCTTCTTCTGAAGCTCAGAAATGGATTCGTGCATAAGTTGGTTCTGCAAGTACAAAAAATGAAAAAAATTAGTCATTTTTTTAACTATGAAGAGCATATGACTAAAATGCATATACATATTTTGAACATCTCAGTCTAGAATCTGATTTCAGATATACTTTGCTGAACTTCTCTGCAAGTGTTTTTCTTGTTTTTTTATATATTGTGCTGCTGTATCAAATCATTTAAATCAGTACCTTCCTGGATCTGATATGTTTCAGTGAGCTTTCCAGCTGCTGCTCCAGTTGCTGCAACTCCTTGAGATTCAAAGATTCAAGATCCTCTCCCATGAGATGCCTGCAGATGCAGGATGTAGAAATTCAGTCCCATTTATCTTCAGAGTGTATACAATATACTGGCAATCAACCAAAATCGTTACAAATTACTTTTGACATTTCTGTATTGTCTCAACCTTCGCCTTCAGTTTCCTATATTCGTGACACCAGTTTCCCTGCACAAACAAATAACAATCGCATTGCTAACTTTGGGTGGGACAGGAAGGAACAATGACAGAAATGCTACAAATAGGCTAAGTTAGTTCAGAAATGAACAAAGAGACTTTAACAACATTGACACCACCACCACCAACAACAACAACAAAGCCTTTAGTCCCAAACAAGTTGGGGTAGGCTAGAGGTGAAACCCATAAGATCTCGCAACCAACTCATGGCTCTGGCACATGGATAGCAAGCTTCCA
CGCACCCCTGTCCATAGCTAGCTCTTTGGCTTTTGGGTTTCATCTCCATAAGAGTGGCTGAGTTTTTACGTTGGCTCGCCAAGCCTATCACAACCCTCCTCCTTTACCCCGGGGGCATAAATCAGGAAGGATTGTTCTGTGCAGTGAACTTGAATTTTCATAACTTCAAATCAATTATGTCACTGCAACAAGTGTTACTATCAGATGTAAAAATCATATTTCGTAATATTAAAAATGCTGCAGCGGAGGTAAATATTTAAATAAAATTTGGAATATTTTTTGGGTTAGTAATTACATGTTTAAATAGTCCTCGTCATACGTACATGCTGAATCTGTTTGTCAGCAATCAGATAGAACTGGTTGGATCCCTCAGTTGGAGCTACAACTTGGTTTGTGCAAAAATTGGCCAAAAAATTGAAGCGAGGTCCTGGAGTTCCGAGTTACAAAAATAAGAATATTGAGCCTAACTGGATCTTCTAAGTCTGCAGGATCTACTTTTTATCCTAGATGGTTAGCCTTAGACAAGCTGTAATAAGAAACTGCAGCAATATCGCAAACAATACAGAAAAATTGGAGATGAGAACTAATGTCATGGGGCTATCCTTAATGGCCATTAGTAGCATAATACAATCCGAAGGACTGGAGATTCAACAGTTTATCCTAAGAACTTAAGATCAATGACCAATCTAGTAACAAGTCTTAATAATCTAGGATAGCATTTTGTTATTGCAATCAAAATATTGTGATTTTCACTGATGCACACAAGTATTTTAAAGCCATTATAATATATAAGCTTGTCAATTTTTGTTGCTAAGCCCTTCAAAAACTCAGCCGGCTAATCCAGAATTGCTGATCTGGCCATGCATTGCAGTTTACTACGTACTAATTTAGTAACTCTTAAAGTCAGATCACATTTTTGTAGTTCTTCCAAAGGCAGTATGTACGAATCAGCACGCTAGTCTAGTTCAGTAGCCATGAAAGTAAATTGGTCAAGTGGTATTTGGTTACATTATTCGACCATGCCTAAGCATTCATCGAAGAAAAACTGATAATAGGTTACAAGAGAATCAAATAGACCAACAATCCATAGTTGCAAGTCAATTAACTTCATAGCGCATGGAAATCATTCAAAATATAGTGGCCTACAAATAAATTTCAGAACATCTGATATCGGCAACAATTTATTTCATAGCTAAAGGAAAGCAAACCGCTTGTTTTTCATTTTTACCTGAATTTCAGATTCACTTGAAACGAGAACCTTTTCTGCATAAGAATAGCGCTCATACCGTTCAAGAATTTTGTCCATACTGCATAAGAGAAAGAAATATCTAGTCATTGGAAGTACATGTAAACAAGATTAATGTCCAGTATATGCACAATCTCAAGAATTGATATATGATACATTTTGACATATCAGAAGCTTTTCTGTAATGCGTTGTTTCATGTCAGTATATGCAAATGAGATTCAAATGTGGTATACAGCATCTGGTCAAATAGTTTGCAAAGCATTTCACTGTGACGTGTAAATGATAGAAAGAGCACGAAACTGACTTCAAATTAAGGTTCTGGTGTTGTTCAAGTTTTGTTGGGAAAATAGAGTGACCCTTTTCTTCTGTCTATAGACTTAACGAAATTTGCAAAACAGAAGTTTAGGGGTGGAATTGTGCTAGGCAAGGTTAGTATAATTTCTGCCAGATTCAAATTATTTAACTATAGTGCTAGAGCTTCAGTTTAAGAAACGTTCCTTGATGTTGTATTTAAGTCATTAACTTTTTTACTACTCATGTTATTCTCCTCCTATACCTTCTCCCTGTTTTCCCGTAGGCTTTACCCAGCATCCTAACTATTTGTGGAGGTCAACGTGTTCTCTGTCACCACAAACATCGTGTCCTCAATTGTCACAAGTACCCACAACACATGGAAAATTGTCAAACGCTACCCTATAACCTATAGTAGGAATGAACCAACAATTAGTTATAGTGTTGAACTATGTGCTAGCC
TATAACCCGCAAATGGTTACAATGGTCATCAAGTCAAAATCAGTTCAAGAGTGTGCAACAACTCACGTATCCAATCAGCTAGACTCACAAACCGATAATATAGAAGGGTGTGACCAACAACAATTATCTGGGCAAGAGCCATGCCTAGATAAGTAAGACAACACGAATGTGAGAACCATTTTAAAACCTAATGTGTGGGTCCATGCAAAGGTATTGTCAGTACAAATATAGGTGGTAGTAGCAAATTTTCAAATGAAAGCTTTGATGCCAAAGCAGATAAATGCCCGTGTGAAAAGAAAACATGCACATAGTAAGGTGGTCCAAGGGATCCCAAGGGTTAGGCAACCCTAGAAAATAGGCGCGGCTCTTATCGGGGTGACGACACACCGGGTGGTCTTGCTGGTTTAGTTTTGCCTTTGACGAGTGGCTTGTCCACCGGGGTTCCTTATTGTAAGCATCGTGTTATCATGGTGTGAGGATGTTCCATCGGATGCATGTGGTAATTTATAATGGACTACTTGGTGCACCCCTGCAGGGCTAAATCTTTTCGGAAGCCGTGCCCGCGGTTATGTGGCGACTTAAAAATTTATAATATCCGATTTTAGAGAACTTGACACTGAACCCAATTAAAATACACCAACCGCGTGCGTTACGTGATCGTCTCTTTTCCAAGGAGTTCGGGAAGTGAACACGGTGGGGTTATGACTGACTCATAAGTAGTTCAGGATCACTTCTTGATCATTAATAGTTTGCGACCGCTATGTGTAGTTTACTCTTCTTACTCTTGTACTCGTAAGTTAGCCACCATACAAATGCTTAATGCTTCCAGCCTCCTCACCACTTAACCCTTCCATACCCATTAATCTTTGCTAGTTTTGCTACCCTTGGTAATGAGATTGCTGGGTCCCCATGGCTCATAGATTACTACAACAGGTGCAGGTATAGATAAAGCAATGCTTGACGTGAGAGCGATGCATGTTTGCTTTTGGAGTTCTTCTTCTGCTTCTTCGTCGATCATAGGATGGGTTCCAGTTCAGGAGCCCGGGATTAGCAGGGTAGATGTCGTTCTTCTTTTTGTTTGATTTCATCCATAGTCGGATCCTGCTCTCCTGTATGATGATTGTTGTGTATTGATGTATTCATGTTGTAGCTTGTGGCGAGTGTAAACCTTTATCTGGTATTCTCATCTATTTAGTGCATGGTATGTTGTAATGATATCCACCTCGCTATGCGCTCGAAATGCGATTCTGCTCCGATCATGAATTCATCACGTGATCGGGATAGAATCTTGGGTGCTACAAGTTGCTCTGCGCAGCCCCCCTCCAGGACTAGGTCCGTCTGGTCCCTAAAGGGGGGCAAAGGATCCGCGGGCTACTCCTCCACCTTATGCCAAAAATTGAAGGCGAGAGGGAGCTCCTCTGGATGCGCTGTGGACGACAAAGGGGATGGTGGATTGGAGGCGTGTGAGATTGGAATAAGAGGCGTCGAAGATAGCATTGAAGCTAGAGAGCGGAGGGCGCCGGCCTTTTACAGGCCACCGCAGGCGCCATTAGGGAGGCACTTGGGGACGCGGGTGGCAGCTAATGGCGGGTAGTCGAAGGGCGCGCCTCCCTATGAGAAATGCTGCAGCGACCTACTCGCACACCTCCAGTCTGCTCACATGCAAGCAAGAAGATGATGCCGCATGGCGGTCATCCAATGGACCGCCATAAGAATCCGAGGGCGCGGGTTGCTTCAAATCCGACAGGAGGCGCGAGGAGGAAGCTTAAAACTTTAGCGGTTGGTTTTGGGGCGAGCTCCTTACTGGACCTCTCCAACCTCCAAATATGGAGGCCACGAGACCTGTGTATGGAGCGAGATAAAAGAAATATAGCGACGGAAGCCATGAAGACTCTGCTAGAGTTGCTCTAAGACCTAGTTGTTGCATCATCATGAGATGGGAAATCTGATGTACCGTGGAGCAAGGACATCAAATTGTACAATGTACAACTTCAGGAAAT
ACTTGGCATTCAATTTTCCTTTTTGAGAAAAGACATGCACATGGTAAGTTGCTAAGACCAAGGCAAGAGATACAACAACAAAGCGTTTAGTCCCAAATAAGTTGGATAGGGTAGAGCTGAAATCCATAAGATCTCAAAACCAAGTAATGGTTCTGGCGTGTGGATAGCTAACTTCCACACACCCCTGTCCATGACTAGTTCCGTGTGATACTTTAGTGCTTCAGATCTCTCTTCACGGACTCCTTCCATGTCAAGTTTGGTCTAGCCCAACCTCTCTTACATTATTAGCATGTATTAACCGCCTGCTATGCACGAAAGTTTCTAAAGGCCTACTTTGTATATATCCAAGCCATCTCAGACGATGTTCGACAAGCTTCCCTTCAATCGGTGCTACCCCGACTCTACATGTATATCATCATTTTGGACTCGATCCTTTCTTGGGTGACCACACATCCATCTCAAGATGCGCATTTCTGCTACACCCAATTGTTGAATATGTCTCATTTTAGTCGGCCAATACTCAGCGCCATACATTGCGGGGCGCCGTCCTATCGAACCTGCCTTTTAGCTTTCGTGGCACTATCTTGTCAAGAGAGAACACCAGAAACTTGGCATCATTTCCTCCATCCGACTTTGATTTAATGGCTCACATCTTCATCAATATCGGCATCCTTTTCCAACACTGATCCCAAATATCAAAAGGTGCCCTTCTGAGGTACCACCTACCCATTGAGGCTAACCTCCTCGTGCCAAGTAGTACTGAAACCGCACCTCATGTACTCAATTTTAGTTCTACTAAGCCTAAAAACTTTGATCCAAAGTTTGTCTCCATAGCTCTAACTTTCCATTATGCCTCGAGCTACTATCATTGACTAGCACCACATCAATAGTAAAGAGAATACACCATGGAATATCTCCTTGCATATCCCTTGTGACCTCATCCATCATCAAGGCAAAAAGATAAGGGCTCGAAGTTGACCCTTGGTGCTGTCCTATTTTATGCGGAAGGTCATCAGTGTCGCCATCACTTTTTTGAACACTTGTCATGAGATCATTGTACATGTCCTTGATGAGAGTAATGTACTTTATTGGGACTTTGTGTTTCTCCAAGGACCACCACATGACATTTTGCAGTATCTTATTGCAGGACTTTTTCAAGTCAATGAACATCATATGCAAGTCTTCTTTTTGCTCCCTGTATTTCTCCATAAGTTTTCATACCAAGAAAATGACTTCCATGATCAACCTCTCAGGCATGAAAACAAACTGATTTTTGGTCACGCTTGCCATTACTCTTAAGTGGTGCCCAATGACTCCTATAACATTATTCTTTGTAAAGCCTCCCTAATCTCAGACCCCTAGATTCATCTCTCCATTGAATAGCTTGTCGGAGTACTCCTGTCATCTATGCTTGATCTCCTTGTCCTTCACCAGGAGTTGATCTGCTCTATCCTTGATGCATTTGACTTGTTTGACATCCCTCGCCTTCCTCACGATCTTAGCCATCTTATAGATGTCCCTTGCGCCTTCCTTCCCGACCCTCATATGCCCGGGCACAATGTGGTCTTTTGTACTCCCTGGTAAACATTAGTGCACAATGTCCCTTGCGGTTATTCTCTCACTAATTTTTACTAGATGTCCCTTGCGCCTTCCTTCCCGACCCTCATATGCCCGGGCATAACCACTAGTAAAAATTAGTGCAAGAATATGGTCTTTTGTACTCCCTGGTAAACAAATATAAGAGCGTTCAGATCACTCAAGTAGTGACCTAAACGCTCTTATATTTCTTTACGGAGGGAGTATCTTGGTATGGCAAACAAAGATATATTTTTTGCCCTTGACAAAACACATTTTTCAAAGGATCTTCAGAGACTTGAATTTTGAATTTTGATTATCAAAGTTGTAATACATGGCTCTCCACCACAATACCCTCGCGTTTGTAAGATAAAAGTGGTACAGAAAGTACTAGATGATAACCAATCTATGATTAGTGGGTTAGG
AGTGCGGTTGTACTCCCAACATGGAAAAAAGGACAATGAAAAGAAGCCACTTAGGCGCTAAGCAGACTACCGCCTCACTTTTGCCATAGGCCAAGGTTTAGGCGCCCGACTATGTTCTCCATGACCGATGAGATCTTTTTCCGGCGACGGTAGGAAAGGAGAAGCGAATGCAGGGAAGAAGCTGCAGACACCACCATTATCGCTACATACAAATGACGAATGAAAGAGACCTAACGATACAAAGAGACGGGAAAAGGGGGATGCAAAGAAAATGGACGGTCTTGTTTGGGCTATGTTGTCAAGAAAGGTTGAATCAGAGGAGGGATTTCTCACTCCAACTCACCACCGATACCTCCCTCTAGGCTCCGCACCATTGATTTCATTGTTGTAAAATCTCCTTCTTTGTCATACGAAGGCATAAGTGCGCCTAGGCGGTCATTGCACTAGGCATAGGCAAAATTCAGATTAGCGCCTACCGCTTTTTTAGTACCTTGACTCACAGCCCACCAAGGAAGATCGAACCAATGGTTTGACACACTGTGCGGATTATTCGTTCCGTGGGAGGCGACATTCCTGACGTTCCAGAGATCTTGAAGCTCCACAATTTTAGTCCTTACTAAACATTGTGTGAGTGTGTACGTGTGGTGGTAATGTGCACTTGTGTGTGCATCTACGTTGTAATCGGTGTTTCTCAAAAAAGAGATAGTCCTACTACCTTAAGCAGGAAAAGCACACATGTGATATGAAGACTGCATGTATATGTCTTTAATGCGCGTCAATTTGAAAATCAAACGCAAACTATTTGAGTGATGGGTCATAAGGTTTTGCTTAAAAACACACTTTCGTTATTGCTTGTGGTTTCATAATAGTTCATACATAAATTTATGCCATCCAAACTTTTTTATAATAAAAAAGAAATTCAAAAAGTTTTAAAATAAAATGTTAATATTAAAATTTCAAATTATAAAATTTGAGTTCAGAGAATTGTCAGTTGGAAAAAGTAGAAACATAACTTTTGTTCTTAATATTTAAATTTTGAGAATATGAAATTTATGAATAGTGACTGTGCATGTTTGTTGATGCATAGTTTGGTAAGATAAATGGTTGAACAAGAAGGGGGGCACAAAGGTACAAAATAACTAATTTTAACTAATGTAAAGCAATCAAGTTGTAACATAAATAATTATTCATTAATAGACTTCATGGTAAAACCCTTTTTGGCATACAGTCGATACCAACAACAATGACTGTGAGATTACATTTTAAATACATAACATTTCTATCATCTGGGTGGGCAAACACAATGTAAAATAAATAAATAAAGTAACAACCAGTTCGTGACATATCAAGACTATGCCATGAAATCCTGAGGTGTGATTCCATCAGCGGATACGCGAAAGTCGATAGGATCTGCTTACATGTACGTGCCCAGGATCTCCATATATTAGTTAACTTGTAACTGGGAGCTAACAAAATATCACTGCAGCAAAAATGTAACAATATGTGAAATGGTTAATAGATATGACACTTATGAAGTCATCTGCCGAGCTTCAACTATATTGTGTGCCTATTCAGTTTTTCTTCTATTTCATGGAATTAATACATGTAATGAGCAGATATAAACTCTAAATAACGGTTCTAGCTGATCATCTATATGTCGTTTGTGGACTAAGTAAAGATGTGGCAGGAGCAAAAATGGAAGTTGACTATCCCATAGCTTAGTCAGATATTCCTAGGAATAACTTCTGTTTTACTCTCAAGTATATAGGTTTAGGGACACAATAGGTGTAGAGGATAACTGGTTCCCAGACTTGGTTAAGTACTAGACCGACAACAAATGCATGTAACATAGACCCATCACACAGTCCCTAACAGCTTACTTCGTGTTTACTAGTCTAGAGATGAGTAAAGTTGGTTGCAATGTCAGCTTTGGACCTTTGCTATGTGTGGTAAATTAATCGACCGATAAGGAGATGGAGGATATAGAGTGAATAACAATT
GGGTACATTAACCATTATCCAAGCCAGGTCCAGAAAATCAGCAGGCTACATCATTAGCTGCATTTTAGTGTATGCATATGATGCCTGCTCGATCAATGCTGATTTTGTGCCTGAGTCGGTTATATGCAGGCTATAGATGCCTTTTTAGAATATTGTAAGATCTCAAGATTTTAGTTCCGATCCTAAACCATGGAATATCGTGGTGCATCAGCGTGCTGTCTCATAGGCTCTAGACAAAGCATAGCAGTTACTTGTCCCCGTGAGCTACTTACACGTGCCTATCGTGCAGAGGTAGGCAAAGCAAACCAAATGACAAGGAACGACAGAGAAGAAAATAAAGCCGGCAAGCTGATTAATGCATGGTTACCAATTCGGATGTTCATGGCAGTCTTTTCCAAATGAACAGCACGGAAACAGGGGGTAATTTTATTCAGTACTTTGATCTGAGATACTGAAGTTCACAGTGCAGCACTAGCTTAAAATATGACTGCATCTATTTATATCTTAATTTAACCGACTTTGATTCTTACGCGTTTTCATCACGTGACCAGCTTGATACATGCATGCAAGCACCTATGCGTCAAACAACCAAGGAAGCTGAAGGCAAGCCAATAAAACTTTACCTACCCTTTTCCGTTAGCGCCGCGCATACCCTCAGTGGTACAAAACTAGAAGAGATTTAGCACCTCAACATACAGGTCCTTGCCATGAAGCTGATGGTAATGTTGCTTTCTCAGCAAGATTACAAACAACAATGGATATTTACTAAGCAGTAGCGATCCTTTCGACGGTATGATATATCTTACTTTCAAACCGTGGAGGCTGATCTAGTACTTCGCCGGACGGGAATTATGGAAGCAGACAGAAATGGGAAACTAACATGGACTAATGTCAGCGTGTGGGGGAAATAATGGATCAGTCTCCGTGATTTTATTGACTCCTTTACTTTGTGAAGAACCATCGCCAGCACCAACAAATATCACGGGCAAACTATGATGGTTGAAAGGGAAAAAAAGTATGCAGAAACTCTTATTTTATATTTGGAAGGAACTCTCTACTTTTTGGTTTGACTCTTCATGCCGGAGTTTCATCAGCTGCAGACAATAATTCCATGCATTAGCTGGTCAAGTGACTCTTTTCTCCGTGCGTGCCGTAGACAGTATCTCCAAGCAAGCAACAAAATGGCCGCATGGCCGGAGCTACCAATATGAATGATAGCAAAGCCTGTGGAACAAAATTCTGGAAATAACCGAGATGACGAGGTTCATTTGTTCAAGCAGCAAGCACAGATTTGATGAGGGAGACAACAAGGTATTGTAGCGTCTAGTTAAGATTCCAAATGAGAAGATGAGGTAGCATGCTAAATTGCGATGTCACTTTAACCCCATATATATTTTTTACGTAAATAACCCCAGCCCGATAGAAATATCAAGAGATTACCGTCTTAACCCTTCCACTTGGGGTGCCACCTGGCGCGTCCGGGACCAAAATTTCCCCAAAATACGCCAGATGAACAACCTTCCCGGCGCTGGCGGAGCATAAAATAGCCCGGCGACCCATCGCGCGCCCTACCCCCCATAGCAAGAAATCGCGGCCGCGCGGAACGCCAGCTCCCCCATCGATTCTAAGAAAAACGAGCAAGCAGCCACGGCAGAATCGGATGGAAACAGCGAGCGGGCAAACGGAATCTACCAAACAGCAGCGCGGAAATGGAGAAGGCCCTGGGGTCGGCCGGGTCGTCTCGCAGGACCAGAACGGTGCGCAGGAAATCGAAATCGAAGGCGTATTGGGGAACAAATTTAAAGACAGCGCGTGCTTAATTTACCATGACTCGGTGGAGAACTCGTAGAGCTTTCCCTTGGTGGAGAAGATGATGAGGCCGACCTCGGCGTCGCAGAGCACGGAGATCTCGTGCGCCTTCTTGAGAAGCCCCGAGCGGCGCTTGGAGAAGGTCACCTGCCGGTTGATCTTGTTCTCGATCCGCTTCAGCTGCACCTTCC
CCCGCCCCATCTCCGCTCGAGAACCGGGCCAACCCTACGCCCCTACCCTCCAACACCGGCAGACCGCGACGGCTACTCCGACTGGCGCAGGCGGAGGCGAGGCGGCGGAGCCATGGCTATCAGGTGGTTGGGTGAGGACGTGAGGTGGAAGAGAGGGGAGGAGAGGGAGGATGGCCAGGCCAAAACGAGGATTCCGGCAGGGGGGGAGGGGTTTTTAAAGGGATCTGGCCCCGAGCGCGGTATGCAATACCGGGCTGGTCTGGAGCATAGCCGCTGTCGAGCACCCATTGGCCCGGCCCGCTTTGCGGGGCCCCGCGGTCCTGCAGCCACACGATGCCCCACCCCGGGCCCGCCGGGCGTCGACGTGTCGAACGATCGTACGTGCGAATCTCCGGATTTTGCTTTCCCCAAATCATTTTCTCTCGCCACAAACACACACCACAGAGCAAAAAAACGAGCAGAATTTTTCCTTTCATAATATTCTACTGTACCCTAGGCCTAGAAGAAGGGAAAGAGCGGAGTTTTTTCCTTTAGCATTTCCCCTTTTTCCTTTTTTGCTTTTTTGGTCACGGCGGCGGGGGACGAAAGAGGAAATGCTGGCTGGCTAGGTCAGGGCGGCTCGGGACGGGTCTGGTCCGTCCAGGGAAAATGGAATGGAAGGCGAGATGACGGTCGTGGGGGCGGGCTGCCACAGTAGGAGGCTGCTTTCCTTGCTCGTTTTGGGCCGTCTCGCTTGCCCCGTTTGGGAGAGGTCGGCTGCGGGGCGGGCGGGCGGGCGCGGCGGGCCGCCCGGGTCGTTCGCGCCGCGTCCATCGCGTGCATGATTGAGCTCGACTTTTGACATGAAGCGTTGCTGCAGTGCAGTGCAGTGCGCTGCGTGGATCGAGTGGGCGCAGTGTATTTTTCTACTTTTGGCCGACCCGCGTATTTCTGGAATGAATGAAAGATACTCGCCAGTCCTAGGAGTAGGGGTACTGTGCTGTGCGTGTATACCGACCAGCCATCCATCAATACACTGTTGTCAAGCGTCTCATCGTGGGGTTGTCCGACTCCATACCAATTATATACTCCAGAAACGGTTCGTGTATTTTTCTTTATATATGGAGAAAAAAAACGCTCTATAGAGAACTGTACATGTGATATATCTGCACTTCTGCAGTGACCAGCTTCTCGATATATAGGGCCGTGCTGAAACCATTTGCAGCACTAAAATGGTTTCTACACTAAAACTGCTGTTGGGCCTAATACAACATTTAGTAGTTCAGTGTTCGAACATGTGAATCATGACCCATTTTATTGTTTATGCTGACAAACAAACAAAAAGCCCATATGCATGCTTGTCAAATGACCACTTTGCAACCATTTGACGTAACCAGCAACCACCAATGCTGCTCTTTATGACCGATGATGACAGATCTCGCAATCATATCATGCACATCTCTTTTCTTTTCTTTTTTTTCTTTGCGAATATGCCATGTACATCTTTTTTCTTTTTCTTTATTCTGCAAATATGCCATGTATTAACTCCGTCCCAAATTATTTATCTTAAATTTGTCTAGATACCGAGATATCTGACACTAAAACGTGTCTAGATACTATATCTAGACAAATGTAAGACAACTAATTCGAAACATCTCTGCGATTGAAAATATACAAACATGAACATCCCTAAACTTTGCGGTGTATCTCCAAGAATGCAAAAACGATATTTAGATCACACGCACTCTATATAATCAGTACTTCAAAAACATGTCAGAGTTCCAAATGCTTGTGAAAAAATGGGTATAGCAGATATAATAATTTCTAGTTTACTGTAACTTAAAGACTATTTCAGGCAAGTTAGTGATTACTTGGTTTCCTATTTATCTTTAGCTCACGGATGATCCTTTTATTGCAAATTTTGCATGTATATGGACCTACCTATTATCTAAGCTAAGTTTTCACAATTGTTAGAACATCTCCACTCGTTCGGCCCCAGGGGCTAGAAATAGCGCCGTCCT
GGGGGAGTGCCGGCTGAAATATCGGCCTGGGGGCGATCTGGTTCCCAGCCGCCGTCCTCAGGTGCTGATTTTGGCCCACTTTCAGTGCAAATTGGCCCACTTTTCAGCTCATTTTGGGCGCAAATTACCCCACTATCGGCGCAAATTGGCCCACTATCGGCCGGTATTCGGCGTGCTTCGGCACAAATTCAACAGAAGCTATTTTTTTGTCACGTAGTTCATCATAGAAAATCAATAGAAATCAAATAGTTCAATACAAATAATATAGTTCAACAAATAAAACATCACACGTCGAACTAGGCGTTACCCTTGAGCCTCCATAGGTGCTCCACTAGATCCTGCTGCAGTTGTTGATGCACCAGTGGGTCTCGGATCTCCTGACGCATATTGAGGAAG'

# Define the tissues/cell types to predict expression for.
# I don't have wheat-specific ontology terms, so I reused the terms from the AlphaGenome tour.
ontology_terms = [
    'UBERON:0001159',  # Colon - Sigmoid.
    'UBERON:0001155',  # Colon - Transverse.
]

output = dna_model.predict_sequence(
    sequence=sequence,
    requested_outputs=[
        dna_client.OutputType.RNA_SEQ,
        dna_client.OutputType.SPLICE_SITES,
        dna_client.OutputType.SPLICE_SITE_USAGE,
        dna_client.OutputType.SPLICE_JUNCTIONS,
    ],
    ontology_terms=ontology_terms 
)

exons = [
    genome.Interval('chr22', 17411454, 17411932, '-'),
    genome.Interval('chr22', 17412098, 17412210, '-'),
    genome.Interval('chr22', 17412303, 17412344, '-'),
    genome.Interval('chr22', 17412496, 17412537, '-'),
    genome.Interval('chr22', 17412731, 17412830, '-'),
    genome.Interval('chr22', 17412920, 17412984, '-'),
    genome.Interval('chr22', 17414459, 17414537, '-'),
    genome.Interval('chr22', 17423056, 17423416, '-'),
]

# Coding segments (CDS) and start/stop codons, then build the Transcript.

cds = [
    genome.Interval('chr22', 17411824, 17411932, '-'),
    genome.Interval('chr22', 17412098, 17412210, '-'),
    genome.Interval('chr22', 17412303, 17412344, '-'),
    genome.Interval('chr22', 17412496, 17412537, '-'),
    genome.Interval('chr22', 17412731, 17412830, '-'),
    genome.Interval('chr22', 17412920, 17412984, '-'),
    genome.Interval('chr22', 17414459, 17414537, '-'),
    genome.Interval('chr22', 17423056, 17423240, '-'),
]

start_codon = [genome.Interval('chr22', 17423240, 17423243, '-')]  
stop_codon = [genome.Interval('chr22', 17411824, 17411827, '-')] 

my_transcript = transcript.Transcript(
    exons=exons,
    cds=cds,
    start_codon=start_codon,
    stop_codon=stop_codon,
    transcript_id='TraesCS5A02G391700.1',
    gene_id='TraesCS5A02G391700',
    protein_id='TraesCS5A02G391700.1',
    uniprot_id=None,
    info={'gene_name': 'VRN-A1'}
)

# Wheat region chr5A:587409243-587425627, shifted by 570,000,000 onto 'chr22' coordinates.
interval = genome.Interval('chr22', 17_409_243, 17_425_627).resize(
    dna_client.SEQUENCE_LENGTH_16KB
)

import dataclasses

new_rna_seq = dataclasses.replace(output.rna_seq, interval=interval)
new_splice_sites = dataclasses.replace(output.splice_sites, interval=interval)
new_splice_site_usage = dataclasses.replace(output.splice_site_usage, interval=interval)

new_output = dataclasses.replace(output, rna_seq=new_rna_seq, splice_sites=new_splice_sites, splice_site_usage=new_splice_site_usage)

# Build plot.
plot = plot_components.plot(
    [
        plot_components.TranscriptAnnotation([my_transcript]),
        plot_components.Tracks(
            tdata=new_output.rna_seq,
            ylabel_template='RNA_SEQ: {biosample_name} ({strand})\n{name}',
        ),
        plot_components.TranscriptAnnotation([my_transcript]),
        plot_components.Tracks(
            tdata=new_output.splice_sites,
            ylabel_template='SPLICE_SITES'
        ),
        plot_components.TranscriptAnnotation([my_transcript]),
        plot_components.Tracks(
            tdata=new_output.splice_site_usage,
            ylabel_template='SPLICE_SITE_USAGE'
        )
    ],
    interval=interval,
    title='Predicted RNA Expression (RNA_SEQ) for the sequence',
)

Thank you so much for your great post! It’s exciting to see the model being used on novel tasks, and your constructive feedback is greatly appreciated. To respond to your suggestions:

  1. Support for Plant genomes out-of-the-box

This is something we’re very interested in but don’t have plans to add in the short term.

  2. Flexible Interval Handling

The Interval class already has some functions for padding, resizing and trimming, and our TrackData classes also have some functionality for resizing predictions, but given your use of predict_sequence, maybe you mean raw sequence strings? We find Python’s string API helpful here: it makes centering, trimming and padding straightforward.

  3. Mutable TrackData objects

Thanks for this feedback! We’ve just added support to predict_sequence for optionally passing an interval, which is used to construct the TrackData objects. This should remove the need to update them afterwards.

  4. Batch Processing Improvements

We did consider adding more pipeline support, but we made the conscious decision to keep the alphagenome package as lean as possible. We’d love to see what others build here though to make pipelining easier!

  5. Visualization Enhancements

We do have some support for color customization, axis labelling, overlaid track components etc., but if there’s something specific that’s missing, we can try and add something for sure!
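On the second point, centering, trimming and padding a raw sequence string might look like the sketch below. `fit_sequence` is a hypothetical helper, and padding with 'N' assumes the model tolerates unknown bases at the flanks.

```python
def fit_sequence(sequence: str, width: int, pad_char: str = 'N') -> str:
    """Center-trim or center-pad `sequence` to exactly `width` characters."""
    if len(sequence) >= width:
        # Too long: keep the central `width` characters.
        start = (len(sequence) - width) // 2
        return sequence[start:start + width]
    # Too short: pad both sides with pad_char.
    return sequence.center(width, pad_char)


padded = fit_sequence('ACGT', 8)
trimmed = fit_sequence('AACCGGTT', 4)
print(padded, trimmed)
```

The same call with a supported model length as `width` would bring a region of any size to a length the model accepts.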

Thanks again!
