Predict gene-level expression given sequences

Hi, Thank you for the great work!
I’m interested in how mutated sequences affect gene expression, and I read the post showing how to predict RNA expression from a given sequence (Uploading sequences to AlphaGenome). I found that the RNA-seq expression predictions are given per nucleotide.

I’m wondering whether it is possible to predict RNA-seq expression at the gene level for a given sequence. If not, could you share how to aggregate the per-base-pair predictions to the gene level?

Thank you!

Hi @pumpkinguagua,

Assuming you don’t want to provide a custom sequence, you can calculate expression per gene with score_interval. It takes an interval and generates a “score”; if you use the GeneMaskScorer, you get a score for each gene in the region. Example:

from alphagenome.data import genome
from alphagenome.models import dna_client
from alphagenome.models import interval_scorers

# Replace 'MY_KEY' with your own API key.
model = dna_client.create('MY_KEY')

# A 2**20 bp (~1 Mb) interval on chr1.
interval = genome.Interval('chr1', start=2**20, end=2**20 + 2**20)

# Score the interval with a single scorer; [0] selects that scorer's result.
scores = model.score_interval(
    interval,
    interval_scorers=[
        interval_scorers.GeneMaskScorer(
            requested_output=dna_client.OutputType.RNA_SEQ,
            width=200_001,
            aggregation_type=interval_scorers.IntervalAggregationType.MEAN,
        )
    ],
)[0]

If you want to do this on a custom sequence, unfortunately you’ll have to aggregate the predictions manually for each gene region. We don’t have much in the way of helpers here, but you’d make a predict_sequence request, get the RNA_SEQ output from the response, and then slice and aggregate for each gene.
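
A minimal sketch of that slice-and-aggregate step, in case it is useful. It does not call the AlphaGenome API: `predictions` is a synthetic stand-in for the per-base RNA_SEQ values (positions x tracks), and `aggregate_per_gene` and the `gene_regions` dictionary are hypothetical names made up for illustration. Gene coordinates are assumed to be 0-based, half-open offsets into the custom sequence, which you would have to work out yourself.

```python
import numpy as np


def aggregate_per_gene(predictions, gene_regions, reduce=np.mean):
    """Aggregate per-base predictions over each gene region.

    predictions: array of shape (sequence_length, num_tracks).
    gene_regions: {gene_name: (start, end)}, 0-based half-open offsets
        into the sequence (an assumption of this sketch).
    reduce: NumPy reduction applied along the position axis.
    """
    length = predictions.shape[0]
    per_gene = {}
    for name, (start, end) in gene_regions.items():
        # Clip regions that run past the ends of the sequence.
        start, end = max(start, 0), min(end, length)
        if start >= end:
            continue  # Gene lies entirely outside the sequence.
        per_gene[name] = reduce(predictions[start:end], axis=0)
    return per_gene


# Synthetic data: 10 positions, 2 tracks.
predictions = np.arange(20, dtype=float).reshape(10, 2)
gene_regions = {"GENE_A": (0, 4), "GENE_B": (6, 12)}

per_gene = aggregate_per_gene(predictions, gene_regions)
for name, values in per_gene.items():
    print(name, values)
```

This averages over the whole gene span. Restricting to exons or handling strand would need extra masking on top of this.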

Hope that helps!

Hi! Is it possible to do this in an ontology-specific way?

Hi,

I tested model.score_interval and compared its predictions with those from model.score_variant.
I found that the correlation between the two outputs is quite high (~0.98), which is great, but the absolute values differ substantially for highly expressed genes.

  1. Why could this difference occur, especially for genes with high predicted expression?

  2. Also, could someone clarify what the width parameter in GeneMaskScorer actually means in practice?

I’ve read this thread, but the explanation there (“The width of the target interval to include in the aggregation”) is still not fully clear to me — does it define the window size around the gene body used for aggregation, or something else?

Thanks in advance!

@Ilya_Dyugay sorry for the slow reply! Currently we return predictions for all ontologies, so you’d need to filter to the ontologies of interest on the client side.
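
A rough sketch of what client-side filtering could look like once you have a genes x tracks score matrix and one ontology label per track. The function name, the array layout, and the way the labels are obtained are all assumptions for illustration, not the AlphaGenome API. The ontology identifiers are just example values.

```python
import numpy as np


def filter_tracks_by_ontology(scores, track_ontologies, wanted):
    """Keep only the track columns whose ontology label is in `wanted`.

    scores: array of shape (num_genes, num_tracks).
    track_ontologies: one ontology label per track (hypothetical layout).
    wanted: iterable of ontology labels to keep.
    """
    track_ontologies = np.asarray(track_ontologies)
    keep = np.isin(track_ontologies, list(wanted))
    return scores[:, keep], track_ontologies[keep]


# Synthetic scores: 2 genes x 3 tracks, with example ontology labels.
scores = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
track_ontologies = ["UBERON:0002048", "CL:0000084", "UBERON:0002048"]

kept_scores, kept_labels = filter_tracks_by_ontology(
    scores, track_ontologies, wanted={"UBERON:0002048"}
)
print(kept_scores)
print(kept_labels)
```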

@dpanc see this question for why there might be small differences. As for the width parameter, it is effectively a center mask: only the window of that width around the center of the interval is included when aggregating.
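
To illustrate the center-mask idea on toy data: the sketch below keeps only positions that are both inside a gene and inside a window of `width` bases centered on the interval, then averages them. This is a reading of the description above, not the library's actual implementation. `center_mask` and `masked_gene_mean` are hypothetical helpers, and the exact centering convention for even or odd widths is an assumption.

```python
import numpy as np


def center_mask(length, width):
    """Boolean mask selecting `width` positions centered in `length`."""
    if width > length:
        raise ValueError("width must not exceed the interval length")
    start = (length - width) // 2
    mask = np.zeros(length, dtype=bool)
    mask[start:start + width] = True
    return mask


def masked_gene_mean(track, gene_mask, width):
    """Mean of `track` over gene positions that fall inside the center window."""
    keep = gene_mask & center_mask(track.shape[0], width)
    if not keep.any():
        return float("nan")  # Gene does not overlap the center window.
    return float(track[keep].mean())


# Toy interval of 10 positions with a gene covering positions 1..8.
track = np.arange(10, dtype=float)
gene_mask = np.zeros(10, dtype=bool)
gene_mask[1:9] = True

# With width=4 only positions 3..6 contribute to the mean.
value = masked_gene_mean(track, gene_mask, width=4)
print(value)
```

Under this reading, a gene that extends beyond the center window contributes only the part that overlaps it.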