Limiting Batch SNP Predictions to Specific Biosamples or Assays to Reduce Runtime and File Size

Hello Team,

I recently ran a batch prediction on 3,000 autoimmune-disease-associated SNPs. The job took over an hour and produced a 9 GB dataframe containing many assays and biosamples I don't need. I know that when predicting individual variants we can specify which assays or biosamples to include; is there a similar way to restrict batch scoring to only the relevant cell types (T cells, in my case), so that we avoid computing predictions for irrelevant data?

Thank you!

Best regards,
Maryam

Hi @Maryam_Dashtiahangar, welcome to the forum!

We currently don’t have a way to filter score_variant requests, but you can filter the responses afterwards to reduce the dimensionality. There’s no compute overhead per se (we compute all assays in parallel regardless); you’re right that we do send back predictions you don’t necessarily want, but the variant scores themselves should be pretty small.

Here’s an example of filtering the tidy scores by the ontology term CL:0000084 (T cell):

from alphagenome import colab_utils
from alphagenome.data import genome
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers

# Create the client and define the variant plus a 1 Mb interval around it.
model = dna_client.create(colab_utils.get_api_key())
variant = genome.Variant.from_str('chr10:120714877:G>T')
interval = variant.reference_interval.resize(2**20)

# Score the variant and convert the results into a tidy pandas DataFrame.
scores = model.score_variant(interval, variant)
scores = variant_scorers.tidy_scores(scores)

# Keep only the rows for the T cell ontology term.
filtered_scores = scores[scores['ontology_curie'] == 'CL:0000084']
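
For your full batch, one simple approach is to filter each variant’s tidy scores before accumulating them, so the combined dataframe only ever holds the T cell rows. A rough sketch (the variant list and window size below are just placeholders):

import pandas as pd

from alphagenome import colab_utils
from alphagenome.data import genome
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers

model = dna_client.create(colab_utils.get_api_key())

# Placeholder variant strings; substitute your 3,000 SNPs here.
variant_strs = [
    'chr10:120714877:G>T',
    'chr10:120714900:A>C',
]

filtered_frames = []
for variant_str in variant_strs:
    variant = genome.Variant.from_str(variant_str)
    interval = variant.reference_interval.resize(2**20)

    scores = variant_scorers.tidy_scores(model.score_variant(interval, variant))

    # Filter to T cells before accumulating so the result stays small.
    filtered_frames.append(scores[scores['ontology_curie'] == 'CL:0000084'])

t_cell_scores = pd.concat(filtered_frames, ignore_index=True)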

Hope this helps!