Identifying Regions of Allelic Disruption

Adam_Miranda · July 3, 2025, 9:21pm

Hello,

I am trying to use the output of dna_model.predict_variant() to identify the genomic regions in which the the predicted values for the reference and alternate alleles significantly deviate. These can be seen easily in the sashimi genome track images shown in the tutorial, but I would like to extract these regions and the assay track they are identified in as a bed style table. Example below:

Chromosome	Start	End	REF/Alt Score Difference	Assay	TF/HistoneModification	Gene
chr1	1000000	1001000	-1.5	ChIPseq	FAVtf	NA
chr1	1003000	1004000	5	RNAseq	NA	FAVgene
chr1	1000000	1001000	-10	ATACseq	NA	NA

All of the information to create type of file this seems to be within the output object, but I’m struggling to parse it and get what I need out of it. Any help with the data wrangling would be greatly appreciated.

Clare_Bycroft · July 10, 2025, 9:29am

Hi!

You will have to define what you mean by ‘significantly deviate’, but the dna_model.predict_variant() output has everything you need. For instance, if you run this query for a variant and interval:

variant_output = dna_model.predict_variant(
interval=interval,
variant=variant,
requested_outputs=[dna_client.OutputType.RNA_SEQ],
ontology_terms=['UBERON:0001157'],
) # Colon - Transverse.

variant_output contains two TrackData objects, each containing an np.array, where the rows are positions relative to the requested interval, and columns are different tracks.

If you take the difference between these objects:

alt_minus_ref = variant_output.alternate.rna_seq - variant_output.reference.rna_seq

it will return another TrackData object of the same shape, containing the differences between the ref and alt predictions across the requested interval.

The numpy array alt_minus_ref.values will contain the values you can use to identify which regions in the interval have large differences.

For an example of this in practice, along with a visualisation, see the TAL1 locus tutorial.

Once you have identified regions with large differences (you will have to do this yourself) as 0-based indices of the TrackData np.array, e.g. (100,150). To map these indices to Chromosome, Start, End you can use:

alt_minus_ref_sliced = alt_minus_ref.slice_by_positions(100,150)

alt_minus_ref_sliced.interval will contain the chromosome, start, end for that region.
You can also use alt_minus_ref_sliced.values to construct ‘REF/ALT Score Difference’ (e.g. by taking the mean value across the positional axis).

Your other columns can be sourced from the Track metadata, which is contained in alt_minus_ref.metadata .

See TrackData documentation for the structure and available manipulations of the outputs.

I hope that helps!

Topic		Replies	Views
Uploading sequences to AlphaGenome Help & Support	5	606	July 17, 2025
Coordiate change when predicting long variants Feedback & Feature Requests	3	422	August 26, 2025
Overlaying Reference and Alternate variant tracks Help & Support	2	292	July 12, 2025
Validation of API usage Help & Support	5	189	September 16, 2025
Why are there different tracks for different cell line Help & Support	4	52	October 1, 2025

Identifying Regions of Allelic Disruption

Related topics