Hi!
You will have to define what you mean by ‘significantly deviate’, but the dna_model.predict_variant()
output has everything you need. For instance, if you run this query for a variant and interval:
variant_output = dna_model.predict_variant(
interval=interval,
variant=variant,
requested_outputs=[dna_client.OutputType.RNA_SEQ],
ontology_terms=['UBERON:0001157'],
) # Colon - Transverse.
variant_output
contains two TrackData objects, each containing an np.array, where the rows are positions relative to the requested interval, and columns are different tracks.
If you take the difference between these objects:
alt_minus_ref = variant_output.alternate.rna_seq - variant_output.reference.rna_seq
it will return another TrackData object of the same shape, containing the differences between the ref and alt predictions across the requested interval.
The numpy array alt_minus_ref.values
will contain the values you can use to identify which regions in the interval have large differences.
For an example of this in practice, along with a visualisation, see the TAL1 locus tutorial.
Once you have identified regions with large differences (you will have to do this yourself) as 0-based indices of the TrackData np.array, e.g. (100,150). To map these indices to Chromosome, Start, End you can use:
alt_minus_ref_sliced = alt_minus_ref.slice_by_positions(100,150)
alt_minus_ref_sliced.interval
will contain the chromosome, start, end for that region.
You can also use alt_minus_ref_sliced.values
to construct ‘REF/ALT Score Difference’ (e.g. by taking the mean value across the positional axis).
Your other columns can be sourced from the Track metadata, which is contained in alt_minus_ref.metadata
.
See TrackData documentation for the structure and available manipulations of the outputs.
I hope that helps!