Identifying Regions of Allelic Disruption

Hello,

I am trying to use the output of dna_model.predict_variant() to identify the genomic regions in which the the predicted values for the reference and alternate alleles significantly deviate. These can be seen easily in the sashimi genome track images shown in the tutorial, but I would like to extract these regions and the assay track they are identified in as a bed style table. Example below:

Chromosome Start End REF/Alt Score Difference Assay TF/HistoneModification Gene
chr1 1000000 1001000 -1.5 ChIPseq FAVtf NA
chr1 1003000 1004000 5 RNAseq NA FAVgene
chr1 1000000 1001000 -10 ATACseq NA NA

All of the information to create type of file this seems to be within the output object, but I’m struggling to parse it and get what I need out of it. Any help with the data wrangling would be greatly appreciated.

Hi!

You will have to define what you mean by ‘significantly deviate’, but the dna_model.predict_variant() output has everything you need. For instance, if you run this query for a variant and interval:

variant_output = dna_model.predict_variant(
interval=interval,
variant=variant,
requested_outputs=[dna_client.OutputType.RNA_SEQ],
ontology_terms=['UBERON:0001157'],
) # Colon - Transverse.

variant_output contains two TrackData objects, each containing an np.array, where the rows are positions relative to the requested interval, and columns are different tracks.

If you take the difference between these objects:

alt_minus_ref = variant_output.alternate.rna_seq - variant_output.reference.rna_seq

it will return another TrackData object of the same shape, containing the differences between the ref and alt predictions across the requested interval.

The numpy array alt_minus_ref.values will contain the values you can use to identify which regions in the interval have large differences.

For an example of this in practice, along with a visualisation, see the TAL1 locus tutorial.

Once you have identified regions with large differences (you will have to do this yourself) as 0-based indices of the TrackData np.array, e.g. (100,150). To map these indices to Chromosome, Start, End you can use:

alt_minus_ref_sliced = alt_minus_ref.slice_by_positions(100,150)

alt_minus_ref_sliced.interval will contain the chromosome, start, end for that region.
You can also use alt_minus_ref_sliced.values to construct ‘REF/ALT Score Difference’ (e.g. by taking the mean value across the positional axis).

Your other columns can be sourced from the Track metadata, which is contained in alt_minus_ref.metadata .

See TrackData documentation for the structure and available manipulations of the outputs.

I hope that helps!