Exporting output data

Hey there,

I have been setting up a pipeline for variant prediction and would like to export the values underlying the track visualization.

However, I ran into a problem when trying to acess the RNA-Seq data, as there are multiple assays stored in output.reference.rnaseq and I don’t understand how to access/export only the values of one RNA-Seq assay at a time (Eg. I just want the values for total RNA-Seq on the positive strand).

Would really appreciate some help on this!

Hi There!

Thanks for reaching out.

For any given biosample you may have up to four output tracks for RNAseq: ‘polyA plus RNA-seq’ and ‘total RNA-seq’, each for both the positive and negative strand. To extract values for a specific RNA-seq assay and strand, you can use the filtering methods available on the TrackData object. First, isolate the strand using .filter_to_positive_strand(). Next, apply .filter_tracks() with a boolean mask on the .metadata DataFrame to select the desired assay (e.g., ‘total RNA-seq’). Finally, access the .values attribute to get the raw numpy array for exporting.

# 1. Filter to positive strand
pos_tracks = output.reference.rna_seq.filter_to_positive_strand()

# 2. Filter to specific assay
mask = pos_tracks.metadata['Assay title'] == 'total RNA-seq'
specific_track = pos_tracks.filter_tracks(mask.values)

# 3. Export values
exported_values = specific_track.values

Kind regards,
Tumi

Thank you so much, that helped a lot.
However, I ran into some trouble with the Chip Histone samples.
Since I only want to export the H3K27ac data, I tried this approach since there is no strand information in Chip Histone:

# 1. Filter to specific assay:
mask = output.reference.chip_histone.metadata["histone_mark"] == "H3K27ac"
specific_track = output.reference.chip_histone.filter_tracks(mask.values)

#2. Export
exported_values = specific.track.values

However I when exporting I get the error that the arrays must be of the same lenght. Which makes me believe that I am either not exporting the correct data or the interval input is wrong.

Hi There!

Thanks for the follow-up.

I havent been able to reproduce the error. After predicting your variant with this:

output = dna_model.predict_variant(
  interval=interval.resize(dna_client.SEQUENCE_LENGTH_1MB),
    variant=variant,
    requested_outputs=[dna_client.OutputType.CHIP_HISTONE],
    ontology_terms=target_curie,
)

The shape of the reference results will be (size of spatial axis, number of ChiP-histone subtypes)

output.reference.chip_histone.values.shape
(8192, 6)

You then apply your H3K27ac mask (small error in your code: . instead of _ in line 3)

mask = output.reference.chip_histone.metadata["histone_mark"] == "H3K27ac"
specific_track = output.reference.chip_histone.filter_tracks(mask.values)
exported_values = specific_track.values

Exported values now only holds H3K27ac data

exported_values.shape
(8192, 1) 

Can you send me a code snippet that reproduces the error? Then we can also check your concerns about the input interval

Kind regards,
Tumi