Hi,
I wanted to confirm the correct usage of the AlphaGenome API. I am calculating the loss and correlation on short regulatory regions (cCREs from the ENCODE registry) for 4 biosamples: ("HepG2": "EFO:0001187", "IMR-90": "EFO:0001196", "K562": "EFO:0002067", "GM12878": "EFO:0002784"). I am using the interval API (predict_interval) for these calculations and comparing the predictions with the actual ATAC and DNase targets, which I computed using a process similar to the one in the AlphaGenome paper.
To ensure that I am using predict_interval correctly, I would appreciate your confirmation of the following usage.
Specifically, my implementation looks like this:
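# Imports used by this snippet
from alphagenome.data import genome
from alphagenome.models import dna_client
# Extract fields from dataloader sample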
chromosome = sample['chromosome'][0]
start = sample['start'][0].item()
end = sample['end'][0].item()
strand = sample['strand'][0]
biosample_id = sample["biosample_id"][0]
# Extend the region of interest to meet the minimum input length required by the AlphaGenome API.
interval = genome.Interval(
    chromosome=chromosome, start=start, end=end, strand=strand
).resize(dna_client.SEQUENCE_LENGTH_2KB)
…
output = alphagenome_model.predict_interval(
    interval=interval,
    requested_outputs=[dna_client.OutputType.ATAC, dna_client.OutputType.DNASE],
    ontology_terms=[biosample_id],
)
# Extract signal tracks on the original interval
orig_interval = genome.Interval(chromosome=chromosome, start=start, end=end, strand=strand)
atac_td = output.atac.slice_by_interval(orig_interval, match_resolution=True)
dnase_td = output.dnase.slice_by_interval(orig_interval, match_resolution=True)
atac_values = atac_td.values.squeeze()
dnase_values = dnase_td.values.squeeze()
if strand == "-":
    atac_values = atac_values[::-1]
    dnase_values = dnase_values[::-1]
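To make the coordinate bookkeeping above concrete, here is a minimal pure-Python sketch of the resize-then-slice round trip as I understand it. It is only an illustration under two assumptions that are part of what I am asking about: that resize keeps the interval's midpoint fixed, and that the predicted values come back at 1 bp resolution in forward-strand order. The helpers resize_centered and extract_region are hypothetical names for this sketch, not part of the AlphaGenome API, and the sketch does not handle intervals that would extend past a chromosome boundary.

```python
from __future__ import annotations


def resize_centered(start: int, end: int, width: int) -> tuple[int, int]:
    """Return (new_start, new_end) of length `width` around the original midpoint.

    Assumption: this mirrors a midpoint-preserving resize; the real API may
    round odd widths differently.
    """
    center = (start + end) // 2
    new_start = center - width // 2
    return new_start, new_start + width


def extract_region(values: list, inner: tuple[int, int], outer_start: int, strand: str) -> list:
    """Slice the original region out of forward-strand values from the wide window.

    `inner` is the original (start, end); `outer_start` is the start of the
    resized window. For the minus strand the slice is reversed so that index 0
    corresponds to the 5' end of the region on that strand.
    """
    lo = inner[0] - outer_start
    hi = inner[1] - outer_start
    region = values[lo:hi]
    return region[::-1] if strand == "-" else region


# Example with a hypothetical 350 bp region and a 2048 bp model window.
window_start, window_end = resize_centered(1000, 1350, 2048)
fake_values = list(range(window_end - window_start))  # stand-in for predictions
plus = extract_region(fake_values, (1000, 1350), window_start, "+")
minus = extract_region(fake_values, (1000, 1350), window_start, "-")
print(window_start, window_end, len(plus), plus[0], minus[0])
```

If the real outputs are already strand-oriented rather than forward-oriented, the reversal in extract_region (and in my snippet above) would be wrong, which is exactly what I would like to confirm.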
I would be grateful if you could confirm whether:
1. Interval preparation: Resizing the input genomic interval to dna_client.SEQUENCE_LENGTH_2KB before calling predict_interval is the correct approach.
2. Output extraction: Slicing back to the original interval with slice_by_interval(…, match_resolution=True) and reversing the values for the negative strand is the appropriate way to align predictions (i.e., is the model output always reported in forward-strand orientation?).
3. Output scaling: Are the outputs unprocessed, i.e. are they raw assay-value predictions with the smoothing and the scaling by the non-zero average already reversed? The targets I'm testing against have not been smoothed or divided by the non-zero average.
4. Short-interval correlations: My target sequences are at most ~350 base pairs long. I have noticed that the per-base correlations, specifically for DNase-seq, are much lower (0.2-0.4) than those reported for the longer intervals in the AlphaGenome paper. Has your team measured correlations on independent cCRE sequences, and do they align with the correlations reported in the paper?
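For point 4, here is a minimal sketch of a per-base correlation over one region, written in plain Python so it does not depend on the API. It assumes Pearson correlation as the metric, and the function name pearson is a hypothetical helper for this sketch. One thing it illustrates: Pearson correlation is unchanged by a positive rescaling of either track, so the scaling in point 3 would change the loss but not this correlation, whereas smoothing could change both.

```python
import math


def pearson(x: list, y: list) -> float:
    """Per-base Pearson correlation between two equal-length signal tracks.

    Returns NaN when either track is constant (zero variance), which can
    happen on short regions with no signal.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("tracks must have equal length of at least 2")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return float("nan")
    return cov / math.sqrt(var_x * var_y)


# Toy tracks (made-up numbers, not real predictions or targets).
pred = [0.1, 0.4, 0.9, 0.3, 0.2]
target = [0.0, 1.0, 3.0, 1.0, 0.0]
scaled_target = [2.5 * v for v in target]  # positive rescaling of the target
print(pearson(pred, target), pearson(pred, scaled_target))
```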
Thanks,