Validation of API usage

Hi,

I wanted to confirm the correct usage of the AlphaGenome API. I am calculating the loss and correlation on shorter regulatory regions (cCREs from the ENCODE registry) for 4 biosamples: (“HepG2": “EFO:0001187”, “IMR-90”: “EFO:0001196”, “K562”: “EFO:0002067”, “GM12878”: "EFO:0002784”). I am using the interval API for these calculations, and comparing them with the actual targets for ATAC and DNase, computed using a process similar to the Alphagenome paper.

I am currently using the AlphaGenome API. To ensure that I am using the predict_interval API correctly, I would appreciate your confirmation on the following usage.

Specifically, my implementation looks like this:

# Extract fields from dataloader sample
chromosome = sample['chromosome'][0]
start = sample['start'][0].item()
end = sample['end'][0].item()
strand = sample['strand'][0]
biosample_id = sample["biosample_id"][0]

# I use this to extend my region of interest to meet the minimum input requirement to the AlphaGenome API.

interval = genome.Interval(
    chromosome=chromosome, start=start, end=end, strand=strand
).resize(dna_client.SEQUENCE_LENGTH_2KB)

…

output = alphagenome_model.predict_interval(
         interval=interval,
         requested_outputs=[dna_client.OutputType.ATAC, dna_client.OutputType.DNASE],
         ontology_terms=[biosample_id],
         )

# Extract signal tracks on the original interval
orig_interval = genome.Interval(chromosome=chromosome, start=start, end=end, strand=strand)
atac_td = output.atac.slice_by_interval(orig_interval, match_resolution=True)
dnase_td = output.dnase.slice_by_interval(orig_interval, match_resolution=True)

atac_values = atac_td.values.squeeze()
dnase_values = dnase_td.values.squeeze()

if strand == "-":
    atac_values = atac_values[::-1]
    dnase_values = dnase_values[::-1]

I would be grateful if you could confirm whether:

1. Interval preparation: Resizing the input genomic interval to dna_client.SEQUENCE_LENGTH_2KB before calling predict_interval is the correct approach.

2. Output extraction: Slicing back to the original interval with slice_by_interval(…, match_resolution=True) and reversing the values for the negative strand is the appropriate way to align predictions (i.e. is the output of the model in the forward direction).

3. The outputs are unprocessed (i.e. the outputs are the raw assay value predictions after reversing smoothening and scaling by non-zero average)? The targets I’m testing against have not been smoothened and divided by the non-zero-average.

4. I am using max sequence target lengths of ~350 base pairs. I have noticed that the per-base correlations specifically for DNAse-seq are much lower (0.2-0.4) than what was reported on the longer intervals in the AlphaGenome paper. I wanted to ask if your team had any measured correlations for independent cCRE sequences that had readings that align with the correlations reported in the paper.

Thanks,

Hi @Viraj_Doshi, welcome to the forum!

Your code looks mostly fine (apart from unnecessary reverse complement, see below). Note that you can make multiple biosample predictions in a single call, which would be quicker than doing them one at a time :smiley: Responses inline:

1. Interval preparation: Resizing the input genomic interval to dna_client.SEQUENCE_LENGTH_2KB before calling predict_interval is the correct approach.

Yes, we only support a sub-set of sequence lengths so you’ll need to resize to be able to make a prediction.

2. Output extraction: Slicing back to the original interval with slice_by_interval(…, match_resolution=True) and reversing the values for the negative strand is the appropriate way to align predictions (i.e. is the output of the model in the forward direction).

You don’t need to reverse complement: this will already be taken care for you (i.e. stranded tracks will be correctly reverse complemented before being returned).

3. The outputs are unprocessed (i.e. the outputs are the raw assay value predictions after reversing smoothening and scaling by non-zero average)? The targets I’m testing against have not been smoothened and divided by the non-zero-average.

I… think that’s correct? I’ll double check with colleagues and reply if that’s not the case!

4. I am using max sequence target lengths of ~350 base pairs. I have noticed that the per-base correlations specifically for DNAse-seq are much lower (0.2-0.4) than what was reported on the longer intervals in the AlphaGenome paper. I wanted to ask if your team had any measured correlations for independent cCRE sequences that had readings that align with the correlations reported in the paper.

Beyond the sequence length ablations in fig7 there’s not been a huge amount of analysis done at smaller sequence lengths. Nearly all of our benchmarks are done using the full 1Mb interval window centered around the region of interest, which we then crop.

Hope that helps!

Thanks! Waiting to hear back about question 3!

Hi, when I comment out the strand reversal,

# if strand == "-":
#    atac_values = atac_values[::-1]
#    dnase_values = dnase_values[::-1]

the correlation scores drop by a large amount (mean correlation drops by ~0.2, ATAC mean correlation drops by ~0.4 and DNase mean correlation drops by ~0.1). Are you sure this is supposed to be commented out? I have verified that our target data is correctly oriented.

Thank you

Just to clarify, our target for the -ve strands is also reversed

Just clarified and yes we return raw, un-normalized tracks.

Re strand: Yes we should return already reverse complemented tracks for any track on the negative strand. If you have a concrete example that you think is wrong, please do share and we’ll investigate further!