Ablation Study Clarification: Tiling and Interval Definition for Splice Junctions

Hi AlphaGenome team, I’m trying to reproduce Fig. 7b, the sequence-length ablation on the junction counts track, and would like some clarification so that I can follow your protocol exactly.

First, in the primary analyses, I noticed that splice_site_positions are handled differently from the targets of other modalities. On the one hand, the input DNA sequence is 1M long while the other targets cover only 196kb (predictions are cropped at the boundaries because those positions see less context and could be less accurate).

On the other hand, the Bx4x512 splice site positions range from 0 to 1M rather than being confined to the centre segment like the other modalities. I suppose this is necessary: if they were masked to the middle 196kb, the model could never predict junctions spanning up to 1M.
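To make the sizes concrete, here is a quick sketch of the arithmetic as I understand it. It assumes that "1M" means 2^20 = 1,048,576 bp (which matches interval/end minus interval/start in the metadata at the end of this post) and that the 196kb target window is the central 3/16 of the input. The variable names are mine, not taken from your code base.

```python
# Assumption: "1M" is 2**20 bp and the target window is the central 3/16 of it.
SEQ_LEN = 2**20                  # 1,048,576 bp of input DNA
TARGET_LEN = SEQ_LEN * 3 // 16   # 196,608 bp, i.e. the "196kb" targets

# Centre crop expressed in input-relative coordinates.
crop_start = (SEQ_LEN - TARGET_LEN) // 2
crop_end = crop_start + TARGET_LEN

print(SEQ_LEN, TARGET_LEN, crop_start, crop_end)
# 1048576 196608 425984 622592
```

If that is right, non-splice targets only cover input positions 425,984 to 622,592, whereas splice site positions can fall anywhere in 0 to 1,048,576.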

  1. This raises a question: for the junction counts in Fig. 7b, how are the validation intervals tiled? One might expect the tiling stride to be the full input sequence length, since junction predictions are made across the entire input, rather than 3/16 of it, which would produce overlapping predictions. However, the only exception to tiling by the 3/16 length reported in the ablation study was “Contact maps were tiled by the full sequence length”.
  2. In the ablation study, are the training target intervals still the Borzoi-defined ones? If so, how are the training target intervals derived for the shorter input lengths?
  3. Are the validation intervals based on the Borzoi fold definitions at all (e.g. to mimic skipping certain chromosome regions), or are they literally, naively “tiled uniformly” from 0 to the end?
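To illustrate the two tiling options I am asking about in question 1, here is a minimal sketch. The helper `tile_starts`, the toy chromosome length, and the choice to start at 0 and drop any trailing remainder are all my own assumptions for illustration, not your protocol.

```python
def tile_starts(chrom_len, seq_len, stride):
    """Start positions of seq_len-long windows placed every `stride` bp.

    Hypothetical helper: starts at 0 and drops any incomplete final window.
    """
    return list(range(0, chrom_len - seq_len + 1, stride))


SEQ_LEN = 2**20                  # assumed 1M input length
TARGET_LEN = SEQ_LEN * 3 // 16   # assumed 3/16 target length
CHROM_LEN = 4 * SEQ_LEN          # toy chromosome length, for illustration only

full = tile_starts(CHROM_LEN, SEQ_LEN, stride=SEQ_LEN)
partial = tile_starts(CHROM_LEN, SEQ_LEN, stride=TARGET_LEN)

# Consecutive input windows overlap by seq_len - stride bp.
print(len(full), SEQ_LEN - SEQ_LEN)          # 4 0
print(len(partial), SEQ_LEN - TARGET_LEN)    # 17 851968
```

With a 3/16 stride, neighbouring input windows share 851,968 bp, so a junction whose splice sites both fall in that shared region would be predicted more than once. That is why I would like to know which stride was used for the junction counts.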

Thank you.

```python
# loaded from 'gs:///alphagenome-datasets/v1/train/'
# metadata
{'interval/chromosome': array([b'chr19'], dtype=object), 'interval/start': array([29611252]), 'interval/end': array([30659828])}

# batch.splice_site_positions[0, :, :3]
array([[    4,   425,   686],
       [  609,   864,  1950],
       [91595, 93495, 94021],
       [88045, 88048, 91539]], dtype=int32)

# np.max(batch.splice_site_positions, axis=2)
array([[1032323, 1037071,  104310,  102199]], dtype=int32)
```
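As a sanity check on the dump above, this small sketch tests whether the per-row maxima fall inside the centre crop. It assumes the same 2^20 bp input and central 3/16 window as in my reading of the setup; the values are copied by hand from the `np.max` output, and the variable names are mine.

```python
# Per-row maxima copied from the np.max(batch.splice_site_positions, axis=2) output.
row_max = [1032323, 1037071, 104310, 102199]

SEQ_LEN = 2**20                           # assumed input length
TARGET_LEN = SEQ_LEN * 3 // 16            # assumed centre-crop length
crop_start = (SEQ_LEN - TARGET_LEN) // 2  # 425984
crop_end = crop_start + TARGET_LEN        # 622592

# True where a maximum lies inside the centre crop.
inside_crop = [crop_start <= p < crop_end for p in row_max]
# True where a maximum lies inside the full input window.
inside_input = [0 <= p < SEQ_LEN for p in row_max]

print(inside_crop)   # [False, False, False, False]
print(inside_input)  # [True, True, True, True]
```

All four maxima lie within the 1M input but outside the central 196kb, which is what led me to conclude that splice site positions are not cropped like the other targets.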