Question on interpolating 2048 bp contact maps from 1000 bp Micro-C data

Dear AlphaGenome team,

Thank you for the impressive work on AlphaGenome. I have a question regarding the processing of contact maps during training.

In the paper, it is mentioned that Micro-C contact maps at 1000 bp resolution were interpolated to 2048 bp resolution to align with the pairwise representation blocks. However, I was wondering why the training pipeline does not instead reprocess the original Micro-C .pairs files directly into contact maps at 2048 bp resolution.

From a data fidelity perspective, would contact maps generated by interpolating 1000 bp data be equivalent to contact maps directly aggregated from .pairs files at 2048 bp resolution? Specifically:

  1. Are there known differences in signal quality, noise, or smoothing artifacts between the interpolated maps and those binned natively at 2048 bp?
  2. Would reprocessing .pairs files potentially improve accuracy or consistency in downstream model predictions?

I would greatly appreciate any insights into this design choice. Thank you again for making your work available to the community.

Best regards,
Yusen

Hi,

There are two reasons why the contact maps were preprocessed at 1000 bp resolution.

  1. We followed the procedure described by Orca (Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale, Jian Zhou, 2021). There, the normalization and coarse-graining steps for the model that predicts with 1 Mb input context are done on 1000 bp contact maps. Note that the resolution at which coarse-graining happens has a large effect on the final contact maps.
  2. Our input intervals are not aligned to 2048 bp or 1000 bp boundaries; they are randomly shifted, as described in the Pretraining section of the methods. If we were to preprocess the data at 2048 bp resolution, we would interpolate twice: first from the 1000 bp raw data to 2048 bp bins during preprocessing, and then from those 2048 bp-aligned bins to our input interval during training. This would decrease the fidelity of the training data compared to our approach.
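
To make the second point concrete, here is a minimal 1D sketch of the effect. It is not the AlphaGenome pipeline: it assumes plain linear interpolation via `np.interp` as a stand-in for whatever resampling is actually used, works on a toy 1D track rather than a 2D contact map, and the helper names (`bin_centers`, `resample`), the 700 bp shift, and the toy signal are all made up for illustration.

```python
import numpy as np

SRC_BIN = 1000   # resolution of the preprocessed maps (from the thread)
DST_BIN = 2048   # resolution of the pairwise representation (from the thread)


def bin_centers(start, bin_size, n_bins):
    """Genomic coordinates of the centers of n_bins consecutive bins."""
    return start + bin_size * (np.arange(n_bins) + 0.5)


def resample(values, src_centers, dst_centers):
    """Linearly interpolate values from src_centers onto dst_centers.

    Assumption: linear interpolation stands in for the real resampling step.
    """
    return np.interp(dst_centers, src_centers, values)


# Toy 1000 bp track with one sharp feature in bin 10 (centered at 10,500 bp).
src_centers = bin_centers(0, SRC_BIN, 40)
signal = np.zeros(40)
signal[10] = 1.0

# Hypothetical training interval, shifted off the 2048 bp grid by 700 bp.
shift = 700
aligned_centers = bin_centers(0, DST_BIN, 18)
target_centers = bin_centers(shift, DST_BIN, 18)

# One pass: 1000 bp data -> shifted 2048 bp bins.
single = resample(signal, src_centers, target_centers)

# Two passes: 1000 bp -> grid-aligned 2048 bp -> shifted 2048 bp bins.
aligned = resample(signal, src_centers, aligned_centers)
double = resample(aligned, aligned_centers, target_centers)

print("peak after one pass:  ", single.max())
print("peak after two passes:", double.max())
```

In this toy setup the feature's peak is noticeably lower after two passes than after one (roughly 0.16 versus 0.42), because the second interpolation smooths values that the first one has already smoothed. The exact numbers depend on the chosen shift and signal, so treat them as an illustration of the direction of the effect only.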