Question on interpolating 2048 bp contact maps from 1000 bp Micro-C data

Yusen_Hou · August 4, 2025, 11:51am

Dear AlphaGenome team,

Thank you for the impressive work on AlphaGenome. I have a question regarding the processing of contact maps during training.

In the paper, it is mentioned that Micro-C contact maps at 1000 bp resolution were interpolated to 2048 bp resolution to align with the pairwise representation blocks. However, I was wondering why the training pipeline does not instead reprocess the original Micro-C .pairs files directly into contact maps at 2048 bp resolution.

From a data fidelity perspective, would contact maps generated by interpolating 1000 bp data be equivalent to contact maps directly aggregated from .pairs files at 2048 bp resolution? Specifically:

- Are there known differences in signal quality, noise, or smoothing artifacts between the interpolated maps and those binned natively at 2048 bp?
- Would reprocessing .pairs files potentially improve accuracy or consistency in downstream model predictions?

I would greatly appreciate any insights into this design choice. Thank you again for making your work available to the community.

Best regards,
Yusen

Guido_Novati · August 4, 2025, 4:10pm

Hi,

There are two reasons why the contact maps were preprocessed at 1000 bp resolution.

1. We followed the procedure described by Orca (Sequence-based modeling of genome 3D architecture from kilobase to chromosome-scale, Jian Zhou, 2021). There, the normalization and coarse-graining steps for the model that predicts from 1 Mb of input context are performed on 1000 bp contact maps. Note that the resolution at which coarse-graining happens has a large effect on the final contact maps.
2. Our input intervals are not aligned to 2048 bp or 1000 bp boundaries; they are randomly shifted, as described in the Pretraining section of the Methods. If we preprocessed the data at 2048 bp resolution, we would interpolate twice: first from the 1000 bp raw data to 2048 bp bins during preprocessing, and then from those 2048 bp-aligned bins to our input interval during training. This would reduce the fidelity of the training data compared to our approach, which interpolates only once.
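
To make the second point concrete, here is a minimal sketch of why interpolating twice tends to lose more detail than interpolating once. It is not the AlphaGenome pipeline: it assumes a 1D profile instead of a 2D contact map, plain linear interpolation via `np.interp`, a made-up Gaussian signal, and an arbitrary 700 bp shift. The helper names (`bin_centers`, `true_signal`, `rmse`) are hypothetical.

```python
import numpy as np

def bin_centers(start, bin_size, n_bins):
    """Genomic coordinates of the centers of n_bins consecutive bins."""
    return start + bin_size * (np.arange(n_bins) + 0.5)

def true_signal(x):
    # Hypothetical smooth profile (Gaussian bump, sigma = 5 kb), used only
    # so that there is a known ground truth to compare against.
    return np.exp(-0.5 * ((x - 50_000.0) / 5_000.0) ** 2)

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# "Raw" data sampled on a 1000 bp grid covering 0..100 kb.
x_1k = bin_centers(0, 1000, 100)
y_1k = true_signal(x_1k)

# Training interval: 2048 bp bins whose start is shifted by 700 bp
# (an arbitrary shift standing in for the random shift during training).
x_target = bin_centers(700, 2048, 40)

# One interpolation: 1000 bp grid -> shifted 2048 bp bins.
y_single = np.interp(x_target, x_1k, y_1k)

# Two interpolations: 1000 bp grid -> aligned 2048 bp bins -> shifted bins.
x_aligned = bin_centers(0, 2048, 48)
y_aligned = np.interp(x_aligned, x_1k, y_1k)
y_double = np.interp(x_target, x_aligned, y_aligned)

# Compare both results with the ground truth at the target bin centers.
truth = true_signal(x_target)
rmse_single = rmse(y_single, truth)
rmse_double = rmse(y_double, truth)
print(f"RMSE, single interpolation: {rmse_single:.5f}")
print(f"RMSE, double interpolation: {rmse_double:.5f}")
```

In this toy setup the second interpolation runs on the coarser 2048 bp grid, so it adds its own smoothing on top of the first step. The size of the effect on real 2D Micro-C maps would depend on the interpolation scheme actually used and on the data.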
