Request for Accessible Variant Benchmark Data or Clarification on eQTL Dataset Curation

Hi AlphaGenome Team,

I am attempting to reproduce the benchmark results from your manuscript. However, I am finding it challenging to curate the benchmark datasets based solely on the description in the paper. It would be extremely helpful if you could provide the benchmark data in an easily accessible format, such as TSV or VCF files.

If providing the full benchmark is not convenient, I would greatly appreciate clarification on some details regarding the eQTL coefficient prediction benchmark. Specifically, I have curated approximately 50,000 data points, which is substantially larger than the n = 17,675 mentioned on page 48 of your manuscript. Below, I outline my curation pipeline and hope you can help identify where the discrepancy might arise.

My Data Curation Pipeline:

Step 1: I downloaded the eQTL causality dataset (97,922 entries) from the storage bucket linked in the Borzoi GitHub repository: https://console.cloud.google.com/storage/browser/borzoi-paper/qtl/eqtl

Step 2: I matched this causality dataset with the SuSiE fine-mapping results provided in the Enformer repository: https://console.cloud.google.com/storage/browser/dm-enformer/data/gtex_fine/susie

This resulted in 195,714 variant/gene/tissue combinations; the count is higher than the 97,922 input entries because some variants are associated with more than one gene.

Step 3: After filtering for PIP > 0.9, I retained 53,316 entries, which is still much larger than the 17,675 mentioned in the AlphaGenome manuscript.
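
For concreteness, here is a minimal pandas sketch of Steps 2 and 3. The column names (`variant`, `tissue`, `gene`, `pip`), the function name `match_and_filter`, and the toy rows are placeholders for illustration, not the actual schema or contents of either bucket.

```python
import pandas as pd


def match_and_filter(causal: pd.DataFrame, susie: pd.DataFrame,
                     pip_threshold: float = 0.9) -> pd.DataFrame:
    """Join causal variants to fine-mapping rows, then keep high-PIP rows."""
    # Step 2: inner join on variant and tissue. A variant that is fine-mapped
    # for several genes yields one output row per gene, so the row count can
    # exceed the number of input causal entries.
    merged = causal.merge(susie, on=["variant", "tissue"], how="inner")
    # Step 3: keep rows whose PIP is strictly greater than the threshold.
    return merged[merged["pip"] > pip_threshold].reset_index(drop=True)


# Toy inputs (hypothetical values).
toy_causal = pd.DataFrame({
    "variant": ["chr1_100_A_G", "chr2_200_C_T"],
    "tissue": ["Liver", "Lung"],
})
toy_susie = pd.DataFrame({
    "variant": ["chr1_100_A_G", "chr1_100_A_G", "chr2_200_C_T"],
    "tissue": ["Liver", "Liver", "Lung"],
    "gene": ["GENE_A", "GENE_B", "GENE_C"],
    "pip": [0.95, 0.40, 0.92],
})

toy_result = match_and_filter(toy_causal, toy_susie)
print(len(toy_result))
```

In the toy data the first variant matches two genes, which is how the join grows from 2 causal entries to 3 combinations before the PIP filter is applied.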

Could you help identify any issues with my pipeline that might explain this discrepancy?

Thank you very much for your time and help.

Follow-up clarification: in the pipeline above I had not subset the data to your test set chromosomes. After subsetting to the test chromosomes, my sample size is n = 19,096, which still differs from your reported 17,675.
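
A minimal sketch of this kind of chromosome subsetting, assuming the chromosome is the first `_`-separated field of the variant ID. The ID format, the `variant` column name, the function name `subset_to_chromosomes`, and the chromosome set used below are all placeholders, not the actual test split.

```python
import pandas as pd


def subset_to_chromosomes(df: pd.DataFrame, chromosomes: set) -> pd.DataFrame:
    """Keep rows whose variant lies on one of the given chromosomes."""
    # Assumption: variant IDs look like "chr1_100_A_G", so the chromosome
    # is the text before the first underscore.
    chrom = df["variant"].str.split("_").str[0]
    return df[chrom.isin(chromosomes)].reset_index(drop=True)


# Toy variant/gene/tissue rows (hypothetical values).
toy_tuples = pd.DataFrame({
    "variant": ["chr1_100_A_G", "chr2_200_C_T", "chr2_300_G_A"],
    "gene": ["GENE_A", "GENE_C", "GENE_D"],
    "tissue": ["Liver", "Lung", "Lung"],
})

# Placeholder chromosome set; substitute the test chromosomes from the paper.
toy_subset = subset_to_chromosomes(toy_tuples, {"chr2"})
print(len(toy_subset))
```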

I have a new question: In your paper, you mention that the eQTL causality benchmark contains variant/gene/tissue tuples. However, the causality benchmark data available at https://console.cloud.google.com/storage/browser/borzoi-paper/qtl/eqtl appears to only include variant/tissue pairs. Could you clarify where the gene information in the causality benchmark comes from?