Val and test in FOLD_1 compared to Borzoi

Hi AlphaGenome team,

Quick clarification question regarding the train/val/test splits exposed by the AlphaGenome API versus the Borzoi splits used in benchmarks.

Using the API, I get the following fold assignments:

from alphagenome.data import fold_intervals
from alphagenome.models import dna_client

for mv in [dna_client.ModelVersion.FOLD_0,
           dna_client.ModelVersion.FOLD_1,
           dna_client.ModelVersion.FOLD_2,
           dna_client.ModelVersion.FOLD_3]:
    train = fold_intervals.get_fold_names(mv, fold_intervals.Subset.TRAIN)
    valid = fold_intervals.get_fold_names(mv, fold_intervals.Subset.VALID)
    test = fold_intervals.get_fold_names(mv, fold_intervals.Subset.TEST)
    print(f"{mv.name}: train={train}, valid={valid}, test={test}")

which yields:

FOLD_0: train=[fold2–fold7], valid=['fold0'], test=['fold1']
FOLD_1: train=[fold0,1,2,5,6,7], valid=['fold3'], test=['fold4']
FOLD_2: train=[fold0,1,3,4,6,7], valid=['fold2'], test=['fold5']
FOLD_3: train=[fold0–fold5], valid=['fold6'], test=['fold7']

In Borzoi, the convention (per the paper / released setup) is split3 = test and split4 = validation, whereas from the API it appears these are swapped (e.g. FOLD_1 uses fold3 as val and fold4 as test).
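To make the apparent swap concrete, here is a minimal sketch that hardcodes the FOLD_1 held-out folds from the output above and compares them with the Borzoi convention as I understand it. The dictionaries and the `roles_swapped` helper are illustrative names of my own, not part of the AlphaGenome API, and the Borzoi assignment is taken from my reading of the paper rather than re-verified in code.

```python
# Held-out folds for the FOLD_1 model, copied from the API output quoted above.
api_fold_1 = {"valid": "fold3", "test": "fold4"}

# Borzoi convention as described in this post (assumption: split3 = test,
# split4 = validation, mapped onto the fold3/fold4 names).
borzoi_fold_1 = {"valid": "fold4", "test": "fold3"}


def roles_swapped(a: dict, b: dict) -> bool:
    """Return True if a's validation fold is b's test fold and vice versa."""
    return (
        a["valid"] == b["test"]
        and a["test"] == b["valid"]
        and a["valid"] != a["test"]
    )


print(roles_swapped(api_fold_1, borzoi_fold_1))
```

Under these assumptions the check prints `True`, i.e. the two held-out folds are the same but their roles are exchanged.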

My question is simply:

  • Is this inversion intentional in AlphaGenome, or should the API folds exactly match Borzoi’s test/validation assignment?

Thanks a lot in advance!

More precisely: for the test evaluation of the fold-wise model compared to Borzoi, was the reported AlphaGenome test score obtained on fold4 or on fold3? (I would expect fold3, so that it is comparable with Borzoi, but judging by what the API returns it may have been fold4.)
Thanks a lot in advance!

It is possible that the validation and test sets were swapped between Borzoi and AlphaGenome (fold3 vs. fold4).

I don’t think this changes any of the results in the paper. The most important thing is that the training set was the same. Furthermore, we performed our model iterations using the fold0 model. The fold1 model was trained without any consideration of the metrics on either its validation or its test set (we also didn’t use any early stopping). This means that both held-out folds should be representative of test-set performance.
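As a small illustration of the training-set point, the sketch below assumes eight folds named fold0 to fold7 and that the training set is every fold not held out, which matches the API output quoted in the question. The `training_folds` helper is a hypothetical name for this example, not an AlphaGenome function.

```python
# Assumption: eight folds in total, named fold0..fold7 as in the quoted output.
ALL_FOLDS = [f"fold{i}" for i in range(8)]


def training_folds(valid: str, test: str) -> list[str]:
    """Return every fold that is not held out for validation or test."""
    held_out = {valid, test}
    return [fold for fold in ALL_FOLDS if fold not in held_out]


# Same two held-out folds, with the validation/test roles exchanged.
api_train = training_folds(valid="fold3", test="fold4")
borzoi_train = training_folds(valid="fold4", test="fold3")

print(api_train)
print(api_train == borzoi_train)
```

Since only the labels on the two held-out folds differ, the training folds come out identical either way.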

Hope this helps!