In the paper it says “Additionally, 64 models were trained using all available reference genome intervals (all-folds)”.
May I ask how you ended up with 64 models from 4-fold cross-validation? Or what are the differences between these 64 models?
Thank you so much!
Hi!
The 64 “all-folds” models are separate from the four cross-validation models. They were trained on all the data without a holdout set.
These 64 models differ only in their random initializations during training, which leads to distinct final models.
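As a minimal sketch of that idea (not the actual training code from the repository), seeding the global RNG before constructing each model gives every member of the ensemble its own weight initialization while keeping architecture and data identical:

```python
# Sketch only: an ensemble of models that differ solely by random seed,
# each intended to be trained on all reference genome intervals.
import torch

def build_model(seed: int) -> torch.nn.Module:
    # Seeding before construction gives each model a distinct weight init.
    torch.manual_seed(seed)
    return torch.nn.Sequential(
        torch.nn.Linear(128, 256),
        torch.nn.ReLU(),
        torch.nn.Linear(256, 1),
    )

# 64 "all-folds" models: same architecture, same data, different seeds.
ensemble = [build_model(seed) for seed in range(64)]
```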
Am I right that the 4-fold cross-validation loop was used to determine the best hyperparameter configuration, and that all 64 “all-folds” models were then trained with this SAME hyperparameter configuration but with different random initializations? Thank you so much!
Yes, all pre-trained (i.e. not distilled) models were obtained with the same set of hyper-parameters and different random seeds. These hyper-parameters were found based only on fold-0 trained models. The design space is very large, so only a few of the configurations we explored are in the ballpark of the optimal ones.
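To make the workflow concrete, here is a hedged sketch of the two stages described above; the function `train_and_score_fold0` and the search space are placeholders, not the repository's API:

```python
# Sketch of the described workflow: tune hyper-parameters on fold-0 only,
# then reuse the single best configuration for all 64 all-folds models.
import itertools

search_space = {
    "learning_rate": [1e-4, 3e-4, 1e-3],
    "dropout": [0.1, 0.2, 0.3],
}

def train_and_score_fold0(config: dict) -> float:
    # Placeholder: in the real pipeline this trains a model on the fold-0
    # training intervals and returns its score on the fold-0 holdout.
    return -abs(config["learning_rate"] - 3e-4) - abs(config["dropout"] - 0.2)

# 1) Hyper-parameter search restricted to fold-0 models.
configs = [dict(zip(search_space, values))
           for values in itertools.product(*search_space.values())]
best_config = max(configs, key=train_and_score_fold0)

# 2) Train the 64 all-folds models with that one configuration,
#    varying only the random seed (training call omitted here).
for seed in range(64):
    pass  # e.g. train_all_folds(best_config, seed=seed)
```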