About the 64 Models in the Teacher Ensemble

In the paper it says “Additionally, 64 models were trained using all available reference genome intervals (all-folds)”.

May I ask how it ended up at 64 from 4-fold cross-validation? And what are the differences between these 64 models?

Thank you so much!

Hi!
The 64 “all-folds” models are separate from the four cross-validation models. They were trained on all the data without a holdout set.

The difference between these 64 models comes from different random initializations during training, leading to distinct final models.
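To make that concrete, here is a minimal sketch of how such a seed-varied ensemble could be produced. The `train_model` function, hyperparameter names, and weight initialization are hypothetical stand-ins for illustration, not the paper's actual code.

```python
import random

NUM_ENSEMBLE_MODELS = 64  # size of the "all-folds" teacher ensemble

# One fixed hyperparameter configuration shared by every ensemble member
# (illustrative names and values, not the paper's actual settings).
HPARAMS = {"learning_rate": 1e-4, "batch_size": 64}

def train_model(hparams, seed):
    """Hypothetical training routine: the seed controls weight
    initialization (and typically data shuffling), so different seeds
    yield distinct final models even with identical hyperparameters."""
    rng = random.Random(seed)
    initial_weights = [rng.gauss(0.0, 0.02) for _ in range(10)]
    # ... real code would build and fit a network on all intervals here ...
    return {"seed": seed, "hparams": hparams, "weights": initial_weights}

# 64 "all-folds" models: same data, same hyperparameters, different seeds.
ensemble = [train_model(HPARAMS, seed=s) for s in range(NUM_ENSEMBLE_MODELS)]
```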

Am I right that the 4-fold cross-validation loop was used to determine the best hyperparameter configuration, and then all 64 “all-folds” models were trained with this SAME hyperparameter configuration but with different random initializations? Thank you so much!

Yes, all pre-trained (i.e. not distilled) models were obtained with the same set of hyper-parameters and different random seeds. These hyper-parameters were found based only on fold-0 trained models. The design space is very large, so only a few configurations are in the ballpark of the optimal ones.
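A rough sketch of that two-stage workflow is below; the candidate configurations and the validation objective are invented purely for illustration, and the real search would evaluate actual fold-0 trained models.

```python
# Hypothetical two-stage workflow: tune on fold-0, then reuse the winner.

def fold0_validation_loss(hparams):
    """Stand-in for training on fold-0 and scoring on its held-out split."""
    return abs(hparams["learning_rate"] - 1e-4)  # toy objective for illustration

# Stage 1: pick hyper-parameters based only on fold-0 trained models.
candidates = [{"learning_rate": lr} for lr in (1e-3, 3e-4, 1e-4, 3e-5)]
best_hparams = min(candidates, key=fold0_validation_loss)

# Stage 2: the chosen configuration is reused for every "all-folds" model;
# only the random seed changes across the 64 training runs.
seeds = list(range(64))
print(best_hparams, seeds[:5])
```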