Hi everyone,
Thank you for the great work! We are currently trying to fine-tune the entire AlphaGenome model on a new dataset using a single H100 GPU. Even with batch_size=1 and gradient accumulation, we run into memory limits almost immediately.
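For context, our training loop uses standard gradient accumulation: run micro-batches of size 1 and average their gradients before taking a single optimizer step. A minimal self-contained sketch with a toy linear model (the model, data, and function names here are stand-ins for illustration, not the actual AlphaGenome API):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the model: one linear layer, pred = x @ w (NOT the
# real AlphaGenome architecture -- just enough to show the technique).
w = rng.normal(size=(4, 1))
xs = rng.normal(size=(8, 4))                 # 8 training samples
ys = xs @ rng.normal(size=(4, 1))            # synthetic targets


def grad_mse(w, x, y):
    """Gradient of mean-squared-error loss for the toy linear model."""
    return 2.0 * x.T @ (x @ w - y) / len(x)


# Gradient accumulation: size-1 micro-batches, gradients scaled by
# 1/accum_steps so the sum matches one full-batch gradient.
accum_steps = len(xs)
accum = np.zeros_like(w)
for i in range(accum_steps):
    x_i, y_i = xs[i : i + 1], ys[i : i + 1]  # batch_size = 1
    accum += grad_mse(w, x_i, y_i) / accum_steps

full_batch = grad_mse(w, xs, ys)             # reference gradient
assert np.allclose(accum, full_batch)        # same optimizer update
```

The accumulated update is mathematically identical to the full-batch one, which is why we expected it to trade memory for time; in practice the activations of a single 1M-bp forward pass already exceed the card's memory.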
Based on the distillation setup described in the paper—where both a frozen teacher and an unfrozen student are loaded and gradients are computed for the student—we expected full fine-tuning to be possible on comparable hardware.
Inference works without any issues. However, as soon as we enable gradient computation on a 1 Mbp input sequence, we immediately hit out-of-memory errors, even when training with only a single head.
Is full fine-tuning on sequences of this length feasible on a single H100? If so, what techniques or adjustments are typically required to avoid OOM in this setting?
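One technique we have been considering is activation (gradient) checkpointing: store activations only at segment boundaries during the forward pass and recompute the rest inside each segment during backward, trading extra compute for memory. A framework-agnostic sketch with a toy chain of layers (the layer and function names are illustrative, not AlphaGenome's):

```python
import numpy as np


def layer(h):
    """One toy layer: elementwise tanh (stands in for a real block)."""
    return np.tanh(h)


def layer_grad(h_in, g_out):
    """Backward through one layer, given that layer's INPUT activation."""
    return g_out * (1.0 - np.tanh(h_in) ** 2)


def backward_checkpointed(x, n_layers, seg, g_out):
    """Backward pass storing only every `seg`-th activation.

    Memory: O(n_layers / seg) stored activations instead of O(n_layers),
    at the cost of one extra forward pass over each segment.
    """
    # Forward: keep activations only at segment boundaries.
    ckpts = {0: x}
    h = x
    for k in range(n_layers):
        h = layer(h)
        if (k + 1) % seg == 0:
            ckpts[k + 1] = h
    # Backward: recompute activations inside each segment on demand.
    g = g_out
    for k in reversed(range(n_layers)):
        start = (k // seg) * seg
        h = ckpts[start]
        for _ in range(start, k):      # recompute layer k's input
            h = layer(h)
        g = layer_grad(h, g)
    return g


def backward_full(x, n_layers, g_out):
    """Reference backward that stores every activation."""
    acts = [x]
    for _ in range(n_layers):
        acts.append(layer(acts[-1]))
    g = g_out
    for k in reversed(range(n_layers)):
        g = layer_grad(acts[k], g)
    return g


x = np.linspace(-1.0, 1.0, 5)
g_ref = backward_full(x, n_layers=8, g_out=np.ones_like(x))
g_ck = backward_checkpointed(x, n_layers=8, seg=4, g_out=np.ones_like(x))
assert np.allclose(g_ref, g_ck)        # same gradients, less memory
```

We would be glad to know whether this (or something like mixed precision) is the intended route here, or whether full fine-tuning at this sequence length simply needs more than one GPU.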
Thanks in advance for any guidance.
Best,
Moon