Is full fine-tuning feasible on a single H100?

Hi everyone,

Thank you for the great work! We're currently trying to fine-tune the entire AlphaGenome model on a new dataset using a single H100 GPU. Even with batch_size=1 and gradient accumulation, we run into memory limits very quickly.

Based on the distillation setup described in the paper—where both a frozen teacher and an unfrozen student are loaded and gradients are computed—we expected full fine-tuning to be possible.

Inference works without any issues. However, once we enable gradient computation on a 1M-bp input sequence, we immediately hit out-of-memory errors, even when training with only a single head.

Is full fine-tuning on sequences of this length feasible on a single H100? If so, what techniques or adjustments are typically required to avoid OOM in this setting?
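For anyone hitting the same wall: gradient accumulation only reduces the effective batch dimension, not the per-sample activation memory, which is what dominates on a 1M-bp input, so checkpointing/rematerialization is usually the relevant lever. As a framework-agnostic sketch (plain NumPy, toy linear model, nothing to do with AlphaGenome's actual API), here is why accumulated micro-batch gradients match the full-batch gradient, which is what makes batch_size=1 plus accumulation a valid substitute for a large batch in the first place:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 4))   # 8 samples, 4 features (toy data)
y = rng.normal(size=(8,))
w = rng.normal(size=(4,))

def mse_grad(Xb, yb, w):
    # Gradient of mean((Xb @ w - yb)^2) with respect to w.
    n = len(yb)
    return 2.0 / n * Xb.T @ (Xb @ w - yb)

# Full-batch gradient in one shot (high peak memory in a real model).
full = mse_grad(X, y, w)

# Same gradient accumulated over micro-batches of size 2,
# each weighted by its share of the full batch.
acc = np.zeros_like(w)
for i in range(0, len(y), 2):
    acc += mse_grad(X[i:i + 2], y[i:i + 2], w) * 2 / len(y)

assert np.allclose(full, acc)
```

The catch is that each micro-batch's forward pass still materializes activations for the entire sequence, so accumulation alone cannot fix OOM caused by sequence length; rematerialization (recomputing activations during the backward pass) is the standard trade of compute for memory in that situation.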

Thanks in advance for any guidance.

Best,
Moon

Is it legal to fine-tune AlphaGenome at all? From the license, it seems the answer is no.