Recapitulating tissue-specific patterns

Hi everyone!

I'm trying to understand why it's challenging for AlphaGenome to capture tissue-level patterns really well, especially considering the training signals are available. I had the following questions:

  1. If training were extended beyond 15,000 batches, what would the impact be on cell/tissue-level tracks vs. other tasks? Could it be that more training is required to fit tissue-level patterns specifically?

  2. If additional training doesn't help, is it possible that the model is latching onto causally irrelevant signals to minimize loss? Perhaps due to simplicity bias, or because the data is insufficient to learn the causally relevant (but more complex) signals?

Thank you for your time!


Hello,

Thank you for reaching out, and apologies for the delayed response.

  1. The training duration of 15,000 steps was selected to balance performance on reference genome prediction tasks against zero-shot variant effect prediction tasks, as evaluated on a validation data subset. Training for longer would improve reference genome predictions, including the tissue-level patterns you mention.
  2. It is very plausible that the model spends some representation capacity reproducing causally irrelevant signals. We have not investigated this, but, for example, removing assay-specific enzyme biases from the training data is a promising and exciting research direction (see e.g. ChromBPNet). As you mention, increasing the amount of data could also help.
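One way to probe this kind of question empirically is to compare per-track agreement between predictions and held-out targets, grouped by track type (tissue-level vs. other assays): if some tracks plateau at lower correlation while overall loss keeps improving, that capacity may be going into confound signals. Below is a minimal, hypothetical sketch of a per-track Pearson correlation, using NumPy on toy data; it is not part of AlphaGenome's tooling, and the array shapes and names are assumptions for illustration.

```python
import numpy as np


def per_track_pearson(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Pearson r for each track, given (positions, tracks) arrays."""
    pred_c = pred - pred.mean(axis=0)
    targ_c = target - target.mean(axis=0)
    num = (pred_c * targ_c).sum(axis=0)
    den = np.sqrt((pred_c ** 2).sum(axis=0) * (targ_c ** 2).sum(axis=0))
    return num / den


# Toy example: 1,000 genomic positions, 3 hypothetical tracks with
# increasing amounts of prediction noise.
rng = np.random.default_rng(0)
target = rng.normal(size=(1000, 3))
noise = rng.normal(size=(1000, 3))
pred = target + np.array([0.1, 1.0, 3.0]) * noise

r = per_track_pearson(pred, target)  # one correlation per track
```

Grouping such per-track correlations by tissue vs. assay type would let you see whether tissue-level tracks lag the others on validation data.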

Thank you!
