Interpreting prediction inconsistency across input interval lengths

Hi everyone,

I am trying to test-run and understand AlphaGenome predictions for TG (a gene selectively expressed in the thyroid gland) in different tissue contexts and with different interval sizes. I have found a discrepancy: summing predictions from 1KB intervals tiling a 1MB region gives a different value than a single prediction over the whole 1MB interval. This led me to believe that regulatory elements/promoter binding motifs must fall inside the prediction interval for the prediction to be accurate, which raises a couple of questions:

  1. Would a 1MB interval prediction always be better than any smaller interval prediction, since it includes the most potential regulatory motifs? If so, we could narrow down the window for plotting instead of predicting on a shorter interval. Which prediction interval would be more "reliable", or more accurate?

  2. How should I interpret the y-axis of the graph? Is the prediction a predicted "raw count" of RNA-seq reads, or a normalised count? If it is normalised, what is it normalised against? Are predictions made on different interval sizes comparable?

Thanks so much 🙂

Best wishes,

Ken


Hi Ken,

  1. A 1MB interval prediction is more reliable. As you say, the larger window captures more potential regulatory context, which is why summing smaller interval predictions won’t equal the prediction from a single large interval. Also, the model’s performance may improve when the prediction interval matches the 1MB sequence length it was trained on. Your best approach is to generate the prediction on the 1MB interval and then narrow the window for plotting.
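The "predict wide, plot narrow" workflow can be sketched as plain interval arithmetic: resize the gene's coordinates up to the model's ~1MB input width, then slice the predicted track back down to the region you want to plot. This is a minimal sketch with made-up coordinates and random data standing in for the model output; `resize_interval` is a hypothetical helper, not part of the AlphaGenome API.

```python
import numpy as np

def resize_interval(start: int, end: int, width: int) -> tuple[int, int]:
    """Return a `width`-bp interval centred on the input interval,
    mirroring the idea of resizing a small region up to the model's
    1MB input length before predicting."""
    centre = (start + end) // 2
    new_start = centre - width // 2
    return new_start, new_start + width

# Hypothetical coordinates for the TG locus; not real genome positions.
gene_start, gene_end = 1_500_000, 1_510_000
win_start, win_end = resize_interval(gene_start, gene_end, 2**20)  # ~1MB

# Stand-in for the model's base-resolution RNA-seq coverage track
# over the 1MB window (random numbers in place of real predictions).
track = np.random.rand(win_end - win_start)

# Narrow down for plotting: slice the 1MB prediction to the gene body
# instead of re-running the model on a shorter interval.
lo, hi = gene_start - win_start, gene_end - win_start
plot_values = track[lo:hi]
```

The key point is that the slice still reflects a prediction conditioned on the full 1MB of regulatory context, which a direct prediction on the 10KB sub-interval would lack.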

  2. The y-axis represents the predicted read coverage for RNA-seq. The training data, and therefore the model’s predictions, are normalized to a common factor of 1 million reads multiplied by a common read length of 100. The units do not depend on the interval size.
