Validity of raw vs. quantile scores for promoter variant effects when gene body extends beyond 16 kb sequence window

We are using the batch variant scoring function to predict the effects of promoter variants. For our analyses, we set sequence_length = 16 kb, so each variant is evaluated within a 16 kb sequence window. Our goal is to estimate the effect of a promoter variant on the closest gene, where the gene’s promoter and TSS are within the 16 kb window. We are mainly using the RNA-seq predictions for this purpose.

However, in some cases, the target gene’s full gene body extends beyond the 16 kb input window. Given this setup, we wanted to ask:

Are the raw RNA-seq scores or the quantile scores still valid for interpreting the variant’s effect on the closest gene, even if the full gene body is not contained within the 16 kb sequence window?

More specifically, should we be concerned that the RNA-seq prediction may be incomplete or biased when the promoter/TSS is inside the window but the gene body extends outside of it? In this case, is one score type, the raw score or the quantile score, more appropriate or reliable for interpretation?

Thank you very much for your guidance.

Hi there,

Thank you for reaching out. Regarding your questions on sequence length and scoring interpretation, please see the guidance below:

I recommend using the full 1 Mb context window for your analyses. Only bases within the context window can be predicted, so to predict transcription of the target gene effectively, the full gene should be included in the context window if possible. Model performance is also best when using the full 1 Mb.
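As a rough illustration of the containment point, the sketch below checks whether a gene's full span falls inside a window centered on the variant. It is plain Python and does not call the scoring library. The helper names, the coordinates, and the exact window sizes (16,384 bp and 1,048,576 bp) are assumptions made for the example.

```python
def centered_window(variant_pos: int, length: int) -> tuple[int, int]:
    """Return a half-open [start, end) window of `length` bp centered on the variant."""
    start = variant_pos - length // 2
    return start, start + length


def gene_fully_contained(gene_start: int, gene_end: int, window: tuple[int, int]) -> bool:
    """True if the whole gene span [gene_start, gene_end) lies inside the window."""
    win_start, win_end = window
    return win_start <= gene_start and gene_end <= win_end


# Hypothetical coordinates: a promoter variant near the TSS of a ~45 kb gene.
variant_pos = 1_000_000
gene_start, gene_end = 999_500, 1_045_000

# Assumed window sizes for "16 kb" and "1 Mb".
small = centered_window(variant_pos, 16_384)
large = centered_window(variant_pos, 1_048_576)

print(gene_fully_contained(gene_start, gene_end, small))  # False: gene body runs past the window
print(gene_fully_contained(gene_start, gene_end, large))  # True: whole gene is inside
```

A check like this can flag which variants have a target gene that is truncated by the shorter window before their scores are interpreted.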

The raw score is a measure of the difference between the reference and alternative predictions. It is calculated using various scorer methods that are specific to the modality of interest. The quantile score is a standardized measure of predicted impact across all modalities and biosamples, calculated by calibrating raw variant scores against a background distribution of 348k common human SNPs. The quantile score represents a variant's percentile rank relative to this common-variant baseline and can be used to compare variant effects between biosamples and modalities.
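To make the idea of a percentile rank concrete, here is a minimal sketch of an empirical quantile computed against a background distribution. It is a generic illustration, not the library's actual calibration procedure, which may differ (for example in how signed scores are handled). The function name and the ten background values are made up for the example.

```python
from bisect import bisect_right


def empirical_quantile(raw_score: float, background_sorted: list[float]) -> float:
    """Fraction of background scores that are less than or equal to raw_score."""
    return bisect_right(background_sorted, raw_score) / len(background_sorted)


# Toy background: stands in for raw scores of common variants (hypothetical values).
background = sorted([-0.8, -0.3, -0.1, 0.0, 0.0, 0.05, 0.2, 0.4, 0.9, 1.5])

q = empirical_quantile(0.4, background)
print(q)  # 0.8: the score is at or above 8 of the 10 background scores
```

Because the rank is taken relative to a background for the same scorer, the resulting value is on a common scale even when raw scores from different modalities are not.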

For example, if you are interested in how a variant affects transcription, you may use the recommended RNA scorer, which will yield a raw score and a quantile score for each gene within your context window across biosamples. If you choose to include other scoring methods (e.g., for ATAC), the quantile score will allow for comparison between the modalities.

Kind regards,

Nicolene