GeneMaskLFCScorer AggregationType

[CenterMaskScorer(requested_output=ATAC, width=501, aggregation_type=DIFF_LOG2_SUM),
 CenterMaskScorer(requested_output=DNASE, width=501, aggregation_type=DIFF_LOG2_SUM),
 CenterMaskScorer(requested_output=CHIP_TF, width=501, aggregation_type=DIFF_LOG2_SUM),
 CenterMaskScorer(requested_output=CHIP_HISTONE, width=2001, aggregation_type=DIFF_LOG2_SUM),
 CenterMaskScorer(requested_output=CAGE, width=501, aggregation_type=DIFF_LOG2_SUM),
 CenterMaskScorer(requested_output=PROCAP, width=501, aggregation_type=DIFF_LOG2_SUM),
 GeneMaskLFCScorer(requested_output=RNA_SEQ),
 GeneMaskSplicingScorer(requested_output=SPLICE_SITES, width=None),
 GeneMaskSplicingScorer(requested_output=SPLICE_SITE_USAGE, width=None),
 SpliceJunctionScorer(),
 PolyadenylationScorer()]

These are the recommended scorers. Notably, the last five don’t have an aggregation_type. Do you have any recommendations on which aggregation_type is best? I’m asking because when I look at the contingency table of gene strand and track strand, it’s not immediately obvious what is going on within each track. I would also like to know what track_strand = ‘.’ indicates.


Hi @Joshua_Park, welcome to the community!

The GeneMaskLFCScorer, SpliceJunctionScorer and PolyadenylationScorer don’t have a configurable aggregation, which is why no aggregation_type is set for them: only the aggregations we used in the paper are available (see supplementary figures 12 and 13 for schematics of the aggregation applied by each scorer).

As for the contingency table, the . strand indicates an unstranded track. If you’re using the tidy_scores function, it removes scores with mismatched gene and track strands by default, which would be why you have zeros for the mismatched +/- entries.
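To make the zeros concrete, here is a toy sketch of that kind of strand filter. It is not the library's implementation: the rows and the helper `strands_compatible` are made up for illustration, and it assumes unstranded (`.`) tracks are kept for genes on either strand.

```python
from collections import Counter

# Toy rows: one per (gene_strand, track_strand) combination.
# These are illustrative stand-ins, not real AlphaGenome output.
rows = [
    {"gene_strand": "+", "track_strand": "+"},
    {"gene_strand": "+", "track_strand": "-"},
    {"gene_strand": "+", "track_strand": "."},
    {"gene_strand": "-", "track_strand": "+"},
    {"gene_strand": "-", "track_strand": "-"},
    {"gene_strand": "-", "track_strand": "."},
]


def strands_compatible(row):
    """Hypothetical filter: keep unstranded tracks and same-strand pairs."""
    return row["track_strand"] == "." or row["track_strand"] == row["gene_strand"]


kept = [r for r in rows if strands_compatible(r)]

# Contingency table of gene strand (rows) by track strand (columns).
table = Counter((r["gene_strand"], r["track_strand"]) for r in kept)
for gene_strand in "+-":
    print(gene_strand, [table[(gene_strand, track)] for track in "+-."])
```

After filtering, the `+`/`-` and `-`/`+` cells are empty while the `.` column stays populated for both gene strands.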

Hope that helps!


Hi @tward,

Thank you so much for your answer. I understand why aggregation methods aren’t provided. But, if I did still want to aggregate, do you have a suggested approach? Where I’m having trouble is extending the batch variant scoring tutorial: Batch variant scoring — AlphaGenome

After scoring multiple variants, I’m not sure how to compare them to each other. My initial thought was to get a single scalar value per output_type for every variant. That is why I am interested in aggregating all gene scores within each track into a single track score, and then computing a single ‘output_type’ score. Has this been attempted previously?

You could take the ABS_MAX over the gene axis to obtain the largest score across genes, and then do the same over the track axis. It’s not something we typically do, but it should give you a single score per output type.

You might also want to consider using the quantile scores in order to better compare scores across tracks and variant scorers, depending on the kind of analysis you’re performing.
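As a rough sketch of the two-step ABS_MAX idea, assuming the scores are already flattened into one record per (variant, output type, track, gene): the field names below (`variant_id`, `output_type`, `track_name`, `gene_id`, `raw_score`, `quantile_score`) and the numbers are assumptions for illustration, so check them against your own table. The `abs_max` helper keeps the sign of the largest-magnitude value, which is one possible reading of ABS_MAX.

```python
from collections import defaultdict

# Toy records standing in for rows of a tidy score table.
# Field names and values are assumed for illustration only.
rows = [
    {"variant_id": "v1", "output_type": "RNA_SEQ", "track_name": "t1", "gene_id": "g1", "raw_score": 0.2, "quantile_score": 0.40},
    {"variant_id": "v1", "output_type": "RNA_SEQ", "track_name": "t1", "gene_id": "g2", "raw_score": -1.5, "quantile_score": -0.99},
    {"variant_id": "v1", "output_type": "RNA_SEQ", "track_name": "t2", "gene_id": "g1", "raw_score": 0.7, "quantile_score": 0.90},
    {"variant_id": "v2", "output_type": "RNA_SEQ", "track_name": "t1", "gene_id": "g1", "raw_score": 0.05, "quantile_score": 0.10},
    {"variant_id": "v2", "output_type": "RNA_SEQ", "track_name": "t1", "gene_id": "g2", "raw_score": 0.3, "quantile_score": 0.60},
    {"variant_id": "v2", "output_type": "RNA_SEQ", "track_name": "t2", "gene_id": "g1", "raw_score": -0.1, "quantile_score": -0.20},
]


def abs_max(values):
    """Return the value with the largest magnitude, keeping its sign."""
    return max(values, key=abs)


def aggregate(rows, score_key="raw_score"):
    """Collapse genes, then tracks, to one score per (variant, output type)."""
    # Step 1: gene axis -> one score per (variant, output type, track).
    per_track = defaultdict(list)
    for r in rows:
        key = (r["variant_id"], r["output_type"], r["track_name"])
        per_track[key].append(r[score_key])

    # Step 2: track axis -> one score per (variant, output type).
    per_output = defaultdict(list)
    for (variant, output, _track), scores in per_track.items():
        per_output[(variant, output)].append(abs_max(scores))

    return {key: abs_max(scores) for key, scores in per_output.items()}


raw_summary = aggregate(rows)
quantile_summary = aggregate(rows, score_key="quantile_score")
print(raw_summary)
print(quantile_summary)
```

Passing `score_key="quantile_score"` runs the same reduction on quantile scores, which may be the safer choice when the tracks being collapsed have very different raw-score ranges. If the scores are in a pandas DataFrame, the same two steps map onto two successive group-by reductions.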

Hope this helps!
