Reproducing TraitGym results

Thank you for your great contribution to the field! I’m trying to reproduce the TraitGym results:

  • I was able to score variants using score_variant and the recommended scorers, but did not get close to the reported performance.
  • I then tried to reproduce the TraitGym protocol described in the supplement, using the lower-level predict_sequence. However, it was noticeably slow and threw errors related to data transfer quotas.

It would be amazing if you could support the “TraitGym” protocol (e.g. L2 norm, reverse-complement averaging) in a manner similar to the score_variant API.

Hi @Gonzalo_Benegas,

Thanks for the question! Yes, we will add support for the L2_DIFF_LOG1P aggregation type, which should be sufficient to reproduce the paper’s TraitGym results. Reverse complement should already be supported by passing an interval on the negative strand.

Great to hear, thank you very much!

Hello!

Thanks for the recent addition of L2_DIFF_LOG1P aggregation! I still have some questions about the scorer:

Specifically, for each track predicted by a model, we first computed the predicted log-fold change in activity per position (or bin) due to the variant and then calculated the L2 norm across the sequence.

Does this mean you took L2 norm across the entire 1Mb for all assay types? Or did you use a center mask for some assays?

I’m guessing the protocol cannot be reproduced with CenterMaskScorer as it seems to support a max width of 200kb, which would probably lead to underperformance for RNA-seq.

Thank you for your help!

Apologies for the slow reply, I’ve been OOO.

Yes, I believe we took the L2 norm over the entire 1Mb region. I hadn’t noticed that we don’t support requests for full-width center mask scores; I’ll get that added ASAP.
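
As a rough illustration of what this aggregation computes, here is a minimal sketch in plain Python. It assumes, based on the name L2_DIFF_LOG1P and the description quoted above, that the score for one track is the L2 norm over positions of log1p(ALT) - log1p(REF). The function names l2_diff_log1p and strand_averaged_score are hypothetical and are not part of the alphagenome package; the averaging helper only mirrors the reverse-complement averaging mentioned earlier in the thread.

```python
import math


def l2_diff_log1p(ref_track, alt_track):
    """L2 norm over positions of log1p(alt) - log1p(ref) for a single track.

    Assumption: this mirrors what the L2_DIFF_LOG1P aggregation computes;
    it is a sketch, not the library's implementation.
    """
    if len(ref_track) != len(alt_track):
        raise ValueError("ref and alt tracks must have the same length")
    squared_diffs = (
        (math.log1p(alt) - math.log1p(ref)) ** 2
        for ref, alt in zip(ref_track, alt_track)
    )
    return math.sqrt(sum(squared_diffs))


def strand_averaged_score(forward_score, reverse_score):
    """Average of the scores obtained on the forward and reverse strands."""
    return 0.5 * (forward_score + reverse_score)
```

With no center mask, the sum runs over every position (or bin) in the interval, which is why the width of the mask matters for assays with long-range signal such as RNA-seq.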

That would be great, thanks a lot!

Hi @Gonzalo_Benegas,

With v0.2.0 you should now be able to compute full-width center mask scores by passing width=None to the center mask scorer, e.g.:

from alphagenome.data import genome
from alphagenome.models import dna_client
from alphagenome.models import variant_scorers

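# Replace 'API_KEY' with your own API key.
model = dna_client.create('API_KEY')

variant = genome.Variant.from_str('chr1:10000:A>G')
# Resize the interval around the variant to 2**20 bp (about 1Mb).
interval = variant.reference_interval.resize(2**20)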

model.score_variant(
    interval,
    variant,
    variant_scorers=[
        variant_scorers.CenterMaskScorer(
            requested_output=dna_client.OutputType.ATAC,
            width=None,
            aggregation_type=variant_scorers.AggregationType.L2_DIFF_LOG1P,
        )
    ],
)

Do let us know if we’ve missed anything else!

Perfect, thank you very much!