Hi, I want to calculate raw score for ATAC using different window width. The scorer function in AlphaGenome uses 501 width, so I thought to write my own calculation. But I want to check my logic (using 501 width) against what the scorer returns but it does not match.
- Using score variant:
# @title Score variant { run: "auto" }
organism = 'human' # @param ["human", "mouse"] {type:"string"}
organism_map = {
'human': dna_client.Organism.HOMO_SAPIENS,
'mouse': dna_client.Organism.MUS_MUSCULUS,
}
organism = organism_map[organism]
# @markdown Specify the variant:
variant_chromosome = 'chr22' # @param { type:"string" }
variant_position = 36201698 # @param { type:"integer" }
variant_reference_bases = 'A' # @param { type:"string" }
variant_alternate_bases = 'C' # @param { type:"string" }
variant = genome.Variant(
chromosome=variant_chromosome,
position=variant_position,
reference_bases=variant_reference_bases,
alternate_bases=variant_alternate_bases,
)
# @markdown Specify length of sequence around variant to predict:
sequence_length = '1MB' # @param ["2KB", "16KB", "100KB", "500KB", "1MB"] { type:"string" }
sequence_length = dna_client.SUPPORTED_SEQUENCE_LENGTHS[
f'SEQUENCE_LENGTH_{sequence_length}'
]
interval = variant.reference_interval.resize(sequence_length)
# @markdown Additional settings:
variant_scores = dna_model.score_variant(
interval=interval,
variant=variant,
variant_scorers=list(variant_scorers.RECOMMENDED_VARIANT_SCORERS.values()),
)
df_scores = variant_scorers.tidy_scores(variant_scores)
row_series = df_scores.iloc[[0]]
row_df = row_series.reset_index()
row_df[['variant_id','raw_score','ontology_curie','output_type']]
| variant_id | raw_score | ontology_curie | output_type | |
|---|---|---|---|---|
| 0 | chr22:36201698:A>C | -0.100227 | CL:0000084 | ATAC |
This is gives raw_score = -0.100227
- Using Predict Variant
# @markdown Additional settings:
sequence_length = '1MB' # @param ["2KB", "16KB", "100KB", "500KB", "1MB"] { type:"string" }
sequence_length = dna_client.SUPPORTED_SEQUENCE_LENGTHS[
f'SEQUENCE_LENGTH_{sequence_length}'
]
interval = variant.reference_interval.resize(sequence_length)
# @markdown Specify the variant:
variant_chromosome = 'chr22' # @param { type:"string" }
variant_position = 36201698 # @param { type:"integer" }
variant_reference_bases = 'A' # @param { type:"string" }
variant_alternate_bases = 'C' # @param { type:"string" }
variant = genome.Variant(
chromosome=variant_chromosome,
position=variant_position,
reference_bases=variant_reference_bases,
alternate_bases=variant_alternate_bases,
)
variant_output = dna_model.predict_variant(
interval=interval,
variant=variant,
requested_outputs = [dna_client.OutputType.ATAC],
ontology_terms=['CL:0000084']
)
import numpy as np
variant_output = dna_model.predict_variant(
interval=interval,
variant=variant,
requested_outputs = [dna_client.OutputType.ATAC],
ontology_terms=['CL:0000084']
)
ref = variant_output.reference.atac.values
alt = variant_output.alternate.atac.values
snp_pos = 36201698
half = 501 // 2 # 250
idx = snp_pos - interval.start
start = max(0, idx - half)
end = min(len(ref), idx + half +1)
start,end
ref_window = ref[start:end]
alt_window = alt[start:end]
# Sum over spatial axis (axis=0 if multi-track)
ref_sum = np.sum(ref_window, axis=0)
alt_sum = np.sum(alt_window, axis=0)
score = np.log2((alt_sum)) - np.log2((ref_sum))
print(score)
This is gives raw_score = -0.10555601
Have I done something incorrectly when I calculate this manually in the second step ? I want to make sure my calculation matches. Is there an eps value used?
Thanks,
Sanjana