Hello, thanks for releasing this model, very excited to play around with it!
Is there an easy way to find the most relevant ontology CURIE for a new experimental dataset? For example, let’s say there is a new experiment on ENCODE and I want to find the most relevant prediction tracks. How do I go about it? I see that there is a descriptions field that I can read and manually match to the closest CURIE, but I am wondering whether there is an easy programmatic way to do this.
Thank you so much in advance!
Thank you for your question and sorry for the delayed response!
Currently, AlphaGenome does not offer a built-in programmatic method to automatically map new experimental data to the most relevant ontology CURIE. The easiest way to view the available ontologies is using the tissue ontology colab.
For automation, you could try pulling the CURIE from the JSON associated with the ENCODE accession:
    import requests

    def get_encode_ontology_id(accession):
        # The ENCODE API returns JSON metadata for the experiment
        url = f"https://www.encodeproject.org/experiments/{accession}/?format=json"
        response = requests.get(url, headers={"accept": "application/json"}, timeout=30)
        response.raise_for_status()  # fail loudly on an unknown accession or HTTP error
        # Extract the biosample ontology CURIE (e.g., 'UBERON:0002048')
        return response.json()['biosample_ontology']['term_id']

    # Example
    curie = get_encode_ontology_id('ENCSR473KVF')
    print(f"Detected CURIE: {curie}")
Then you just need to pass this CURIE to AlphaGenome:
    output = model.predict_interval(
        interval=interval,
        requested_outputs=[dna_client.OutputType.DNASE],
        ontology_terms=[curie],  # <--- here
    )
This may fail if the experiment uses a specific child term, whereas only the parent term is available in AlphaGenome. For that, you need a semantic fuzzy-matching method based on the ontology hierarchy, for example one that walks up from the child term to its nearest available ancestor. I believe you can consider tools like Pronto for this.
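As a rough illustration of the hierarchy-based fallback, here is a minimal sketch in plain Python. It is not part of AlphaGenome or Pronto: the function name `nearest_supported_curie`, the `parents` mapping (child CURIE to its direct parents), and the `EX:` identifiers are all made up for the example. In practice the parent mapping would have to be built from the real ontology (for instance by loading it with Pronto), and `supported` would be the set of CURIEs that AlphaGenome actually exposes.

```python
from collections import deque


def nearest_supported_curie(curie, parents, supported):
    """Return the closest term at or above `curie` that is in `supported`.

    Searches breadth-first up the hierarchy, so the term itself is preferred,
    then direct parents, then grandparents, and so on. Returns None if no
    ancestor is supported.
    """
    queue = deque([curie])
    seen = {curie}
    while queue:
        term = queue.popleft()
        if term in supported:
            return term
        for parent in parents.get(term, ()):
            if parent not in seen:  # a term can be reachable via several paths
                seen.add(parent)
                queue.append(parent)
    return None


# Toy hierarchy with made-up identifiers: EX:0003 -> EX:0002 -> EX:0001
parents = {
    "EX:0003": ["EX:0002"],
    "EX:0002": ["EX:0001"],
    "EX:0001": [],
}
supported = {"EX:0001", "EX:0002"}  # stand-in for the terms the model offers

print(nearest_supported_curie("EX:0003", parents, supported))  # EX:0002
```

The returned CURIE could then be passed as the ontology term in place of the one taken from ENCODE. Note that the nearest ancestor may still be a much broader tissue than the original sample, so it is worth checking how far up the match landed.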
Best
Amanda