Issue with remote genome fasta files

Hi,

I am running into an issue where model creation fails because the genomic FASTA file is in a remote location that I assume my script cannot access. It seems that when creating the model, the package tries to open a remote file on the googleapis website, and that prevents me from loading the model. This is the relevant line, I think.

My script is just a trial taken from the Hugging Face website for now, and looks something like this:

from alphagenome.data import genome
from alphagenome.visualization import plot_components
from alphagenome_research.model import dna_model
import matplotlib.pyplot as plt

# dna_model.OrganismSettings

import pdb

model = dna_model.create_from_huggingface('all_folds')

pdb.set_trace()

The error looks something like this:

  File "/home/david.mas/sandbox/20260311_agtmp/.venv/lib/python3.12/site-packages/pyfaidx/__init__.py", line 473, in __init__
    raise FastaNotFoundError("Cannot read FASTA from OpenFile %s" % filename)
pyfaidx.FastaNotFoundError: Cannot read FASTA from OpenFile <OpenFile 'https://storage.googleapis.com/alphagenome/reference/gencode/hg38/GRCh38.p13.genome.fa'>

However, I have verified that the file is accessible from my PC, so I would not say it is a network issue.

Just to clarify my setup: I am running alphagenome on an HPC cluster (alphagenome_research was installed with the uv package manager from the alphagenome_research git repository).
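On HPC clusters, compute nodes often have no outbound internet access even when a login node or a personal machine does, so it may be worth testing reachability from the node that actually runs the job. Below is a minimal sketch using only the standard library. The helper name `is_reachable` is hypothetical and not part of alphagenome; the URL is the one from the traceback above.

```python
import urllib.error
import urllib.request

# URL copied from the traceback; everything else here is illustrative.
FASTA_URL = (
    "https://storage.googleapis.com/alphagenome/reference/"
    "gencode/hg38/GRCh38.p13.genome.fa"
)


def is_reachable(url: str, timeout: float = 10.0) -> bool:
    """Return True if a HEAD request to `url` succeeds from this machine."""
    try:
        request = urllib.request.Request(url, method="HEAD")
        # urlopen raises HTTPError for 4xx/5xx responses, so reaching the
        # body of the `with` block means the server answered successfully.
        with urllib.request.urlopen(request, timeout=timeout):
            return True
    except (urllib.error.URLError, OSError, ValueError):
        # Covers DNS failures, refused connections, timeouts, HTTP errors
        # (HTTPError subclasses URLError) and malformed URLs.
        return False
```

Running `print(is_reachable(FASTA_URL))` inside a batch job, rather than on the login node, shows whether the compute node itself can reach the file.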

This is obviously something wrong with my setup. I was just wondering if anyone has encountered a similar issue, or if there is a way to bypass this with local files?

Hi There!

Thanks for reaching out.

Regarding the issue with the remote FASTA file access, it is recommended to download the FASTA files and GTFs to your local machine.
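As a rough sketch of that download step, the snippet below streams a remote file to disk with the standard library and skips the download when the file is already present. The function name `download_if_missing` is hypothetical. Only the FASTA URL is known from the traceback above; the locations of the other resource files would need to be looked up in dna_model.py. It should be run from a machine that has internet access, such as a login node.

```python
import shutil
import urllib.request
from pathlib import Path


def download_if_missing(url: str, dest: Path) -> Path:
    """Stream `url` to `dest` unless `dest` already exists."""
    dest = Path(dest)
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary name first so an interrupted download
    # is never mistaken for a complete file.
    partial = dest.with_name(dest.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(dest)
    return dest
```

For the human genome this could be called with the URL from the traceback and a destination such as `Path("/path/to/alphagenome_resources/HUMAN/GRCh38.p13.genome.fa")`, where the directory layout is only an example.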

To bypass the network issues, you can then pass in custom OrganismSettings to point directly to these downloaded local files. You can find the relevant implementation details for passing these settings in the dna_model.py source code here: alphagenome_research/src/alphagenome_research/model/dna_model.py at main · google-deepmind/alphagenome_research · GitHub

Kind regards,
Tumi

Thanks Tumi, I am downloading the files now and will check. By any chance, do you have an example of how to define a new OrganismSettings with the local paths?

Okay, I think I managed to make it work. Here is my snippet for future reference.

## imports
import jax
from alphagenome.data import genome
from alphagenome_research.model import dna_model
from alphagenome.models import dna_client
#from alphagenome.models.dna_output import OutputType
from pathlib import Path

# params
params_device = "cpu"
params_resourcespath = "/path/to/alphagenome_resources"

# ++++ LOCAL path definitions ++++ 
assert Path(params_resourcespath + "/HUMAN/GRCh38.p13.genome.fa").exists(), "FASTA file not found"
## check if the fai file exists, if not create it
if not Path(params_resourcespath + "/HUMAN/GRCh38.p13.genome.fa.fai").exists():
    print("FASTA index file not found, creating it...")
    import subprocess
    subprocess.run(["samtools", "faidx", params_resourcespath + "/HUMAN/GRCh38.p13.genome.fa"], check=True)

assert Path(params_resourcespath + "/HUMAN/gencode.v46.annotation.gtf.gz.feather").exists(), "GTF feather file not found"
assert Path(params_resourcespath + "/HUMAN/polyadb_human_v3_exon3_contiguous_gtfv46.feather").exists(), "PAS feather file not found"
assert Path(params_resourcespath + "/HUMAN/gencode.v46.splice_sites_starts.feather").exists(), "Splice site starts feather file not found"
assert Path(params_resourcespath + "/HUMAN/gencode.v46.splice_sites_ends.feather").exists(), "Splice site ends feather file not found"

human_settings = dna_model.OrganismSettings(
    fasta_path=Path(params_resourcespath + "/HUMAN/GRCh38.p13.genome.fa"),
    gtf_feather_path=Path(params_resourcespath + "/HUMAN/gencode.v46.annotation.gtf.gz.feather"),
    pas_feather_path=Path(params_resourcespath + "/HUMAN/polyadb_human_v3_exon3_contiguous_gtfv46.feather"),
    splice_site_starts_feather_path=Path(params_resourcespath + "/HUMAN/gencode.v46.splice_sites_starts.feather"),
    splice_site_ends_feather_path=Path(params_resourcespath + "/HUMAN/gencode.v46.splice_sites_ends.feather"),
)

assert Path(params_resourcespath + "/MOUSE/GRCm38.p6.genome.fa").exists(), "FASTA file not found"
## check if the fai file exists, if not create it
if not Path(params_resourcespath + "/MOUSE/GRCm38.p6.genome.fa.fai").exists():
    print("FASTA index file not found, creating it...")
    import subprocess
    subprocess.run(["samtools", "faidx", params_resourcespath + "/MOUSE/GRCm38.p6.genome.fa"], check=True)

assert Path(params_resourcespath + "/MOUSE/gencode.vM23.annotation.gtf.gz.feather").exists(), "GTF feather file not found"
assert Path(params_resourcespath + "/MOUSE/gencode.vM23.splice_sites_starts.feather").exists(), "Splice site starts feather file not found"
assert Path(params_resourcespath + "/MOUSE/gencode.vM23.splice_sites_ends.feather").exists(), "Splice site ends feather file not found" 

mouse_settings = dna_model.OrganismSettings(
    fasta_path=Path(params_resourcespath + "/MOUSE/GRCm38.p6.genome.fa"),
    gtf_feather_path=Path(params_resourcespath + "/MOUSE/gencode.vM23.annotation.gtf.gz.feather"),
    splice_site_starts_feather_path=Path(params_resourcespath + "/MOUSE/gencode.vM23.splice_sites_starts.feather"),
    splice_site_ends_feather_path=Path(params_resourcespath + "/MOUSE/gencode.vM23.splice_sites_ends.feather"),
)

settings = {
    dna_model.Organism.HOMO_SAPIENS: human_settings,
    dna_model.Organism.MUS_MUSCULUS: mouse_settings,
}

# ++++ device selection ++++ 

if params_device == "cpu":
    print("CPU device selected")
    device = jax.devices("cpu")[0]
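else:
    print("Defaulting into GPU")
    # Without this assignment, `device` would be undefined on the GPU path.
    device = jax.devices("gpu")[0]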


# model retrieval -------------
model = dna_model.create_from_huggingface('all_folds',
                                          organism_settings=settings,
                                          device=device)

print("Model selected")

interval = genome.Interval('chr22', 36_150_498, 36_252_898).resize(
    dna_client.SEQUENCE_LENGTH_16KB
)

# Define the tissues/cell-types to predict expression for.
ontology_terms = [
    'UBERON:0001159',  # Colon - Sigmoid.
    'UBERON:0001155',  # Colon - Transverse.
]

# Make predictions.
output = model.predict_interval(
    interval=interval,
    requested_outputs={
        dna_client.OutputType.ATAC,
        dna_client.OutputType.RNA_SEQ,
        dna_client.OutputType.CAGE,
    },
    ontology_terms=ontology_terms,
)
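The repeated existence checks in the snippet above could be folded into two small helpers: one that reports every missing resource at once, and one that builds the FASTA index only when needed. This is a sketch, not part of alphagenome. The names `find_missing` and `ensure_fasta_index` are hypothetical, the file names are the human ones used above, and `samtools` is assumed to be on the PATH when an index has to be built.

```python
import shutil
import subprocess
from pathlib import Path

# Relative file names taken from the snippet above (human resources).
HUMAN_FILES = [
    "HUMAN/GRCh38.p13.genome.fa",
    "HUMAN/gencode.v46.annotation.gtf.gz.feather",
    "HUMAN/polyadb_human_v3_exon3_contiguous_gtfv46.feather",
    "HUMAN/gencode.v46.splice_sites_starts.feather",
    "HUMAN/gencode.v46.splice_sites_ends.feather",
]


def find_missing(root, relative_paths):
    """Return the paths under `root` that do not exist."""
    root = Path(root)
    return [root / rel for rel in relative_paths if not (root / rel).exists()]


def ensure_fasta_index(fasta_path):
    """Return the .fai path, creating it with samtools if it is absent."""
    fasta_path = Path(fasta_path)
    index_path = fasta_path.with_name(fasta_path.name + ".fai")
    if index_path.exists():
        return index_path
    if shutil.which("samtools") is None:
        raise RuntimeError(f"samtools not found on PATH; cannot build {index_path}")
    # check=True raises CalledProcessError if samtools exits with an error.
    subprocess.run(["samtools", "faidx", str(fasta_path)], check=True)
    return index_path
```

Calling `find_missing(params_resourcespath, HUMAN_FILES)` before building the OrganismSettings gives a single list of absent files instead of stopping at the first failed assert.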