Proposal to Work Around the 1 Million BP Limit to Analyze Billion-Base-Pair Genomes

Could we analyze genomes upwards of a billion base pairs by leveraging AlphaGenome’s ability to identify non-coding regulatory elements? Suppose we want to predict a protein’s expression by finding which regions of the DNA regulate it. A first pass through AlphaGenome could tell us, from hallmark characteristics (e.g., high CAGE signal, strong histone marks, high chromatin accessibility, or predicted effects of a genetic variant), approximately which parts of a million-BP slice directly regulate our target protein and which parts may regulate something else in the genome. In parallel, we would run the same analysis on other million-BP slices, saving to a storage file only the parts that appear to regulate something elsewhere in the genome. We could then transplant those identified regulatory elements into the target slice, replacing whatever did not directly regulate our target protein, and rerun AlphaGenome to test whether the transplanted parts directly regulate it. If they don’t, leave them in the storage file; we’ll use them in a later step.
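A minimal sketch of the first-pass classification described above. This does not use AlphaGenome’s real API: the region dicts, the track names (`cage`, `accessibility`), the `linked_to_target` flag, and the score cutoffs are all illustrative assumptions standing in for whatever the model’s track predictions would actually provide.

```python
SLICE_BP = 1_000_000  # AlphaGenome's maximum input length, in base pairs

def classify_regions(regions, cage_cutoff=5.0, accessibility_cutoff=0.5):
    """Split candidate regions into target-regulating vs. other-regulating.

    Each region is a dict of hallmark scores we imagine extracting from
    AlphaGenome's predicted tracks (CAGE, chromatin accessibility, ...)
    plus a flag for whether its activity links to the target gene.
    All names and thresholds here are hypothetical placeholders.
    """
    target_regulators, other_regulators = [], []
    for r in regions:
        is_regulatory = (r["cage"] > cage_cutoff
                         or r["accessibility"] > accessibility_cutoff)
        if not is_regulatory:
            continue  # likely inert sequence: a slot we can overwrite later
        if r["linked_to_target"]:
            target_regulators.append(r)
        else:
            other_regulators.append(r)  # goes to the storage file
    return target_regulators, other_regulators

# Toy example: three regions scored within one 1 Mb slice.
regions = [
    {"start": 10_000,  "cage": 8.2, "accessibility": 0.7, "linked_to_target": True},
    {"start": 500_000, "cage": 6.1, "accessibility": 0.2, "linked_to_target": False},
    {"start": 900_000, "cage": 0.3, "accessibility": 0.1, "linked_to_target": False},
]
direct, stored = classify_regions(regions)
```

Here `direct` holds the elements kept in the target slice, while `stored` holds candidates to transplant into other slices later.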

This process of finding regulatory elements and testing them within the target’s slice can be run in parallel as many times as necessary. Each agent driving an AlphaGenome instance independently takes slices from the genome, saves potential regulatory elements, and retrieves the trimmed target slice (with non-directly-regulating elements removed) to test its candidates against. Successful candidates that directly regulate our target protein are saved to another storage file, where they become new target slices whose own regulators we then search for. Ultimately, we find which regions of the DNA regulate our protein by tracing back through the genome. To keep this traceback from running forever, we might maintain a ledger of sequences known to be triggered solely or primarily by extracellular phenomena (providing natural stopping points), or simply merge regulatory elements that share the same function and regulator.
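The traceback could be sketched as a breadth-first walk over discovered regulator relationships. This is a toy model, not anything AlphaGenome provides: `regulates` is a hypothetical map from element to the elements it directly regulates (in practice each edge would come from one agent’s transplant test), and the merging of already-seen elements plays the deduplication role described above.

```python
from collections import deque

def trace_regulators(target, regulates, extracellular_ledger):
    """Walk upstream from `target`, collecting (regulator, regulated) edges.

    `extracellular_ledger` is the set of elements known to be triggered by
    extracellular phenomena; the walk stops expanding at those elements.
    """
    # Invert the edges: for each element, who regulates it?
    regulated_by = {}
    for src, dsts in regulates.items():
        for d in dsts:
            regulated_by.setdefault(d, set()).add(src)

    chain, seen = [], {target}
    queue = deque([target])
    while queue:
        node = queue.popleft()
        if node in extracellular_ledger:
            continue  # known extracellular trigger: a natural stopping point
        for upstream in regulated_by.get(node, ()):
            if upstream not in seen:  # merging duplicates also prevents cycles
                seen.add(upstream)
                chain.append((upstream, node))
                queue.append(upstream)
    return chain

# Toy regulatory graph: receptor_site -> silencer_B -> enhancer_A -> target.
toy_edges = {
    "enhancer_A": ["target_gene"],
    "silencer_B": ["enhancer_A"],
    "receptor_site": ["silencer_B"],
}
chain = trace_regulators("target_gene", toy_edges, {"receptor_site"})
```

The returned `chain` is the regulatory path traced back from the target to its extracellular starting point.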

However, none of this may be possible if there isn’t enough room in the target protein’s million-BP slice even after removing the non-directly-regulating parts. There is also no guarantee that random slicing won’t cut regulatory elements in half; overlapping adjacent slices and comparing them to detect truncated elements would significantly increase compute. Lastly, transplanting will likely disrupt the natural chromatin architecture unless the spacer sequence between a regulator and the element it regulates is identified as crucial. This is all speculative, but I’m nonetheless very excited about the future of AlphaGenome and the new architectures that Google DeepMind, other groups, and the community will develop. If you’ve made it this far, thank you for reading my post. I’m eager to discuss more with anyone about anything!
