howdy! I was hoping to evaluate AlphaGenome's full output on an external benchmark, GUANinE. Unfortunately, querying all ~5k output tracks at the full 1Mb context size with predict_interval is rather slow (~1 inference/minute).
I believe this is due to the (enormous) amount of data serialized & sent to the client – one vector of track values for each of the ~1 million base pairs. Is there any way to either
a) have AlphaGenome average_pool the tracks at specifiable interval widths (e.g. center, center +/- 1bp, center +/- 2bp, center +/- 4bp, ..., center +/- 500k bp), or
b) send a lossy, low-rank (e.g. PCA-based) approximation of the outputs?
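For concreteness, here's roughly what I mean by (a), as a client-side sketch. The function name, the dyadic schedule, and the (seq_len, n_tracks) array shape are my own assumptions, not the actual predict_interval output schema:

```python
import numpy as np

def dyadic_center_pool(tracks: np.ndarray) -> np.ndarray:
    """Average-pool per-bp tracks over nested windows around the center.

    tracks: (seq_len, n_tracks) array of per-base-pair predictions
            (hypothetical shape -- the real output schema may differ)
    returns: (n_windows, n_tracks) array, one row per
             center +/- 2**k bp window, widening until the window
             covers the full interval
    """
    seq_len, _ = tracks.shape
    center = seq_len // 2
    rows = []
    half = 0  # start with the center base pair alone
    while True:
        lo, hi = max(0, center - half), min(seq_len, center + half + 1)
        rows.append(tracks[lo:hi].mean(axis=0))
        if lo == 0 and hi == seq_len:
            break  # window now spans the whole interval
        half = 1 if half == 0 else half * 2
    return np.stack(rows)
```

For a 1Mb interval this collapses ~1e6 per-bp rows into roughly two dozen pooled rows per track, so the serialized payload shrinks by about five orders of magnitude.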
one (or both) of these could substantially improve throughput (which, again, appears to be a networking rather than a computational bottleneck, since variant-based throughput is closer to 24 inferences/min).
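And a sketch of what (b) could look like, using a plain truncated SVD (server-side, a randomized SVD would presumably be more practical at this scale; low_rank_compress is a hypothetical name, not an existing API):

```python
import numpy as np

def low_rank_compress(tracks: np.ndarray, k: int):
    """Return two factors whose product approximates `tracks`.

    tracks: (seq_len, n_tracks) per-bp predictions (assumed shape)
    k: retained rank
    returns: left (seq_len, k) and right (k, n_tracks) factors;
             left @ right is the rank-k least-squares approximation
    """
    U, s, Vt = np.linalg.svd(tracks, full_matrices=False)
    return U[:, :k] * s[:k], Vt[:k]
```

At 1e6 bp x ~5k tracks, the full matrix is ~5e9 values; at rank 32 the factors total ~3.2e7 values (1e6*32 + 32*5000), roughly a 150x reduction in what needs to cross the wire, at the cost of lossy reconstruction.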
I ask because even the smallest GUANinE task, dnase_propensity, requires ~105k inferences for a few-shot evaluation, which would take about 10 weeks of client runtime at the current rate (and multi-threading just hits the Mb rate-limit quota instead).
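The back-of-envelope behind that estimate:

```python
# At the observed ~1 inference/minute, 105k sequential inferences take:
inferences = 105_000
rate_per_min = 1
minutes = inferences / rate_per_min
weeks = minutes / (60 * 24 * 7)
print(f"{weeks:.1f} weeks")  # ~10.4 weeks
```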