Hi, I’m trying to predict gene expression using AlphaGenome. However, when I predict a specific gene in different cell lines, I get a different number of output tracks. Here is an example:
For the gene “SLC40A1”, I tested on PANC1 (“EFO:0002713”) and GM12878 (“EFO:0002784”), following the official “quick start”. In the section named “Predict outputs for a genome interval (reference genome)”, my command was as follows:
However, the output.rna_seq.values.shape results were (1048576, 5) and (1048576, 3), respectively. Also, in the plotting part, the titles of the tracks were different:
[list of track titles not preserved] for the EFO:0002784 results, while only total (+), total (-), polyA+ (in that order) appeared for EFO:0002713.
My questions are:
Why are there different output track types? I suppose it may be because different types of RNA-seq data were used for training in different cell lines?
How can I identify which tracks correspond to total RNA-seq (-) and total RNA-seq (+)? The order is not always the same across cell lines: in the example above, polyA+ (+) came first for EFO:0002784, while total (+) came first for EFO:0002713.
Thanks for your questions! Please find responses below:
Why are there different output track types? I suppose it may be because different types of RNA-seq data were used for training in different cell lines?
I’m not sure about this specific cell line, but the methods section of our paper, “ENCODE RNA-seq Data”, details how RNA-seq tracks were selected. Most likely the stranded polyA tracks for EFO:0002713 didn’t pass our QC measures.
How can I identify which tracks correspond to total RNA-seq (-) and total RNA-seq (+)? The order is not always the same across cell lines: in the example above, polyA+ (+) came first for EFO:0002784, while total (+) came first for EFO:0002713.
The easiest way is to use the TrackData filter options to filter tracks. So for your example you can do:
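As a rough illustration of the idea (this is not the AlphaGenome API), the sketch below picks tracks by their metadata instead of by column position, so the result does not depend on track order. The field names `assay_title` and `strand`, the helper functions, and the track orders are all assumptions made up for this example; check the real metadata table on your output for the actual column names.

```python
# Minimal, library-free sketch: select tracks by metadata, not by position.
# Field names ("assay_title", "strand") and track orders are hypothetical.

def select_track_indices(metadata, assay_title, strand):
    """Return column indices of tracks matching the assay title and strand."""
    return [
        i for i, row in enumerate(metadata)
        if row["assay_title"] == assay_title and row["strand"] == strand
    ]


def select_columns(values, indices):
    """Keep only the given track columns of a (positions x tracks) matrix."""
    return [[row[i] for i in indices] for row in values]


# Illustrative metadata for an output with five tracks (polyA+ listed first).
five_tracks = [
    {"assay_title": "polyA plus RNA-seq", "strand": "+"},
    {"assay_title": "polyA plus RNA-seq", "strand": "-"},
    {"assay_title": "total RNA-seq", "strand": "+"},
    {"assay_title": "total RNA-seq", "strand": "-"},
    {"assay_title": "polyA plus RNA-seq", "strand": "."},
]

# Illustrative metadata for an output with three tracks (total listed first).
three_tracks = [
    {"assay_title": "total RNA-seq", "strand": "+"},
    {"assay_title": "total RNA-seq", "strand": "-"},
    {"assay_title": "polyA plus RNA-seq", "strand": "."},
]

# The same query works for both outputs even though the positions differ.
plus_in_five = select_track_indices(five_tracks, "total RNA-seq", "+")
plus_in_three = select_track_indices(three_tracks, "total RNA-seq", "+")
minus_in_five = select_track_indices(five_tracks, "total RNA-seq", "-")

# Toy prediction matrix: 2 positions x 5 tracks.
values_five = [
    [0.1, 0.2, 0.3, 0.4, 0.5],
    [1.1, 1.2, 1.3, 1.4, 1.5],
]
total_plus_values = select_columns(values_five, plus_in_five)
```

With a real output, the same pattern would apply to the metadata table attached to `output.rna_seq`, using whatever filter options TrackData actually provides.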
Thank you very much! By the way, could you tell me whether this model can be used to predict gene expression in other cell lines that are not listed in the track metadata table (Suppl Table 2), such as HPDE, OE19, and so on?
Unfortunately not: the model can only make predictions for cell lines that it has seen during training.
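One way to check this up front is to compare the ontology terms you want against the ones present in the track metadata. The sketch below is a library-free illustration under that assumption: the helper name is hypothetical, the supported set here contains only the two terms mentioned in this thread, and `EFO:9999999` is a placeholder rather than a real identifier.

```python
# Minimal sketch: partition requested ontology CURIEs by whether they appear
# in the set of terms covered by the track metadata. Helper name is hypothetical.

def split_by_support(requested, supported):
    """Return (found, missing) lists, preserving the order of `requested`."""
    supported = set(supported)
    found = [curie for curie in requested if curie in supported]
    missing = [curie for curie in requested if curie not in supported]
    return found, missing


# Only the two terms discussed in this thread; a real check would read the
# full list from the track metadata table.
supported_terms = {"EFO:0002713", "EFO:0002784"}

# "EFO:9999999" is a placeholder standing in for an unlisted cell line.
found, missing = split_by_support(["EFO:0002784", "EFO:9999999"], supported_terms)
```

Any term that ends up in `missing` has no trained tracks to predict for.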
Thank you. I’m wondering whether we could fine-tune the foundation model ourselves, using gene expression levels as the output for a specific cell line?
For example, to predict gene expression in OE19, I would build my own dataset of important target genome intervals (such as promoter regions, TSSs, enhancer regions, etc.) paired with the corresponding gene expression levels, and fine-tune the model on it.