Hi, guys:
Based on the central dogma, it seems quite easy and straightforward to infer the mRNA sequence and therefore amino acid sequence from a DNA sequence. However, due to alternative splicing, this task becomes uncertain.
I have been using this https://github.com/ikmb/**vcf2prot** tool to infer amino acid sequence from DNA variation data (in VCF format). For example, below is my script to infer the ABO protein amino acid sequence for the ~3,000 individuals with high coverage sequencing data from the 1000 genomes project.
chr=9; pos1=133233278; pos2=133276024; prot=ABO # GRCh38 position for ABO gene
loc=$chr:$pos1-$pos2bcftools view 1kg/high_coverage/chr$chr.vcf.gz -Ou --regions chr$loc | bcftools annotate -Oz -o $prot.vcf.gz --rename-chrs chr.prefix; tabix -f $prot.vcf.gz
bcftools csq $prot.vcf -Ov -o $prot.csq.vcf.gz -f Homo_sapiens.GRCh38.dna.chromosome.$chr.fa -g Homo_sapiens.GRCh38.114.chromosome.$chr.gff3.gz; tabix -f $prot.csq.vcf.gz
vcf2prot -f $prot.csq.vcf -r software/vcf2prot/examples/reference_sequences.fasta -v -g st -o out
The output .fasta files for the first two individuals are posted at my Github foler 001/alphag at master · jielab/001 · GitHub .
Somehow, the second person has an empty HG00097.fasta file. The VCF2PROT software developer said that this is because this person has NO mutations in any of the missense SNPs in the ABO gene, compared to the reference genome. Therefore, his/her amino acid sequence for ABO is identical to that of the reference. But in reality, how could this be even possible, since this HG00097 person is certainly not a twin of the human genome reference?
As shown in HG00096.fasta, the first person has 4 transcripts in two sets. One set has 353 amino acids, and the other has 373 amino acids. So far, I have no way to be sure whether it is 353 or 373. Should I do a RNAseq to confirm?
Will AlphaGenome be able to perform this task and potentially output a more definite result?
Thank you & best regards,
Jie