I have VCF files containing SNPs called on different samples from mRNA sequencing data. There's usually the SNP position, the reference and alternative allele and, among other things, the effect of the mutation on the corresponding amino acid.
Here's the thing. If 2 SNPs are located next (1 or 2 bp apart) to each other, the caller doesn't necessarily output the right a.a as it considers only one SNP at a time. That might be right for a sample showing only one of the two mutations, but if a sample have 2 (or 3 actually) adjacent mutations, it can lead to false results in the predicted amino acid.
I am developing a script to tackle this issue but was wondering two things:
- Is there any script / software already written to solve this problem?
- If not, I will continue with my in-house script. Another problem arise: I have the position of the two SNPs, but have to find the 3rd base on the codon or the nearby ones (depending on the reading frame, it can cover only 1 or 2 codons). I can't really look it directly on the reference fasta file as I won't know the reading frame. So I have to use the GFF, to get the transcript, account for all the introns/exons lengths and then calculate the real reading frame to then get the actual missing bases and finally check for any new amino acid outputted considering the 2 adjacent SNPs. It's quite complicated, especially as my programming skills aren't great. Any easier method you could think of?
It would be easier to just manually check all the close SNPs with a genome browser and a codon table, but the whole point of that is to automate things a bit (when you have dozens of pairs it can get tedious).
Thanks a lot for your help!