I'm using annovar to map 1KG data onto mRNA transcripts. Since I'm interested in nsSNPs, I would like to know the whole protein sequence. annovar provides the position and the wt and mt residues, but no information on the whole sequence; at least I couldn't find any. I parsed the mRNA RefSeq identifiers out of annovar's output and tried to map them to RefSeq protein sequences. However, the wt amino acid in annovar's annotation often differs from the residue found at that position in the RefSeq sequence. What is the correct approach? How does annovar determine amino acids? By 'simple' translation of the mutant codon in the mRNA RefSeq file?
This sounds to me like cases we have often seen: the reference human genome, from which RefSeq mRNA and protein sequences were built, does contain minor alleles, even homozygous minor alleles in places. In other words, when building a consensus genome from six individuals, as was done for the reference human genome, some minor alleles can end up in the gene models. When looking at many more genomes, however, it becomes clear that certain positions in the reference sequence do not represent the major allele. An extreme case of this is that some UGT2A and UGT2B gene family members are absent from some Asian populations.
So, you could do the mapping as you describe, but let Annovar overrule RefSeq for single-residue discrepancies, as long as the Annovar residue is based on sampling of many individuals. If you are working with data from a population or individual not of European origin, you may need to compare against a reference genome from that population instead.
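To make the "let Annovar overrule RefSeq" idea concrete, here is a minimal Python sketch. It assumes you have already fetched the RefSeq protein sequence as a plain string, and that the amino-acid change is written in the `p.A4V` style (one-letter wt residue, 1-based position, one-letter mt residue). The function names `parse_aa_change` and `build_sequences` are made up for illustration and are not part of Annovar.

```python
import re

# Matches a one-letter substitution such as "p.A4V" anywhere in the string.
# Assumption: the annotation uses this one-letter notation.
AA_CHANGE = re.compile(r"p\.([A-Z*])(\d+)([A-Z*])")


def parse_aa_change(annotation):
    """Return (wt, position, mt) from a string containing e.g. 'p.A4V'."""
    match = AA_CHANGE.search(annotation)
    if match is None:
        raise ValueError(f"no amino-acid substitution found in {annotation!r}")
    wt, position, mt = match.groups()
    return wt, int(position), mt


def build_sequences(refseq_protein, annotation):
    """Build wt and mt protein sequences, letting the annotation overrule RefSeq.

    Returns (wt_seq, mt_seq, agrees), where `agrees` is False when the RefSeq
    residue at that position differs from the annotated wt residue.
    """
    wt, position, mt = parse_aa_change(annotation)
    if not 1 <= position <= len(refseq_protein):
        raise ValueError(f"position {position} is outside the protein sequence")

    refseq_residue = refseq_protein[position - 1]  # positions are 1-based
    agrees = refseq_residue == wt

    prefix = refseq_protein[: position - 1]
    suffix = refseq_protein[position:]
    # The annotated wt residue replaces whatever RefSeq has at that position.
    wt_seq = prefix + wt + suffix
    mt_seq = prefix + mt + suffix
    return wt_seq, mt_seq, agrees


if __name__ == "__main__":
    # Toy sequence, not a real protein.
    toy_protein = "MKTAYIAK"
    for change in ("p.A4V", "p.S3G"):
        wt_seq, mt_seq, agrees = build_sequences(toy_protein, change)
        note = "matches RefSeq" if agrees else "overrules RefSeq"
        print(change, wt_seq, mt_seq, note)
```

Keeping the `agrees` flag lets you log every position where the two sources disagree, so you can check later whether those are the minor-allele cases described above. A single-residue swap like this only makes sense when the sequences otherwise line up; a mismatch at many positions more likely means the wrong transcript or protein version was fetched.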