Hi, I'm having some trouble extracting the protein sequences of missense mutations from ANNOVAR output files.
I would like to create all of the possible neopeptides arising from missense mutations in TCGA tumor samples. For this I used ANNOVAR to get the gene, the position, and the amino acid change for each mutation. The problem is that ANNOVAR only reports the wild-type and mutant amino acids and the position of the mutation in the protein. My goal is to create all possible nonamers (9-mers) that contain the mutated residue, so that I can calculate PHBR scores later. I assume that if ANNOVAR can report the amino acid change and its position, at some point it has to handle the whole protein sequence.
Do you have any idea how to access this information?
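To make the nonamer goal concrete, here is a minimal Python sketch of what I would do once the full wild-type protein sequence is available. The function name `mutant_nonamers` and the toy sequence are placeholders of mine, not part of ANNOVAR, and I am assuming the reported position is 1-based.

```python
def mutant_nonamers(protein, pos, wt, mut, k=9):
    """Return every k-mer of the mutated protein that covers position `pos`.

    `pos` is assumed to be 1-based. Raises ValueError if the wild-type
    residue does not match the sequence.
    """
    idx = pos - 1
    if not 0 <= idx < len(protein):
        raise ValueError(f"position {pos} is outside the protein (length {len(protein)})")
    if protein[idx] != wt:
        raise ValueError(f"expected {wt} at position {pos}, found {protein[idx]}")

    # Apply the single amino acid substitution.
    mutated = protein[:idx] + mut + protein[idx + 1:]

    # Window start positions that keep the mutated residue inside the k-mer
    # and the k-mer inside the protein.
    first = max(0, idx - k + 1)
    last = min(idx, len(mutated) - k)
    return [mutated[s:s + k] for s in range(first, last + 1)]


# Toy example: a 20-residue sequence with an L->K change at position 10.
toy_protein = "ACDEFGHIKLMNPQRSTVWY"
for peptide in mutant_nonamers(toy_protein, 10, "L", "K"):
    print(peptide)
```

A mutation far enough from both ends gives nine nonamers; mutations near the N- or C-terminus give fewer.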
So far I have approached the problem with biomaRt and ran into the following issues:
- BiomaRt servers are sometimes unreachable
- Protein sequences on biomaRt are based on GRCh37, and I used the GRCh38 reference genome to run ANNOVAR (I cannot change that; for particular reasons I must use GRCh38)
- In many cases the wild-type amino acid indicated in the ANNOVAR output does not match the protein sequences in the biomaRt database. For example, ANNOVAR reports an E->K mutation at position 1146, but there is no E at position 1146 of the wild-type protein (I checked all the splice variants; in many cases none of them matches)
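The check described in the last point can be sketched roughly as follows, assuming the isoform sequences have already been downloaded into a dict. The function name `isoforms_matching`, the transcript IDs, and the sequences are made-up placeholders, not real data.

```python
def isoforms_matching(isoforms, pos, wt):
    """Return the IDs of isoforms whose residue at 1-based `pos` equals `wt`.

    `isoforms` maps a transcript ID to its protein sequence. Isoforms that
    are shorter than `pos` are simply skipped.
    """
    return [
        tid
        for tid, seq in isoforms.items()
        if 1 <= pos <= len(seq) and seq[pos - 1] == wt
    ]


# Hypothetical toy isoforms; real sequences would be hundreds of residues long.
toy_isoforms = {
    "tx1": "MKEL",
    "tx2": "MKKL",
    "tx3": "MK",
}
print(isoforms_matching(toy_isoforms, 3, "E"))
```

An empty result is the situation I keep running into: none of the splice variants carries the wild-type residue at the reported position.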
If it is not possible to get the sequences directly from ANNOVAR, what would be the easiest and most accurate approach to obtain the protein sequences?
Thank you very much for your help, Benjamin