Question

How to find out codon changes in non-synonymous and synonymous SNPs

10.0 years ago

Ric ▴ 430

I ran snpEff and now have the annotated VCF file.

How can I find the most common codon changes, e.g. CCG (Proline) to CCA (Proline), and the number of times each occurs (e.g. 300), among the non-synonymous and synonymous SNPs?

snpeff snpsift SNP effect • 5.4k views

updated 10.0 years ago by DG 7.3k • written 10.0 years ago by Ric ▴ 430

Ram · Answer 1 · 2014-05-20

Can you paste a line from your snpEff output? The snpEff output file that I have can be parsed easily with an awk one-liner:

grep "SYNONYMOUS" input.snpeff | awk '{split($0,a,"|"); split(a[3],b,"/"); print b[1] "\t" b[2]}'

produces the following result:

tAt      tTt
cTt      cGt
ggT      ggC
Cga      Aga
acG      acT
acA      acT

The pattern "SYNONYMOUS" also matches NON_SYNONYMOUS, so this grep picks up both synonymous and non-synonymous SNPs. You can then take the output and do the counting. Is this what you need?
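The counting step could be sketched by feeding the same extraction into sort and uniq -c. This is only a sketch: it assumes, as the one-liner does, that the codon change is the third "|"-separated field on each line, it looks only at the first effect on each line, and tally_codon_changes is a made-up helper name.

```shell
# Hypothetical helper: read snpEff-annotated lines on stdin and print
# "count<TAB>old/new" for each codon change, most frequent first.
# Assumes the codon change (e.g. ccG/ccA) is the third "|"-separated field.
tally_codon_changes() {
  awk '{split($0, a, "|"); print a[3]}' \
    | sort \
    | uniq -c \
    | sort -k1,1nr \
    | awk '{print $1 "\t" $2}'   # strip the padding that uniq -c adds
}

# Non-synonymous only:
#   grep "NON_SYNONYMOUS" input.snpeff | tally_codon_changes | head -n 10
# Synonymous only (exclude the lines that also match NON_SYNONYMOUS):
#   grep "SYNONYMOUS" input.snpeff | grep -v "NON_SYNONYMOUS" | tally_codon_changes | head -n 10
```

Filtering with grep before the tally keeps the two classes separate, which is what the question asks for.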

Ram · Answer 2 · 2014-05-20

Keep in mind that the INFO field carrying the snpEff annotations can contain multiple predicted effects, depending on the organism and databases you annotate with. With human data, for instance, a single position often gets several annotations because multiple overlapping transcripts can each be affected differently.
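One way to handle several effects per variant is to split the annotation on commas and look at each effect in turn. The sketch below assumes a VCF whose INFO column carries the older snpEff layout, EFF=Effect(Impact|Class|Codon_change|...), with effects separated by commas. The newer ANN= layout is different and is not handled. list_codon_changes is a hypothetical helper name, and reporting each effect/codon-change pair once per variant is a choice made here, not something snpEff prescribes.

```shell
# Hypothetical helper: for each variant in a snpEff-annotated VCF, print
# "EFFECT<TAB>old/new" once per distinct effect/codon-change pair.
# Assumes INFO contains EFF=Effect(Impact|Class|Codon_change|...),Effect(...)
list_codon_changes() {
  awk -F'\t' '
    /^#/ { next }                                  # skip VCF header lines
    {
      split("", seen)                              # reset per-variant de-duplication
      n = split($8, info, ";")                     # column 8 is INFO
      for (i = 1; i <= n; i++) {
        if (info[i] !~ /^EFF=/) continue
        m = split(substr(info[i], 5), effs, ",")   # one entry per predicted effect
        for (j = 1; j <= m; j++) {
          split(effs[j], f, "|")
          name = f[1]
          sub(/\(.*/, "", name)                    # keep only the effect name
          key = name "\t" f[3]
          if (name ~ /SYNONYMOUS/ && f[3] != "" && !(key in seen)) {
            seen[key] = 1
            print key
          }
        }
      }
    }' "$1"
}

# Usage: most frequent effect/codon-change pairs
#   list_codon_changes input.vcf | sort | uniq -c | sort -k1,1nr | head -n 10
```

De-duplicating within a variant stops the same codon change being counted once for every overlapping transcript. Drop the seen check if per-transcript counts are what you want.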

You can combine awk and grep as @Ashutosh recommended. You can also use something like PyVCF to read your VCF file programmatically, although you will still have to parse the snpEff effect(s) out of the INFO field yourself. If you are dealing with model organism data, you could also use a tool like GEMINI to pick out the top-scoring impact per variant for you and store everything in an sqlite3 database, which you can then query for your counts.

There are quite a few ways to approach this problem, depending on your level of programming comfort and the system you are working in.