I have downloaded SNP data from the 1000 Genomes Project through BioMart and the UCSC Genome Browser. These SNPs are annotated as either synonymous or non-synonymous (missense). Every textbook says that synonymous mutations should be much more numerous than non-synonymous ones. Why, then, do I consistently observe a higher number of non-synonymous SNPs in the human genome? Could there be a mistake in the annotation of these SNPs, or is there something else that I am missing?
There are two forces in play here: mutation rate, which introduces new variants, and natural selection, which removes deleterious (harmful) variants.
Because more of the possible single-base changes in a codon alter the encoded amino acid than leave it unchanged, we expect most new coding mutations to be missense. However, missense substitutions are more likely to be harmful and thus removed from the population, so the RATE OF OBSERVED SUBSTITUTION PER SITE (which I assume is what your textbooks are referring to) is typically higher at synonymous sites. The OBSERVED NUMBER OF VARIANTS, on the other hand, will often be higher for missense than for synonymous variants, especially once you start digging down into the low-frequency variants that haven't had much of a chance to be affected by natural selection, as is the case for the 1000G data.
[Added in edit: worth mentioning that a similar pattern is seen in other large human sequence data-sets.]
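To make the "more ways to be missense" point concrete, here is a rough sketch that enumerates every single-base change in every sense codon of the standard genetic code and classifies it. It assumes all substitutions are equally likely and all codons equally common, which is not true of real genomes (transition bias, CpG hotspots and codon usage all shift the proportions), so treat the output as a back-of-the-envelope illustration only. The function name is my own.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def classify_single_base_changes():
    """Count synonymous, missense and nonsense changes over all sense codons."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # skip stop codons as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue  # not a change
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = classify_single_base_changes()
total = sum(counts.values())
for kind, n in counts.items():
    print(f"{kind}: {n} of {total} ({n / total:.1%})")
```

Under those simplifying assumptions, roughly a quarter of the possible changes are synonymous and a little over 70% are missense, so an excess of missense variants among rare SNPs is what you would expect before selection has acted.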
Does the "Validated by 1000 Genomes" filter mean validated, or called? 1000 Genomes chose to experimentally validate a very specific set of SNPs, which was not intended to be a random sample of all SNPs.
I think what you want is a "discovered by 1000g" or "called in 1000g" filter. Your numbers are too low to be the total number of SNPs called in 1000g, which runs into the millions. So, bottom line: you were quite right to be suspicious, and although I have not checked specifically, I'm sure you'll find more non-syn than syn SNPs once you download the correct set. I'm not a UCSC expert, so I'm afraid I can't help you actually do it, though.
Best, Zam