Question

Why Are There More Non-Synonymous SNPs Than Synonymous SNPs In The 1000 Genomes Data?

10

11.8 years ago

Ali R. Vahdati ▴ 190

I have downloaded SNP data from the 1000 Genomes Project through BioMart and the UCSC Genome Browser. These SNPs are annotated as synonymous or non-synonymous (missense). All the textbooks say that the number of synonymous mutations should be much higher than the number of non-synonymous mutations. Why, then, do I consistently observe a higher number of non-synonymous SNPs for the human genome? Could there be a mistake in the annotation of these SNPs, or is there something else that I am missing?

1000genomes snp • 11k views

ADD COMMENT • link updated 11.8 years ago by Dgmacarthur ▴ 310 • written 11.8 years ago by Ali R. Vahdati ▴ 190

1

Can you give us more details about exactly what you are downloading, so that we can check it ourselves? One possible explanation (I may be wrong, so someone correct me if I am) is that rare variants are enriched for damaging/non-synonymous changes. The idea is that silent mutations can become more frequent in the population, whereas deleterious ones will remain infrequent. Since the 1000 Genomes Project is looking at rare variants, this could be one reason. I am not 100% confident about this, so perhaps someone with more knowledge can say more.

ADD REPLY • link 11.8 years ago by Davy ▴ 410

0

Thank you. One of the datasets I downloaded: UCSC Table Browser, human genome assembly hg19, All SNPs(135), filtered to "validated by 1000genomes" with function set to missense, versus the same table with function set to synonymous.

ADD REPLY • link 11.8 years ago by Ali R. Vahdati ▴ 190

0

Try removing the function filter and check again. I'm downloading it now. I'll post my answer as soon as I get the data.

ADD REPLY • link 11.8 years ago by Davy ▴ 410

0

Here are the results when removing the function filter and counting synonymous and non-synonymous SNPs: non-synonymous SNPs = 161737 and synonymous SNPs = 124014

ADD REPLY • link 11.8 years ago by Ali R. Vahdati ▴ 190

3

I got the same counts as you for coding-synon and for missense. Looking at the different combinations of entries in the function field, a lot of SNPs seem to carry contradictory annotations; for instance, 6000+ SNPs are listed as being coding synonymous AND intronic, which is obviously not possible. I'm therefore not sure how much faith I would place in these annotations. You could try stripping off the functional annotations, making a new file of the positions, and running the ANNOVAR program with hg19 to get fresh, up-to-date annotations. I can't guarantee that this will yield a different result, but for my own peace of mind it would be interesting to see the comparison.
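One way to see which combinations of terms occur in the function field is to tally them. The following is a minimal sketch, assuming the function values have already been read from the downloaded table as comma-separated strings; the function name `tally_func_combinations` and the toy values are made up for illustration and are not real dbSNP records.

```python
from collections import Counter


def tally_func_combinations(func_values):
    """Count how often each combination of function terms occurs.

    Each value is assumed to be a comma-separated string such as
    "coding-synon,intron". Term order is ignored.
    """
    combos = Counter()
    for value in func_values:
        terms = frozenset(t.strip() for t in value.split(",") if t.strip())
        combos[terms] += 1
    return combos


# Made-up example values for illustration only.
toy_func_column = [
    "missense",
    "coding-synon,intron",
    "intron,coding-synon",
    "coding-synon",
    "intron",
]

combos = tally_func_combinations(toy_func_column)
for terms, n in combos.most_common():
    print(sorted(terms), n)
```

Using a `frozenset` as the key means "coding-synon,intron" and "intron,coding-synon" are counted as the same combination.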

ADD REPLY • link updated 11.8 years ago by Istvan Albert 100k • written 11.8 years ago by Davy ▴ 410

6

It certainly IS possible for a SNP to have two different annotations, due to alternative splicing: in these cases, the SNP is synonymous in some transcripts of a gene, and not incorporated in others (i.e. is intronic).

ADD REPLY • link 11.8 years ago by Dgmacarthur ▴ 310

0

If a SNP is intronic, then it is in an intron, and therefore will never be incorporated into a transcript, regardless of alternative splicing. Are exons that are not incorporated into the transcript also called introns? Sorry, a little confused now.

ADD REPLY • link 11.8 years ago by Davy ▴ 410

1

In alternative splicing there are multiple transcripts, so the multiple annotations refer to each possible transcript. For example, look at the top graphic on the Alternative Splicing Wikipedia page. If you had a SNP in the yellow alternative exon, then it would have two annotations: on the left transcript -- exon, and on the right transcript -- intron.

ADD REPLY • link 11.8 years ago by Brad Chapman 9.7k

0

I thought that, to continue with the wiki example, introns are the black lines, so any variant contained within an intron could never be included in any transcript, and that in the context of alternative splicing there is "exon shuffling", but whether or not they are included in the transcript, they are still exons, and the two are mutually exclusive. But this is not the case, it seems? Sorry to go off topic slightly. I just wanted clarification so I can update my understanding of how these annotations work.

ADD REPLY • link 11.8 years ago by Davy ▴ 410

1

Exon/intron only have meaning relative to the specific transcript you're looking at. So in the left transcript the yellow box is an exon and the green box is an intron (spliced out). In the right transcript the green box is an exon and the yellow box is an intron. Even though it is convenient to think of a condensed gene, for the purpose of considering the impact of a change you need to consider each transcript independently. This is why a variation can have multiple annotations.
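The per-transcript view can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the coordinates, the transcript names, and the helper `annotate_position` are hypothetical, and real annotation tools also consider coding frame, UTRs, and splice sites.

```python
def annotate_position(pos, transcripts):
    """Label a genomic position separately for each transcript.

    `transcripts` maps a transcript name to a list of (start, end)
    exon intervals, inclusive. A position inside an exon is 'exon';
    a position between the first and last exon but not in any exon
    is 'intron'; anything else is 'outside'.
    """
    labels = {}
    for name, exons in transcripts.items():
        tx_start = min(start for start, _ in exons)
        tx_end = max(end for _, end in exons)
        if any(start <= pos <= end for start, end in exons):
            labels[name] = "exon"
        elif tx_start <= pos <= tx_end:
            labels[name] = "intron"
        else:
            labels[name] = "outside"
    return labels


# Hypothetical gene with two mutually exclusive exons.
toy_transcripts = {
    "transcript_A": [(100, 200), (300, 400), (700, 800)],  # uses exon 300-400
    "transcript_B": [(100, 200), (500, 600), (700, 800)],  # uses exon 500-600
}

labels = annotate_position(350, toy_transcripts)
print(labels)  # {'transcript_A': 'exon', 'transcript_B': 'intron'}
```

The same position gets two labels because each transcript is evaluated on its own, which is why a single SNP can legitimately be reported as both coding and intronic.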

ADD REPLY • link 11.8 years ago by Brad Chapman 9.7k

0

Right. In which case I retract what I previously said and apologise for the misinformation. Thanks Brad and Dgmacarthur. Now to go back to school.

ADD REPLY • link 11.8 years ago by Davy ▴ 410

0

Thanks for introducing the ANNOVAR program. I will check the annotations again and post the results here.

ADD REPLY • link updated 11.8 years ago by Istvan Albert 100k • written 11.8 years ago by Ali R. Vahdati ▴ 190

Ram · Answer 1 · 2012-07-09

11

11.8 years ago

Dgmacarthur ▴ 310

There are two forces in play here: mutation rate, which introduces new variants, and natural selection, which removes deleterious (harmful) variants.

Because more of the possible nucleotide substitutions in coding sequence produce missense rather than synonymous changes, we expect most new coding mutations to be missense. However, missense substitutions are more likely to be harmful and thus removed from the population, so the RATE OF OBSERVED SUBSTITUTION PER SITE (which I assume is what your textbooks are referring to) is always higher at synonymous sites. However, the OBSERVED NUMBER OF VARIANTS will often be higher for missense than synonymous variants, especially once you start digging down into the low-frequency variants that haven't had much of a chance to be affected by natural selection, as is the case for the 1000G data.
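The first point can be checked by enumerating every single-nucleotide change in the standard genetic code. The sketch below assumes that all sense codons and all substitutions are equally likely, which is a simplification: real mutation spectra (for example transitions versus transversions) and codon usage are not uniform. The function name `count_substitution_classes` is chosen here for illustration.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_substitution_classes():
    """Classify every single-nucleotide change in the 61 sense codons."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # skip stop codons as starting points
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = count_substitution_classes()
print(counts)  # {'synonymous': 134, 'missense': 392, 'nonsense': 23}
```

Under this uniform assumption, roughly 71% of the 549 possible changes are missense and about 24% are synonymous, so new coding mutations are expected to be mostly missense before selection acts.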

[Added in edit: worth mentioning that a similar pattern is seen in other large human sequence data-sets.]

ADD COMMENT • link 11.8 years ago by Dgmacarthur ▴ 310

2

I agree with Daniel's explanation. A simple check would be to threshold the 1000G calls by some minor allele frequency (~1-5%) to check that you recover the expectation.
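A frequency threshold of this kind could be applied as in the sketch below. It assumes each variant has already been reduced to a `(maf, consequence)` pair; the function name `missense_to_synonymous_ratio` and the toy values are hypothetical and were constructed only to show the mechanics, not taken from 1000G.

```python
from collections import Counter


def missense_to_synonymous_ratio(variants, min_maf=0.0):
    """Return missense/synonymous count ratio for variants with MAF >= min_maf.

    `variants` is an iterable of (maf, consequence) pairs, where
    consequence is 'missense' or 'synonymous'. Returns None when no
    synonymous variant passes the threshold.
    """
    counts = Counter(cons for maf, cons in variants if maf >= min_maf)
    if counts["synonymous"] == 0:
        return None
    return counts["missense"] / counts["synonymous"]


# Made-up variants: rare ones skew missense, common ones skew synonymous.
toy_variants = [
    (0.001, "missense"),
    (0.002, "missense"),
    (0.004, "missense"),
    (0.003, "synonymous"),
    (0.08, "missense"),
    (0.25, "missense"),
    (0.12, "synonymous"),
    (0.30, "synonymous"),
    (0.40, "synonymous"),
]

ratio_all = missense_to_synonymous_ratio(toy_variants)
ratio_common = missense_to_synonymous_ratio(toy_variants, min_maf=0.05)
print(ratio_all, ratio_common)
```

If the explanation holds for real data, the ratio should drop as `min_maf` is raised, because the rare, mostly missense variants are filtered out.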

ADD REPLY • link 11.8 years ago by Adam ★ 1.0k

0

I have re-annotated all HapMap SNPs with Ensembl VEP. HapMap SNPs are common, with a MAF of 5% and above. Out of 1560681 SNPs, 13508 were non-synonymous and 13767 were synonymous. So the syn/nonsyn ratio has changed for these high-frequency SNPs, which confirms Daniel's explanation. The difference between syn and nonsyn still seems minor, though.
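For comparison, the two sets of counts reported in this thread can be turned into non-synonymous/synonymous ratios. This is plain arithmetic on the numbers quoted above; the variable names are illustrative.

```python
# Counts quoted in this thread (UCSC snp135 with the 1000 Genomes filter, and HapMap).
ucsc_nonsyn, ucsc_syn = 161737, 124014
hapmap_nonsyn, hapmap_syn = 13508, 13767

ucsc_ratio = ucsc_nonsyn / ucsc_syn
hapmap_ratio = hapmap_nonsyn / hapmap_syn

print(round(ucsc_ratio, 3))    # 1.304
print(round(hapmap_ratio, 3))  # 0.981
```

So the ratio falls from about 1.30 in the set that includes rare variants to just under 1 in the common HapMap SNPs.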

ADD REPLY • link 11.8 years ago by Ali R. Vahdati ▴ 190

0

In my study, I took from 1000G only the most frequent SNPs, with a MAF of 0.1 and above, for the OXPHOS complexes, and there are still differences in the numbers. Perhaps it has something to do with the process (OXPHOS) itself.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by gwas_maniac ▴ 20

score 4 · Answer 2 · 2012-07-09

4

11.8 years ago

zam.iqbal.genome ★ 1.8k

Does the "Validated by 1000 Genomes" filter mean validated, or called? 1000 Genomes chose to experimentally validate a very specific set of SNPs, which was not intended to be a random sample of all SNPs.

I think what you want is a "discovered by 1000g" or "called in 1000g" filter. Your numbers are too low to be the total numbers of SNPs called in 1000g, which number in the millions. So, bottom line: you were quite right to be suspicious, and although I have not checked specifically, I'm sure you'll find more non-syn than syn SNPs once you download the correct set. I'm not a UCSC expert, so I'm afraid I can't help you actually do it, though. Best, Zam

ADD COMMENT • link 11.8 years ago by zam.iqbal.genome ★ 1.8k

0

I am not completely sure, but I think that what they call "validated by the 1000 Genomes" actually means called by the 1000 Genomes Project. The reason is that even in their 2010 paper they found 68,300 non-synonymous SNPs, 34,161 of which were novel. So the number of SNPs may be so much lower than millions because it is limited to synonymous or non-synonymous ones. Also, when I downloaded SNP data from BioMart, regardless of validation status, I found only 530427 unique non-synonymous SNPs.