Question: Why Are There More Non-Synonymous SNPs Than Synonymous SNPs In The 1000 Genomes Data?
10
6.0 years ago by
Zurich, Switzerland
Ali R. Vahdati180 wrote:

I have downloaded SNP data from the 1000 Genomes Project through Biomart and the UCSC Genome Browser. These SNPs are annotated as being synonymous or non-synonymous (missense). All the textbooks say that the number of synonymous mutations should be much higher than the number of non-synonymous mutations. Then why do I consistently observe a higher number of non-synonymous SNPs for the human genome? Do you think there might be a mistake in the annotation of these SNPs, or is there something else that I am missing?

1000genomes snp • 7.3k views
ADD COMMENTlink written 6.0 years ago by Ali R. Vahdati180
1

Can you give us more details about exactly what you are downloading, so that we can check it ourselves? One possible explanation (I may be wrong, so someone correct me if I am) is that rare variants are enriched for damaging/non-synonymous changes. The idea is that silent mutations can become more frequent in the population, whereas deleterious ones remain infrequent. Since the 1000 Genomes project is looking at rare variants, this could be one reason. I am not 100% confident about this, so perhaps someone with more knowledge can say more.

ADD REPLYlink written 6.0 years ago by Davy360

Thank you. One of the data sets I downloaded: UCSC Table Browser, human genome assembly hg19, All SNPs(135), filtered to "validated by 1000genomes" with function set to missense, versus the same table with function set to synonymous.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Ali R. Vahdati180

Try removing the function filter and check again. I'm downloading it now. I'll post my answer as soon as I have the data.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Davy360

Here are the results when removing the function filter and counting synonymous and non-synonymous SNPs: non-synonymous SNPs = 161737 and synonymous SNPs = 124014

ADD REPLYlink written 6.0 years ago by Ali R. Vahdati180
3

I got the same counts as you for coding-synon and for missense. From the different combinations of entries in the function field, it seems that a lot of SNPs carry even contradictory annotations; for instance, 6000+ SNPs are listed as being coding synonymous AND intronic, which is obviously not possible. Given that, I'm not sure how much faith I would place in these annotations. You could try stripping off the functional annotations, making a new file of the positions, and using the ANNOVAR program with hg19 to get fresh, up-to-date annotations. This is mostly for peace of mind; I can't guarantee that it will yield a different result, but it would be interesting to see the comparison.

ADD REPLYlink modified 6.0 years ago by Istvan Albert ♦♦ 77k • written 6.0 years ago by Davy360
6

It certainly IS possible for a SNP to have two different annotations, due to alternative splicing: in these cases, the SNP is synonymous in some transcripts of a gene, and not incorporated in others (i.e. is intronic).

ADD REPLYlink written 6.0 years ago by Dgmacarthur310

If a SNP is intronic, then it is in an intron, and therefore will never be incorporated into a transcript, regardless of alternative splicing. Are exons that are not incorporated into the transcript also called introns? Sorry, a little confused now.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Davy360
1

In alternative splicing there are multiple transcripts, so the multiple annotations refer to each possible transcript. For example, look at the top graphic on the Alternative Splicing Wikipedia page. If you had a SNP in the yellow alternative exon, then it would have two annotations: on the left transcript -- exon, and on the right transcript -- intron.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Brad Chapman9.2k

I thought that, to continue with the wiki example, introns are the black lines, so any variant contained within an intron could never be included in any transcript. I also thought that in the context of alternative splicing there is "exon shuffling", but whether or not they are included in the transcript, they are still exons, and that the two are mutually exclusive. But this is not the case, it seems? Sorry to go slightly off topic. I just wanted clarification so I can update my understanding of how these annotations work.

ADD REPLYlink written 6.0 years ago by Davy360
1

Exon/intron only have meaning relative to the specific transcript you're looking at. So in the left transcript the yellow box is an exon and the green box is an intron (spliced out). In the right transcript the green box is an exon and the yellow box is an intron. Even though it is convenient to think of a condensed gene, for the purpose of considering the impact of a change you need to consider each transcript independently. This is why a variation can have multiple annotations.
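A minimal sketch of this per-transcript view, in Python. The transcript names and exon coordinates are made up for illustration and do not correspond to any real gene; the point is only that one genomic position can be exonic in one transcript and intronic in another.

```python
def annotate(position, transcripts):
    """Classify a genomic position relative to each transcript separately.

    `transcripts` maps a transcript name to a list of (start, end) exon
    intervals (inclusive). All names and coordinates here are hypothetical.
    """
    result = {}
    for name, exons in transcripts.items():
        tx_start = min(start for start, _ in exons)
        tx_end = max(end for _, end in exons)
        if any(start <= position <= end for start, end in exons):
            result[name] = "exon"
        elif tx_start <= position <= tx_end:
            # Inside the transcript's span but spliced out of this isoform.
            result[name] = "intron"
        else:
            result[name] = "outside"
    return result


# Two hypothetical isoforms that share the first and last exon but use
# different middle exons (mutually exclusive alternative exons).
transcripts = {
    "transcript_left": [(100, 200), (300, 400), (700, 800)],
    "transcript_right": [(100, 200), (500, 600), (700, 800)],
}

print(annotate(350, transcripts))
```

A position in the middle exon of the first isoform is reported as "exon" for that isoform and "intron" for the other, which is how a single SNP ends up with both a coding and an intronic annotation.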

ADD REPLYlink written 6.0 years ago by Brad Chapman9.2k

Right. In which case I retract what I previously said and apologise for the misinformation. Thanks Brad and Dgmacarthur. Now to go back to school.

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Davy360

Thanks for introducing the ANNOVAR program. I will check the annotations again and post the results here.

ADD REPLYlink modified 6.0 years ago by Istvan Albert ♦♦ 77k • written 6.0 years ago by Ali R. Vahdati180
11
6.0 years ago by
Dgmacarthur310
Cambridge, UK
Dgmacarthur310 wrote:

There are two forces in play here: mutation rate, which introduces new variants, and natural selection, which removes deleterious (harmful) variants.

Because more of the possible single-nucleotide changes in coding sequence are missense than synonymous, we expect most new coding mutations to be missense. However, missense substitutions are more likely to be harmful and thus removed from the population, so the RATE OF OBSERVED SUBSTITUTION PER SITE (which I assume is what your textbooks are referring to) is always higher at synonymous sites. However, the OBSERVED NUMBER OF VARIANTS will often be higher for missense than synonymous variants, especially once you start digging down into the low-frequency variants that haven't had much of a chance to be affected by natural selection, as is the case for the 1000G data.
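The mutational side of this argument can be checked by brute force. The sketch below enumerates every single-nucleotide change to each of the 61 sense codons of the standard genetic code and classifies it. It assumes all changes are equally likely, which real mutation does not satisfy (transitions, CpG sites and codon usage all matter), so it illustrates the imbalance rather than modelling the human genome.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def count_point_mutations():
    """Classify all single-base changes starting from sense codons."""
    counts = {"synonymous": 0, "missense": 0, "nonsense": 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue  # only start from codons that encode an amino acid
        for pos in range(3):
            for base in BASES:
                if base == codon[pos]:
                    continue
                mutant = codon[:pos] + base + codon[pos + 1:]
                new_aa = CODON_TABLE[mutant]
                if new_aa == aa:
                    counts["synonymous"] += 1
                elif new_aa == "*":
                    counts["nonsense"] += 1
                else:
                    counts["missense"] += 1
    return counts


counts = count_point_mutations()
print(counts)
# {'synonymous': 134, 'missense': 392, 'nonsense': 23}
```

Of the 549 possible changes, roughly three in four alter the amino acid, so before selection acts most new coding variants are expected to be missense.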

[Added in edit: worth mentioning that a similar pattern is seen in other large human sequence data-sets.]

ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Dgmacarthur310
2

I agree with Daniel's explanation. A simple check would be to threshold the 1000G calls by some minor allele frequency (~1-5%) to check that you recover the expectation.
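A sketch of that check in Python, assuming the calls have already been parsed into records with a minor allele frequency and a consequence label. The field names and the toy records are invented for illustration; they are not 1000G data and say nothing about the real result.

```python
from collections import Counter


def count_by_class(variants, min_maf=0.0):
    """Count variants per consequence class, keeping only MAF >= min_maf."""
    counts = Counter()
    for variant in variants:
        if variant["maf"] >= min_maf:
            counts[variant["consequence"]] += 1
    return counts


# Toy records (hypothetical), only to show the mechanics of the filter.
variants = [
    {"id": "var1", "maf": 0.002, "consequence": "missense"},
    {"id": "var2", "maf": 0.004, "consequence": "missense"},
    {"id": "var3", "maf": 0.010, "consequence": "synonymous"},
    {"id": "var4", "maf": 0.030, "consequence": "missense"},
    {"id": "var5", "maf": 0.120, "consequence": "synonymous"},
    {"id": "var6", "maf": 0.250, "consequence": "synonymous"},
    {"id": "var7", "maf": 0.080, "consequence": "missense"},
]

for threshold in (0.0, 0.01, 0.05):
    c = count_by_class(variants, min_maf=threshold)
    print(f"MAF >= {threshold}: missense={c['missense']} synonymous={c['synonymous']}")
```

Running the same count at several thresholds on the real calls would show whether the missense excess shrinks as rare variants are excluded.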

ADD REPLYlink written 6.0 years ago by Adam960

I have freshly annotated all HapMap SNPs with Ensembl VEP. HapMap SNPs are common, with a MAF of 5% and above. Out of 1560681 SNPs, 13508 were non-synonymous and 13767 were synonymous. So the syn/nonsyn ratio has changed for these high-frequency SNPs, which is a confirmation of Daniel's explanation. The difference between syn and nonsyn still seems minor, though.
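For a side-by-side view, the counts reported in this thread can be turned into nonsyn/syn ratios. The numbers below are the ones quoted above; the dataset labels are just shorthand.

```python
def nonsyn_to_syn_ratio(nonsyn, syn):
    """Ratio of non-synonymous to synonymous SNP counts."""
    return nonsyn / syn


# (non-synonymous, synonymous) counts as reported in the thread.
datasets = {
    "UCSC snp135, 1000 Genomes filter": (161737, 124014),
    "HapMap (MAF >= 5%), Ensembl VEP": (13508, 13767),
}

ratios = {name: nonsyn_to_syn_ratio(*pair) for name, pair in datasets.items()}
for name, ratio in ratios.items():
    print(f"{name}: nonsyn/syn = {ratio:.2f}")
# UCSC snp135, 1000 Genomes filter: nonsyn/syn = 1.30
# HapMap (MAF >= 5%), Ensembl VEP: nonsyn/syn = 0.98
```

The ratio drops from about 1.3 to just under 1 when only common variants are considered, which is the direction the selection argument predicts.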

ADD REPLYlink written 6.0 years ago by Ali R. Vahdati180

In my study, I took from 1000G only the most frequent SNPs, with a MAF of 0.1 and above, for the OXPHOS complexes, and there are still differences in the numbers. Perhaps it has something to do with the process itself (OXPHOS) that causes that to happen.

 

ADD REPLYlink written 2.6 years ago by maria479020
4
6.0 years ago by
United Kingdom
zam.iqbal.genome1.6k wrote:

Does the "Validated by 1000 Genomes" filter mean validated, or called? 1000 Genomes chose to experimentally validate a very specific set of SNPs, which was not intended to be a random sample of all SNPs.

I think what you want to do is use a "discovered by 1000g" or "called in 1000g" filter. Your numbers are too low to be the total number of SNPs called in 1000g, which runs into the millions. So, bottom line: you were quite right to be suspicious, and although I have not checked specifically, I'm sure you'll find more non-syn than syn SNPs once you download the correct set. I'm not a UCSC expert, so I'm afraid I can't help you actually do it, though. Best, Zam

ADD COMMENTlink written 6.0 years ago by zam.iqbal.genome1.6k

I am not completely sure, but I think that what they call "validated by 1000 Genomes" actually means called by 1000 Genomes. The reason is that even in their paper (2010) they found 68,300 non-synonymous SNPs, 34,161 of which were novel. So the number of SNPs is probably so much lower than millions because it is limited to non-synonymous or synonymous ones. Also, when I downloaded SNP data from Biomart, regardless of validation status, I found only 530427 unique non-synonymous SNPs.

ADD REPLYlink written 6.0 years ago by Ali R. Vahdati180

Excellent point.

ADD REPLYlink written 6.0 years ago by Davy360