Nomenclature of SNPs
5
8
13.4 years ago
Bio_X2Y ★ 4.4k

andrea_bio's question has got me thinking about SNPs in a bit more detail, and I find myself confused. I'm hoping someone can help clarify some of the nomenclature being used around SNPs/alleles - I'm finding it difficult to track down exact definitions.

In relation to SNPs, if there is a base that is known to take three different values, one of which is the reference value appearing in the reference genome (e.g. A in reference, T and G are known variants), do we:

(a) have 2 different SNPs (e.g. A->T, A->G)

OR

(b) have just one SNP, which happens to have 3 different alleles? (e.g. A, T, G)

Also, NCBI's SNP primer page says that "On average, SNPs occur in the human population more than 1 percent of the time." Could somebody possibly clarify what is meant by this in more exact language? e.g. does it imply that the term SNP is only applied to variants if they are more common than 1% of the population?

Thanks for your time.

snp maf • 9.0k views
9
13.4 years ago

This is easier to understand if you know how the large genotype panels are designed and how tag SNPs are chosen.

Let's take HapMap as an example. Since they could not afford to genotype every SNP in the genome, they had to choose which tag SNPs to include on the chips they were using. To choose these SNPs, they first genotyped a larger number of SNPs in a smaller subset of individuals, all of whom were European. This is referred to as the 'SNP discovery' phase.

In this phase, they only chose SNPs with a minor allele frequency (MAF) > 0.1% in the smaller European dataset, i.e. the less common allele had to have a frequency above 0.1%. Of course, this definition introduces a lot of problems: the same SNPs may have a different frequency in other populations, and many SNPs that are present only in other populations are not included. This set of problems is usually referred to as the 'ascertainment bias' of the SNP panel.

Moreover, SNPs are usually chosen only if they are biallelic in the discovery dataset. This simplifies the analysis and decreases the cost of the chips; however, in populations not included in the discovery dataset, some of the SNPs may have a third allelic variant. When you are working with SNP data, a common first step is to remove all the multi-allelic SNPs, which are not considered real SNPs.

The 1000 Genomes Project has solved a lot of these problems, because they have genotyped almost all the SNPs they have found, even those with MAF < 0.1%.

The phrase that you quoted from the NCBI primer, "On average, SNPs occur in the human population more than 1 percent of the time", means that the SNP has a MAF > 1% in the SNP discovery panel.
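To make the MAF check concrete, here is a minimal Python sketch. It assumes diploid genotypes written as two-letter strings and takes "minor allele" to mean the second most common allele; the panel data and the function names are made up for illustration and are not taken from HapMap.

```python
from collections import Counter

def allele_frequencies(genotypes):
    """Return {allele: frequency} from diploid genotypes such as "AT"."""
    counts = Counter(allele for gt in genotypes for allele in gt)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}

def minor_allele_frequency(genotypes):
    """Frequency of the second most common allele; 0.0 if monomorphic."""
    freqs = sorted(allele_frequencies(genotypes).values(), reverse=True)
    return freqs[1] if len(freqs) > 1 else 0.0

# Hypothetical discovery panel: 100 individuals, 3 of them heterozygous A/T.
panel = ["AA"] * 97 + ["AT"] * 3
maf = minor_allele_frequency(panel)   # 3 T alleles out of 200 chromosomes
passes_threshold = maf > 0.01         # the "1% rule", applied to this panel only
```

The result depends entirely on which individuals are in `panel`, which is the ascertainment bias described above.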

Some literature:

0

Thanks for the detailed response! Both links are pointing to the same place at the moment, maybe you could update the first one? I don't think I have permission to edit.

0

Oops! Thank you for pointing that out, I've fixed it now.

5
13.4 years ago
Andrea_Bio ★ 2.8k

Hello,

Question 1: The notation used by dbSNP, and the new notation that Ensembl will adopt soon (according to a recent email conversation I had with them), is that you have one SNP with three alleles, described as A/T/G.
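As a rough illustration of option (b) from the question, one record can hold all the alleles and still produce the pairwise substitutions of option (a) when needed. This is a minimal Python sketch; the `Snp` class and its methods are hypothetical and do not correspond to any dbSNP or Ensembl API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snp:
    """One variant site: a reference allele plus its alternative alleles."""
    ref: str
    alts: tuple

    def allele_string(self):
        # Reference allele first, then the others, e.g. "A/T/G".
        return "/".join((self.ref,) + tuple(self.alts))

    def substitutions(self):
        # The same site viewed as separate ref->alt changes.
        return [f"{self.ref}->{alt}" for alt in self.alts]

    def is_multiallelic(self):
        # More than one distinct non-reference allele.
        return len(set(self.alts) - {self.ref}) > 1

snp = Snp(ref="A", alts=("T", "G"))
label = snp.allele_string()      # "A/T/G"
changes = snp.substitutions()    # ["A->T", "A->G"]
```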

Question 2: I have asked this question on here myself from a different perspective. A variation is a SNP if it occurs in 1% of the population. How big does the population have to be before you can say that an allele is present in 1% of the population? If I have a population where n=20 and all individuals have a specific variation allele, I don't think you could class that as a SNP; it's just an infrequent variation (I have also asked another question about the difference between a SNP and an infrequent variation too!). I didn't get an answer! But I believe many of the 'SNPs' in SNP databases are variations that have merely been observed, rather than actual 'SNPs' that meet the 1% rule.

0

The 1% MAF is calculated in the SNP discovery panel. See my answer.

5
13.4 years ago
lh3 33k

There are 28.8 million SNPs/INDELs in dbSNP132. I believe this is how the 1% number is derived (28.8M / 2.85 Gbp). In principle, it would be good to define this "1%" by accounting for MAF, but a lot of SNPs in dbSNP do not have this information. I believe that sentence in the dbSNP primer is not meant to be precise; it just gives people a very rough idea.
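The arithmetic behind that guess, as a quick Python check using only the two figures quoted above (the variable names are mine):

```python
n_variants = 28.8e6       # SNPs/INDELs in dbSNP132, as quoted above
genome_size_bp = 2.85e9   # genome size used in the estimate above
density = n_variants / genome_size_bp
print(f"{density:.2%}")   # roughly one variant per hundred bases
```

Note that this is a density of variant sites along the genome, not an allele frequency in a population.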

In addition, the "SNP discovery phase" only applies to chip genotyping. G1k has only chip/Sequenom-genotyped a tiny fraction of novel low-frequency SNPs. A new chip is coming, but it still covers only a small fraction of the SNPs discovered by the project. Also, g1k can only access MAF<0.1% SNPs in exomes. In the whole genome, SNPs from only 270 individuals (540 chromosomes, or MAF<0.2%) have been dumped to dbSNP, I believe.
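One way to read the 540-chromosome figure, sketched in Python: with 270 diploid individuals, an allele seen exactly once has a sample frequency of 1/540, so that is the lowest non-zero frequency such a sample can show. This is my interpretation of the numbers, not something stated by the project.

```python
individuals = 270
chromosomes = 2 * individuals           # diploid: two copies per individual
lowest_observable = 1 / chromosomes     # allele seen on a single chromosome
print(f"{chromosomes} chromosomes, singleton frequency {lowest_observable:.4f}")
```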

4
13.4 years ago

dbSNP has included rare variants since at least build 130, due to the load of 1000 Genomes data. The term SNP as a synonym for frequent variation no longer seems to apply to dbSNP as a whole, as it used to. It may once have been true that "on average" the recorded variants were above 1% in the populations studied, but this is no longer the case in the current build 132.

Regarding the nomenclature question, the rule is to cite the reference allele first, followed by the rest of the alleles. I cannot give a more precise description of the order of the non-reference alleles, since it sometimes follows their frequencies and is sometimes just alphabetical.

But let me shed some light on a definition we tend not to dig into: the reference allele. It is the allele found on the reference strand of the reference sequence. This means that if the SNP was reported on the other strand, you may see an A/T/G SNP just because A is the reference allele, when the SNP is really only T/G. You can only be sure a SNP is multi-allelic if you check that its strand corresponds to the reference orientation. For that reason we have adopted the "oriented reference allele" in our research, which is the allele of the reference sequence expressed on the same strand as the SNP. Just something to keep in mind.
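Here is a small Python sketch of that idea, assuming single-base alleles and a strand flag of "+" or "-" for the SNP record. The function names are hypothetical, not from any existing library.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def oriented_reference_allele(ref_forward, snp_strand):
    """Reference base expressed on the same strand as the SNP record."""
    if snp_strand == "+":
        return ref_forward
    if snp_strand == "-":
        return COMPLEMENT[ref_forward]
    raise ValueError(f"unknown strand: {snp_strand!r}")

def alleles_with_reference(ref_forward, snp_alleles, snp_strand):
    """Distinct alleles once the reference is put on the SNP's strand."""
    oriented = oriented_reference_allele(ref_forward, snp_strand)
    return sorted(set(snp_alleles) | {oriented})

# Reference base is A on the forward strand; the SNP was reported as T/G
# on the reverse strand.
naive = sorted({"A"} | {"T", "G"})                         # ignores strand
oriented = alleles_with_reference("A", ["T", "G"], "-")    # strand-aware
```

In this example the naive merge gives three alleles, while the strand-aware version gives only T and G, because the forward-strand A corresponds to T on the SNP's strand.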

3
13.4 years ago

The differences between a SNP, a mutation, and a rare or private variant are not defined by clear boundaries of allele frequency. I could give you some numbers I heard from one expert, and then turn around and give you different numbers that I heard from another person in the field. NCBI may give you a "1% rule", but for which population is that? Perhaps an allele at a polymorphic position is seen at a frequency of 0.008 in a group with European ancestry but at a frequency of 0.022 in a group from Africa. It makes little sense to say that this is not a SNP in Europeans but is a SNP in (some) Africans. The position is polymorphic (the P in SNP), so I would call it a SNP and not get hung up on strict definitions.
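To see how a strict cutoff behaves with those two frequencies, here is a tiny Python sketch. The 1% threshold and the population labels come from the example above; the variable names are mine.

```python
THRESHOLD = 0.01  # the "1% rule"

# Frequencies of the same allele in the two groups from the example above.
freqs = {"European ancestry": 0.008, "African ancestry": 0.022}

# Applying the cutoff per population gives two different answers
# for one and the same polymorphic position.
calls = {pop: freq >= THRESHOLD for pop, freq in freqs.items()}
```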
