8.1 years ago by
Santiago de Compostela, Spain
the definition of SNP has been fixed for years: a single-base variation in the genome that is established within a population at a frequency of >1%. this definition tried to convey the idea that certain sites in the genome were more "flexible" than others, capable of segregating to the offspring, and for that reason they could be measured in population-genetics terms through allele frequencies and haplotypes. that is what a SNP is and has always been, although I agree that the way the term SNP is used in different databases may differ from this original definition.
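to make the >1% rule concrete, here is a minimal Python sketch. it assumes a biallelic site and a plain list of observed alleles (one entry per chromosome sampled); the function names and the "rare variant" label are my own, not something taken from dbSNP or HapMap.

```python
from collections import Counter

# Classical cutoff from the definition above: minor allele frequency > 1%.
SNP_FREQ_THRESHOLD = 0.01


def minor_allele_frequency(alleles):
    """Frequency of the second most common allele (0.0 if only one allele is seen).

    Assumes a biallelic site; with more than two alleles this still reports
    the second most common one.
    """
    counts = Counter(alleles).most_common()
    if len(counts) < 2:
        return 0.0
    return counts[1][1] / len(alleles)


def classify_site(alleles, threshold=SNP_FREQ_THRESHOLD):
    """Label a site as monomorphic, a SNP (MAF > threshold) or a rare variant."""
    maf = minor_allele_frequency(alleles)
    if maf == 0.0:
        return "monomorphic"
    return "SNP" if maf > threshold else "rare variant"


if __name__ == "__main__":
    # Hypothetical sample of 1000 chromosomes, 50 of them carrying the G allele.
    site = ["A"] * 950 + ["G"] * 50
    print(minor_allele_frequency(site), classify_site(site))
```

note that the label depends on the sample: a variant seen once in a handful of genomes cannot be told apart from a common one, so the estimate only becomes meaningful with enough chromosomes.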
this definition, which was possibly coined as a reference around 2000, did in fact work at the time it was defined, since the only SNP knowledge available came from the sequencing of a very small number of human samples, and for that reason plenty of assumptions about frequency had to be made (this is why the term "common variant" had to arise: it was almost impossible to determine which variants belonged to one individual only and which ones could be shared with others). but now, through NGS, we are able to find variants that don't necessarily have to be common, and for that reason we are moving from genotyping SNPs (we look for DNA variation at particular positions we already know) to sequencing variants (we read the genome and describe how it differs from an agreed reference).
the main problem behind the discordance between databases is that their scope may not be the same, so one has to know where the data comes from and why it has been stored there. this is why you can't directly compare dbSNP with HapMap: HapMap data comes from the typing of ~4M already defined sites, trying to determine as well as possible the haplotype distribution in humans, whereas dbSNP is "just" a variation repository. major dbSNP updates have in fact come from HapMap data, but once those ~4M sites are known, all that dbSNP can obtain from HapMap is "just" population data, not new SNPs.
so, to conclude, let me get back to the genotyping vs. sequencing issue. HapMap has maintained the original SNP concept throughout its database since it started, because what it does is genotyping, and it already knows that the ~4M sites it studies are indeed polymorphic. but dbSNP, since build 130, started to accept submissions from 1000 Genomes, which contained rare variants since the technique used is now sequencing. this has lowered the frequency threshold for the variations present in dbSNP from a minimum of ~1% to ~0.1%, so what you find in dbSNP doesn't have to be strictly a SNP. this is why the term SNV (single nucleotide variation) is getting favoured over the term SNP.
as a final suggestion, you definitely have to be aware of which database you are working with and which kind of data is stored in it. knowing why that data was stored there will also help you understand what you may extract from it and how you may use those results.