In what percentage of a population must a base position/locus be invariant before it is classed as conserved, or as a null SNP, to use a dbSNP definition?
Also, when people say a SNP must be present in more than 1% of the population before it is classed as a SNP, what is the minimum size of the population? (The question of population size also applies to the definition of the null SNP.)
I also presume many of the 'SNPs' identified at present can't truly be SNPs as 1% of the population hasn't been sequenced.
What definition is the 1000 genomes project using to define a SNP? Are they looking for a variation in 10 individuals? I was also wondering whether they have identified any null SNPs. I will be reading the paper they just released later today to find out more, but I didn't think it would do any harm to ask while I was posting this question.
A SNP is not and should not be defined by frequency. I used to see SNP definitions by frequency like yours, but I insist such a definition is scientifically crippled: without knowing the true frequency, you cannot prove whether a site is a SNP. I tend to simply define a SNP as a site that varies among the individuals of a species.
I do not think it is possible to define a "null SNP" (Google does not know the term, either). We may say a site is invariant among the samples we are looking at, but we almost never know whether it would turn out to be variant if we increased the sample size.
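To put a rough number on that last point, here is a minimal sketch of how likely it is that a real variant goes completely unseen in a sample. It assumes chromosomes are sampled at random and independently, and that sequencing detects every allele present; the function name `prob_variant_unseen` is made up for illustration.

```python
def prob_variant_unseen(allele_freq: float, n_individuals: int, ploidy: int = 2) -> float:
    """Probability that an allele with population frequency `allele_freq`
    is absent from every chromosome in a random sample."""
    if not 0.0 <= allele_freq <= 1.0:
        raise ValueError("allele_freq must be between 0 and 1")
    if n_individuals < 0:
        raise ValueError("n_individuals must be non-negative")
    n_chromosomes = ploidy * n_individuals
    # Each sampled chromosome independently lacks the allele with prob. (1 - f).
    return (1.0 - allele_freq) ** n_chromosomes


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, prob_variant_unseen(0.01, n))
```

Under these assumptions, a site that looks invariant in a handful of individuals can easily hide a 1% allele, and the chance of missing it shrinks quickly as more individuals are added.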
The 1000 genomes project provides several files to indicate the regions where SNPs are called. You can assume non-SNP sites in those regions are invariant among samples, but that is very inaccurate.
Note that dbSNP takes the looser 'variation' definition for SNPs, so there is no requirement or assumption about minimum allele frequency.
Then sadly we can never confidently say whether a site is a SNP, because that would require sequencing every human individual. We can only say something like: there is a 0.001% chance that the site is a SNP. Anyway, biology is not mathematics...
I work primarily with non-model plant species, not human, but we've done quite a bit of SNP calling from RNA-seq data. Our SNP-finding software includes two parameters you brought up in your question: a filter for invariant percentage and a filter for coverage. If the percentage of the invariant is extremely low in comparison to the major variant, then we can't be sure it's not due to something like sequencing errors. We also don't have much confidence in a SNP call if the sequencing coverage of that particular nucleotide is too low. For the amount of data we have, we've been using numbers around 8-12x for our coverage filter and 15-30% for the invariant percentage filter. I presume that projects on the scale of the 1000 genomes project can get away with higher coverage filters and lower percentage filters (like the 1% you described), but the choice of these parameter values is going to depend heavily on the amount of data you have in a given project.
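As a rough illustration of how two such filters interact, here is a minimal sketch. It is not the software mentioned above: the function `call_site` and its parameters are hypothetical, and it assumes the percentage filter applies to the fraction of reads supporting the second most common base at the site.

```python
from collections import Counter


def call_site(bases, min_coverage=10, min_minor_fraction=0.2):
    """Return (major, minor) bases if the site passes both filters, else None.

    `bases` is the collection of base calls observed at one position,
    e.g. the string "AAAAAAAGGG" for ten reads.
    """
    coverage = len(bases)
    if coverage < min_coverage:
        return None  # too few reads to trust any call

    top_two = Counter(bases).most_common(2)
    if len(top_two) < 2:
        return None  # only one base observed: invariant in this sample

    (major, _major_count), (minor, minor_count) = top_two
    if minor_count / coverage < min_minor_fraction:
        return None  # minor base too rare; could be sequencing error
    return major, minor


if __name__ == "__main__":
    print(call_site("AAAAAAAGGG"))    # passes both filters
    print(call_site("AAAAG"))         # fails the coverage filter
    print(call_site("A" * 19 + "G"))  # fails the percentage filter
```

The defaults of 10x and 20% are simply picked from inside the ranges quoted above; with deeper data the percentage threshold could presumably be lowered.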
I have heard of null alleles, but null SNPs are news to me. There is a reference (although the term "reference" is itself problematic) and there are variants. If a single nucleotide variant (SNV) is found commonly in the population, then it counts as a polymorphism, making it a SNP. The population size is important, but only to make the frequency data meaningful. Generally, 200 individuals is the lower threshold before talking about a SNP frequency in a population. The more individuals you genotype the better, but 200 is generally enough.
You don't need to sequence 1% of the population to determine a SNP's frequency in that population. There is something called sampling in statistics, and if you follow sound sampling methods in your studies you can safely generalize your results to the statistical universe (in this case, the population in question).
You should check the 1000 genomes project web site for more information on the pilots and on how many individuals are being sequenced. I can tell you that the 1000 genomes project has more than 1000 samples, and that all of its variation data is annotated with the number of individuals each variant is found in.
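To make the sampling argument concrete, here is a minimal sketch that turns an allele count from a sample into a confidence interval for the population frequency. The Wilson score interval is my choice for illustration, not something prescribed above, and it assumes chromosomes are drawn at random and independently; the function name is hypothetical.

```python
import math


def wilson_interval(alt_count, n_chromosomes, z=1.96):
    """Wilson score interval for an allele frequency.

    alt_count     -- number of sampled chromosomes carrying the variant
    n_chromosomes -- total chromosomes sampled (2 per diploid individual)
    z             -- normal quantile; 1.96 gives roughly a 95% interval
    """
    if n_chromosomes <= 0 or not 0 <= alt_count <= n_chromosomes:
        raise ValueError("need 0 <= alt_count <= n_chromosomes and n_chromosomes > 0")
    p_hat = alt_count / n_chromosomes
    denom = 1 + z ** 2 / n_chromosomes
    centre = (p_hat + z ** 2 / (2 * n_chromosomes)) / denom
    half_width = (
        z
        * math.sqrt(
            p_hat * (1 - p_hat) / n_chromosomes
            + z ** 2 / (4 * n_chromosomes ** 2)
        )
        / denom
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


if __name__ == "__main__":
    # 200 diploid individuals = 400 chromosomes, 8 of them carrying the variant
    print(wilson_interval(8, 400))
```

The interval narrows as more chromosomes are sampled, which is the practical sense in which a few hundred individuals can stand in for a whole population.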
Hope this was helpful.
The term SNP will soon be deprecated in favour of "frequent variation". Although the term SNP gives a quick and precise idea of what it means, we have realized that defining something by its frequency (which obviously varies among populations) is probably not the best thing to do. All the variations observed by the 1000 genomes project are therefore labeled as "variants", but in my opinion SNP will still be used as a useful "visual threshold" and a short synonym for that "frequent variation", in contrast to "mutation" or "very rare variation".