I am trying to get a handle on the quality of submissions to dbSNP. The list of validation statuses is given as:
multiple independent submissions;
frequency or genotype data;
observation of all alleles in at least two chromosomes;
genotyped by HapMap;
sequenced in the 1000 Genomes Project
Points 1, 5 and 6 seem fairly reliable, but I am interested in point 2/4.
How accurate does the genotype data need to be in point 2? For example, we carried out sequencing work and found potential SNPs, only to find on resequencing that they were erroneous. This data was from pooled DNA, but we did have high-quality counts for both alleles with good coverage.
Edit: it has been pointed out by DQ (thank you) that most genotypes are confirmed by Sanger sequencing. Can I assume that the genotype and allele frequencies in dbSNP are based on genotypes confirmed by a method such as Sanger sequencing?
"Validation by HapMap" in dbSNP simply means that a SNP was genotyped in HapMap (phase 1 & 2 over
270 samples, phase 3 over 1115 samples (not in dbSNP yet)... You should therefore look at
"Validation by HapMap" in conjunction with "Validation by Frequency" to verify that the SNP’s minor
allele has been observed at least twice.
Validation by Frequency includes both population frequency data AND genotype
data. In fact, the number of SNPs that have genotype data is bigger than the
number of SNPs with only population frequency data. We compute frequency
based on genotype data.
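To make "we compute frequency based on genotype data" and "minor allele has been observed at least twice" a bit more concrete, here is a minimal Python sketch. It is not dbSNP's actual pipeline: the genotype format (strings such as "A/G") and the function names are assumptions made purely for illustration.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles over diploid genotypes written as 'A/G' (assumed format)."""
    counts = Counter()
    for genotype in genotypes:
        # Each genotype contributes one allele per chromosome.
        counts.update(genotype.split("/"))
    return counts


def allele_frequencies(genotypes):
    """Derive allele frequencies from the genotype-level allele counts."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def minor_allele_seen_twice(genotypes):
    """True if at least two alleles exist and the rarest is seen on >= 2 chromosomes."""
    counts = allele_counts(genotypes)
    if len(counts) < 2:
        return False  # monomorphic in this sample, so there is no minor allele
    return min(counts.values()) >= 2


# Hypothetical genotypes for four individuals at one SNP.
example = ["A/A", "A/G", "A/A", "G/G"]
print(allele_frequencies(example))
print(minor_allele_seen_twice(example))
```

Note that a check like this only says the minor allele was called more than once; it says nothing about whether those calls were confirmed by an independent method, which is the concern with pooled sequencing data.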
Not an expert here myself, but this may point you in the right direction and make these codes a bit less cryptic.