Question: Errors in dbSNP
8.0 years ago, Pi wrote:

Greetings,

I am trying to get a handle on the quality of submissions to dbSNP. The validation statuses are listed as:

  1. multiple independent submissions;
  2. frequency or genotype data;
  3. submitter confirmation;
  4. observation of all alleles in at least two chromosomes;
  5. genotyped by HapMap;
  6. sequenced in the 1000 Genomes Project

Points 1, 5 and 6 seem fairly reliable, but I am interested in points 2 and 4.

How accurate does the genotype data need to be for point 2? For example, we have carried out sequencing work and found potential SNPs, only to find on resequencing that they were erroneous. This data was from pooled DNA, but we did have high-quality counts for both alleles with good coverage.

Edit: it has been pointed out by DQ (thank you) that most genotypes are confirmed by Sanger sequencing. Can I assume that the genotype and allele frequencies in dbSNP are based on genotypes confirmed by a method such as Sanger sequencing?

Thank you for your time

8.0 years ago, David Quigley (San Francisco) wrote:

The gold standard, and the simplest method, for validating one or a few candidate SNPs is Sanger sequencing. The Wikipedia entry for dbSNP has a long paragraph about data quality, with some citations you may find helpful. The NCBI has information about the validation statuses. See also this table. From NCBI:

"Validation by HapMap" in dbSNP simply means that a SNP was genotyped in HapMap (phase 1 & 2 over 270 samples, phase 3 over 1115 samples (not in dbSNP yet)...You should therefore look at "Validation by HapMap" in conjunction with "Validation by Frequency" to verify that the SNP’s minor allele has been observed at least twice

and

Validation by Frequency includes both population frequency data AND genotype data. In fact, the number of SNPs that have genotype data is bigger than the number of SNPs with only population frequency data. We compute frequency based on genotype data.

etc...

I'm not an expert here myself, but this may point you in the right direction and make these codes a bit less cryptic.
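To make "we compute frequency based on genotype data" and "observed at least twice" a little more concrete, here is a minimal Python sketch. It is not dbSNP's actual pipeline: the `A/G` genotype string format and the function names are assumptions made up for illustration, and it says nothing about how the genotypes themselves were called or confirmed.

```python
from collections import Counter


def allele_counts(genotypes):
    """Count alleles (chromosomes) across diploid genotypes such as 'A/G'.

    The 'A/G' string format is an assumption for this sketch.
    """
    counts = Counter()
    for genotype in genotypes:
        for allele in genotype.split("/"):
            counts[allele] += 1
    return counts


def allele_frequencies(genotypes):
    """Derive allele frequencies from genotype data."""
    counts = allele_counts(genotypes)
    total = sum(counts.values())
    return {allele: n / total for allele, n in counts.items()}


def all_alleles_seen_twice(genotypes):
    """True if at least two alleles exist and each is seen on >= 2 chromosomes."""
    counts = allele_counts(genotypes)
    return len(counts) >= 2 and min(counts.values()) >= 2


# Hypothetical genotypes for four individuals at one SNP.
samples = ["A/A", "A/G", "G/G", "A/A"]
print(allele_frequencies(samples))
print(all_alleles_seen_twice(samples))
```

Note that a single heterozygous call (for example `["A/A", "A/G", "A/A"]`) would not pass the "at least two chromosomes" check, because the minor allele is seen only once.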



Thanks for your answer. I am assuming, then, that when they say an allele has to have been seen in at least two chromosomes, they mean by a technique as reliable as Sanger sequencing. I've looked at NGS data for individuals who look like they could be heterozygous but turn out to be homozygous for either the reference allele or the novel allele, so that data in itself isn't reliable. I couldn't find anything that said what detection method was 'reliable enough'.

Comment written 8.0 years ago by Pi