Question: What Is The Significance Of The Population-Specific dbSNP132 VCF Files?
8.3 years ago, Epowell (20) wrote:

I am trying to screen my NGS variants against known variations. The dbSNP132 VCF files seem like a great resource, but I'm not certain if I understand what is in them.

Specifically, what is the significance of the population-specific files? Do they refer to HapMap populations? For example, does this file (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/01-1409-CEU-nogeno.vcf.gz) contain all the chromosome 1 SNPs from the CEU HapMap population, in addition to any 1000 Genomes and dbSNP132 SNPs that were also found in this population?

vcf hapmap population dbsnp genome • 3.0k views
modified 8.3 years ago by Jorge Amigo (11k) • written 8.3 years ago by Epowell (20)
8.3 years ago, Jorge Amigo (11k), Santiago de Compostela, Spain, wrote:

dbSNP contains SNP data from different submitters, with HapMap and 1000 Genomes being the most important ones when it comes to population information. If you query any SNP through dbSNP's web interface, say rs6059134, you will see that the database includes information from several populations.

Since I presume you are wondering what exactly is inside the files on the dbSNP VCF FTP site, I can only suggest that you check the README file on that site and look at its file naming convention example. It shows that all the numeric population codes in that list correspond to HapMap data only.

14-12162-MKK.vcf.gz

Chr number => 14

dbSNP population ID => 12162 (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_viewTable.cgi?pop=12162)

Three letter population identifier => MKK (http://ccr.coriell.org/sections/collections/NHGRI/?SsId=11)
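
The naming convention above can be sketched as a small parser. This is only an illustration: the function name and the regular expression are assumptions inferred from the README example (14-12162-MKK.vcf.gz) and from the 01-1409-CEU-nogeno.vcf.gz file mentioned in the question, so other file name shapes on the FTP site may not match.

```python
import re
from typing import NamedTuple


class PopulationVcfName(NamedTuple):
    chromosome: str        # e.g. "14"
    population_id: int     # dbSNP population ID, e.g. 12162
    population_code: str   # three-letter population identifier, e.g. "MKK"


# Assumed pattern: <chr>-<dbSNP population ID>-<three-letter code>[-nogeno].vcf.gz
_NAME_PATTERN = re.compile(
    r"^(?P<chrom>[0-9A-Za-z]+)-(?P<pop_id>\d+)-(?P<code>[A-Z]{3})(?:-nogeno)?\.vcf\.gz$"
)


def parse_population_vcf_name(filename: str) -> PopulationVcfName:
    """Split a population-specific VCF file name into its three parts.

    Hypothetical helper; raises ValueError for names that do not follow
    the assumed pattern (for example the full build 00-All.vcf.gz).
    """
    match = _NAME_PATTERN.match(filename)
    if match is None:
        raise ValueError(f"not a population-specific VCF name: {filename!r}")
    return PopulationVcfName(match["chrom"], int(match["pop_id"]), match["code"])


if __name__ == "__main__":
    print(parse_population_vcf_name("14-12162-MKK.vcf.gz"))
```

The population ID it returns is the number used in the snp_viewTable.cgi?pop=... link above, and the three-letter code is the HapMap population label.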

written 8.3 years ago by Jorge Amigo (11k)

Jorge: You are absolutely correct (and also much better at interpreting the README than I was).

Now, a second question I had was: what is in the so-called "full build" 00-All.vcf.gz? Presumably, it contains the SNPs from all the listed populations, but what about SNPs that did not originate from a HapMap population? And what about 1000 Genomes SNPs? Does this file contain EVERYTHING in dbSNP132?

p.s. I'm happy to post this as a brand new question if that would be better.

Reply written 8.3 years ago by Epowell (20)

The README file states that 00-All.vcf.gz contains "a full build dump", although it doesn't describe it any further. It looks like it is a symlink to the 00-All.vcf.gz file in the ByChromosomeNoGeno folder, so I presume that it contains HapMap SNPs only. If you want to work with all dbSNP information, I guess you will have to consider using the full chromosome reports instead. You will find them at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts/

Reply written 8.3 years ago by Jorge Amigo (11k)

Actually, the README inside the ByChromosomeNoGeno folder (ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/00-00-README.txt) says: "This directory contains VCF files for each chromosome and HapMap populations available in dbSNP as well as a file containing all SNPs in dbSNP with the excetion of microsatellites, named variations, and other multi-byte variations where the adjacent nucelotides are unknown." Maybe the 00-All.vcf.gz file is the one described as "containing all SNPs in dbSNP". What do you think? Presumably, that would include 1000 Genomes, too.

Reply written 8.3 years ago by Epowell (20)

It would be fairly simple to check that: if the number of SNPs in that file is ~4M, it holds HapMap SNPs only; if it is ~28M, it contains all the known SNPs in dbSNP132. The README file suggests the latter, so go for it and let us know ;)
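
One way to run that check is to count the data lines in the gzipped VCF, as in the minimal sketch below. The function name is made up for illustration, and it assumes the file has been downloaded locally and holds one variant per non-header line.

```python
import gzip


def count_vcf_records(path: str) -> int:
    """Count data lines in a gzipped VCF, skipping headers and blank lines.

    Assumption: each non-header line is one variant record, so the total
    approximates the number of SNPs in the file.
    """
    count = 0
    with gzip.open(path, "rt") as handle:
        for line in handle:
            # Meta-information (##...) and the column header (#CHROM...) start with '#'.
            if line.strip() and not line.startswith("#"):
                count += 1
    return count


if __name__ == "__main__":
    # Hypothetical local copy of the full build file from the FTP site.
    print(count_vcf_records("00-All.vcf.gz"))
```

A total near 4 million would point to HapMap only, and one near 28 million to the whole of dbSNP132.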

Reply written 8.3 years ago by Jorge Amigo (11k)