I am trying to screen my NGS variants against known variations. The dbSNP132 VCF files seem like a great resource, but I'm not certain if I understand what is in them.
Specifically, what is the significance of the population-specific files? Are these referring to HapMap populations? For example, does this file (ftp://ftp.ncbi.nlm.nih.gov/snp/organisms/human_9606/VCF/v4.0/ByChromosomeNoGeno/01-1409-CEU-nogeno.vcf.gz) contain all the chromosome 1 SNPs from the CEU HapMap population, in addition to any 1000Genomes and dbSNP132 SNPs that were also found in this population?
Jorge: You are absolutely correct (and also much better at interpreting the README than I was).
Now, a second question I had was: what is in the so-called "full build" 00-All.vcf.gz? Presumably, it contains the SNPs from all the listed populations, but what about SNPs that did not originate from a HapMap population? And what about 1000Genomes SNPs? Does this file contain EVERYTHING in dbSNP132?
p.s. I'm happy to post this as a brand new question if that would be better.
The README file states that 00-All.vcf.gz contains "a full build dump", although it doesn't describe it any further. It looks like it is a symlink to the 00-All.vcf.gz file contained in the ByChromosomeNoGeno folder, so I presume that it contains HapMap SNPs only. If you want to work with all dbSNP information, I guess you will have to use the full chromosome reports instead. You will find them at ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/chr_rpts/
Actually, in the README inside the ByChromosomeNoGeno folder it says that:
Maybe the 00-All.vcf.gz file is the one mentioned as "containing all SNPs in dbSNP". What do you think? Presumably, that would include 1000Genomes, too.
It would be fairly simple to check that: if the number of SNPs in that file is ~4M, these are HapMap only; if it is ~28M, it would contain all the known SNPs in dbSNP132. The README file suggests the latter, so go for it and let us know ;)
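For anyone who wants to run that check, here is a minimal sketch in Python. It assumes 00-All.vcf.gz has already been downloaded locally and that every non-header line is one variant record; the function name `count_vcf_records` is made up for illustration. Note that a record count is only an approximation of the SNP count, since the file may also hold indels or other variant types.

```python
import gzip


def count_vcf_records(path):
    """Count data lines (one per variant record) in a gzipped VCF file.

    Header lines start with '#' and are skipped, as are blank lines.
    """
    count = 0
    # "rt" opens the gzip stream in text mode so lines come back as str
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Calling `count_vcf_records("00-All.vcf.gz")` on the local copy and comparing the result against the ~4M and ~28M figures above should tell the two cases apart.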