How do I get a VCF containing germline SNPs from dbSNP?
6.1 years ago
b10hazard ▴ 30

I'm trying to get a VCF file containing germline SNPs from NCBI's databases. This page says that I want the common_no_known_medical_impact.vcf.gz file and that I can find it at...

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/

However, that directory only lists the common_all.vcf.gz file. Where can I get the common_no_known_medical_impact.vcf.gz file? Finding anything on NCBI's FTP site seems like an exercise in futility.

dbsnp ncbi vcf • 4.7k views
6.1 years ago

The file that contains all ~13 million SNPs is (direct link): 00-All.vcf.gz

  • size is ~7 gigabytes
  • dbSNP version 150
  • GRCh37 / hg19 co-ordinates
  • tab index (direct link) 00-All.vcf.gz.tbi

The equivalent version (dbSNP 150) for GRCh38 / hg38 is (direct link): 00-All.vcf.gz (tab index 00-All.vcf.gz.tbi)
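Once 00-All.vcf.gz is downloaded, something like the following could pull out just the common SNVs. This is only a minimal sketch: it assumes the dbSNP VCF carries `VC` (variation class) and `COMMON` keys in the INFO column, which you should confirm against the `##INFO` header lines of your own copy. The function names `parse_info`, `common_snvs` and `open_vcf` are made up for illustration.

```python
import gzip
from typing import Iterable, Iterator


def parse_info(info: str) -> dict:
    """Turn a VCF INFO string such as 'RS=1;VC=SNV;COMMON=1' into a dict."""
    fields = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        fields[key] = value if sep else True  # flag entries have no '=value'
    return fields


def common_snvs(lines: Iterable[str]) -> Iterator[tuple]:
    """Yield (chrom, pos, rsid, ref, alt) for records tagged VC=SNV and COMMON=1.

    Assumption: the file uses the INFO keys VC and COMMON; check the header.
    """
    for line in lines:
        if line.startswith("#"):  # skip meta-information and column header
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 8:  # not a complete VCF record
            continue
        info = parse_info(cols[7])
        if info.get("VC") == "SNV" and info.get("COMMON") == "1":
            yield cols[0], int(cols[1]), cols[2], cols[3], cols[4]


def open_vcf(path: str):
    """Open a plain-text or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


# Example usage (the path is wherever you saved the download):
# with open_vcf("00-All.vcf.gz") as handle:
#     for record in common_snvs(handle):
#         print(*record, sep="\t")
```

This streams the file line by line, so it does not need the .tbi index; the index only matters if you want to query specific regions with an indexed reader.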

If you go to ftp://ftp.ncbi.nih.gov/snp/organisms/ , you can get information for many different species.

------------

For the clinically-related variants for which you are specifically searching, the file that you mentioned is no longer being produced. ClinVar now encodes evidence codes within the distributed VCFs. See the README in the vcf_GRCh37 directory found at: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/

[look at the section entitled 'CHANGES MADE IN THE NEW FORMAT (2.0)']

That said, the file that you want is available in the archives. For example, take a look at any of the yearly releases for ClinVar version 1.0:

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_1.0/

Kevin


Excellent! Thanks for the links and the wonderful explanation!


Very useful ! Thanks !!


One more question. Are there redundant entries in dbSNP? I was trying to parse the common_no_known_medical_impact_20170905.vcf.gz file I downloaded from the links you posted, but this file has about 38 million entries. dbSNP is only supposed to have 13 million, right?
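One way to check for redundancy directly is to count the data records and see how many IDs occur more than once. The sketch below is an illustration only; `summarise_ids` and `summarise_file` are hypothetical helper names, and it assumes the ID sits in the third tab-separated column, as in a standard VCF.

```python
import gzip
from collections import Counter
from typing import Iterable


def summarise_ids(lines: Iterable[str]) -> dict:
    """Count data records, distinct IDs, and IDs seen more than once."""
    id_counts = Counter()
    records = 0
    for line in lines:
        if line.startswith("#"):  # skip header lines
            continue
        cols = line.split("\t", 3)  # only CHROM, POS and ID are needed
        if len(cols) < 3:
            continue
        records += 1
        id_counts[cols[2]] += 1
    repeated = sum(1 for n in id_counts.values() if n > 1)
    return {
        "records": records,
        "distinct_ids": len(id_counts),
        "repeated_ids": repeated,
    }


def summarise_file(path: str) -> dict:
    """Run summarise_ids over a gzip/bgzip-compressed VCF."""
    with gzip.open(path, "rt") as handle:
        return summarise_ids(handle)


# Example usage:
# print(summarise_file("common_no_known_medical_impact_20170905.vcf.gz"))
```

Holding tens of millions of IDs in a Counter takes a few gigabytes of memory, so this is best run on a machine with some headroom. If `records` and `distinct_ids` match, the file has no repeated IDs and the larger total simply reflects the size of the database.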


dbSNP is constantly being curated and there are discrepancies in it. I don't fully know the extent of this, though.

I cannot confirm this but, when you think about it, there are 4 possible bases at each position. So, genome-wide (~3 billion positions), there are >10 billion possible bases to consider. For all dbSNP variants, the total would be ~50 million. I don't know if this logic explains the issue that you've found, though.

It says this about your file on the NCBI:

The file common_no_known_medical_impact.vcf.gz was created to provide users with an up-to-date report of common alleles not known to cause clinical phenotypes. This file can be used to subtract variants (filter) from a set of variant calls, thereby narrowing the list of variations that might warrant further evaluation for clinical significance. Should you wish to filter polymorphisms out of your whole genome/exome sequencing results, use the "common_no_known_medical_impact" file.

The "common_no_known_medical_impact.vcf.gz" file and the "clinvar.vcf.gz" file are not mutually exclusive because some variants asserted to be non-pathogenic that were obtained through clinical channels appear in both the "clinvar.vcf.gz" file and the "common_no_known_medical_impact.vcf.gz" file. Records for non-pathogenic variations that were submitted through clinical channels are marked as non-pathogenic and have allele frequencies consistent with a non-pathogenic status.
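The "subtract" use described in that text could be sketched roughly as below. This is an assumption-laden illustration, not NCBI's method: it matches variants on exact (chrom, pos, ref, alt), strips a leading "chr" so that hg19-style and dbSNP-style contig names compare equal, and does no indel normalisation. `variant_keys` and `subtract_common` are hypothetical names.

```python
from typing import Iterable, Iterator, Tuple

Key = Tuple[str, int, str, str]


def variant_keys(lines: Iterable[str]) -> Iterator[Key]:
    """Yield one (chrom, pos, ref, alt) key per ALT allele of each record."""
    for line in lines:
        if line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 5:
            continue
        # Assumption: 'chr1' in the calls and '1' in the filter file are the same contig.
        chrom = cols[0][3:] if cols[0].startswith("chr") else cols[0]
        for alt in cols[4].split(","):
            yield chrom, int(cols[1]), cols[3], alt


def subtract_common(calls: Iterable[str], common: Iterable[str]) -> Iterator[str]:
    """Yield header lines and any call whose alleles are not all in `common`."""
    common_keys = set(variant_keys(common))
    for line in calls:
        if line.startswith("#"):
            yield line  # keep the header intact
            continue
        keys = list(variant_keys([line]))
        if keys and all(key in common_keys for key in keys):
            continue  # every ALT allele is a known common variant: drop it
        yield line


# Example usage with open text handles for the two files:
# for line in subtract_common(calls_handle, common_handle):
#     out.write(line)
```

The whole filter file is loaded into a set, which is simple but memory-hungry for a file with tens of millions of records; for routine use, a dedicated indexed VCF tool is likely the better choice.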

I don't know if that helps any further. All of this is a relatively novel area and I'm not sure how they can truly gauge pathogenic versus benign versus 'functional' with great confidence, given the current state of knowledge.

I do know that there are tools currently out there that attempt to assist in clinical exome variant filtering (and / or non-coding regulatory variants), such as:

This is an area of interest of mine right now, in fact.


Wait, my 13 million figure appears to have stuck in my head from before the release of 1000 Genomes Phase III. dbSNP has since amassed hundreds of millions of SNPs. Here are some release notes for dbSNP version 150: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi?view+summary=view+summary&build_id=150

That explains your finding.
