Question: How do I get a VCF containing germline SNPs from dbSNP?
b10hazard wrote:

I'm trying to get a VCF file containing germline SNPs from NCBI's databases. This page says that I want the common_no_known_medical_impact.vcf.gz file and that I can find it at...

ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b150_GRCh37p13/VCF/

However, that directory only lists the common_all.vcf.gz file. Where can I get the common_no_known_medical_impact.vcf.gz file? Finding anything on NCBI's FTP site seems like an exercise in futility.

modified 3 months ago by Kevin Blighe • written 3 months ago by b10hazard
Kevin Blighe wrote:

The file that contains all ~13 million SNPs is (direct link): 00-All.vcf.gz

  • size is ~7 gigabytes
  • dbSNP version 150
  • GRCh37 / hg19 co-ordinates
  • tabix index (direct link): 00-All.vcf.gz.tbi

The equivalent version (dbSNP 150) for GRCh38 / hg38 is (direct link): 00-All.vcf.gz (tabix index: 00-All.vcf.gz.tbi)

If you go to ftp://ftp.ncbi.nih.gov/snp/organisms/ , you can get information for many different species.
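
Since the question is specifically about germline SNPs, here is a minimal sketch of filtering a downloaded dbSNP VCF down to them. It assumes the INFO keys `VC` (variation class) and `SAO` (variant allele origin, with 1 meaning germline) are present, as described in the header of the dbSNP build 150 VCFs; check the `##INFO` lines of your own download before relying on them. The function names and file paths are hypothetical.

```python
import gzip


def parse_info(info_field):
    """Turn a VCF INFO string such as 'RS=1;VC=SNV;G5' into a dict.

    Flag-style keys with no '=' are stored with the value True.
    """
    info = {}
    for item in info_field.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def is_germline_snv(vcf_line):
    """Return True for a data line tagged as a germline single-nucleotide variant.

    Assumption: VC=SNV marks SNVs and SAO=1 marks germline origin.
    """
    fields = vcf_line.rstrip("\n").split("\t")
    if len(fields) < 8:
        return False
    info = parse_info(fields[7])
    return info.get("VC") == "SNV" and info.get("SAO") == "1"


def filter_germline_snvs(in_path, out_path):
    """Copy header lines and germline SNV records from in_path to out_path."""
    kept = 0
    # bgzip-compressed files are valid gzip, so the gzip module can read them.
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)
            elif is_germline_snv(line):
                dst.write(line)
                kept += 1
    return kept
```

A call such as `filter_germline_snvs("00-All.vcf.gz", "germline_snvs.vcf.gz")` streams the file once and returns the number of records kept. Records with SAO=0 (unspecified) are dropped by this strict filter, so you may want to relax it. The output is plain gzip rather than bgzip, so it would need to be recompressed with bgzip before tabix can index it.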

------------

For the clinically-related variants for which you are specifically searching, the file that you mentioned is no longer being produced. ClinVar now encodes evidence codes within the distributed VCFs. See the README in the vcf_GRCh37 directory found at: ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/

[look at the section entitled 'CHANGES MADE IN THE NEW FORMAT (2.0)']

That said, the file that you want is available in the archives. For example, take a look at any of the yearly releases for ClinVar version 1.0:

ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh37/archive_1.0/

Kevin

modified 3 months ago • written 3 months ago by Kevin Blighe

Excellent! Thanks for the links and the wonderful explanation!

written 3 months ago by b10hazard

One more question. Are there redundant entries in dbSNP? I was trying to parse the common_no_known_medical_impact_20170905.vcf.gz file that I downloaded from the links you posted, but this file has about 38 million entries. dbSNP is only supposed to have 13 million, right?

written 3 months ago by b10hazard

dbSNP is constantly being curated and there are discrepancies in it. I don't fully know the extent of this, though.

I cannot confirm this but, when you think about it, there are 4 possible bases at each position. So, genome-wide, there are >10 billion possible bases to consider. By the same logic, the ~13 million dbSNP sites would give a total of ~50 million. I don't know if this explains the issue that you've found, though.

The NCBI says this about your file:

The file common_no_known_medical_impact.vcf.gz was created to provide users with an up-to-date report of common alleles not known to cause clinical phenotypes. This file can be used to subtract variants (filter) from a set of variant calls, thereby narrowing the list of variations that might warrant further evaluation for clinical significance. Should you wish to filter polymorphisms out of your whole genome/exome sequencing results, use the "common_no_known_medical_impact" file.

The "common_no_known_medical_impact.vcf.gz" file and the "clinvar.vcf.gz" file are not mutually exclusive because some variants asserted to be non-pathogenic that were obtained through clinical channels appear in both the "clinvar.vcf.gz" file and the "common_no_known_medical_impact.vcf.gz" file. Records for non-pathogenic variations that were submitted through clinical channels are marked as non-pathogenic and have allele frequencies consistent with a non-pathogenic status.
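
The subtraction described in the quoted text can be sketched as follows. This is only an illustration under stated assumptions: both files use the same reference build and the same chromosome naming (for example "1" rather than "chr1"), variants are represented identically in both, and the set of common variants fits in memory. The function names and paths are hypothetical.

```python
import gzip


def open_vcf(path):
    """Open a plain or gzip/bgzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "rt")


def variant_keys(vcf_line):
    """Yield one (chrom, pos, ref, alt) key per ALT allele on a data line."""
    chrom, pos, _id, ref, alt = vcf_line.rstrip("\n").split("\t")[:5]
    for allele in alt.split(","):
        yield (chrom, int(pos), ref, allele)


def load_common_keys(common_path):
    """Collect the keys of every variant in the 'common' VCF."""
    keys = set()
    with open_vcf(common_path) as handle:
        for line in handle:
            if not line.startswith("#"):
                keys.update(variant_keys(line))
    return keys


def subtract_common(calls_path, common_path, out_path):
    """Write calls that are not fully covered by the common set.

    Returns the number of call records that were removed.
    """
    common = load_common_keys(common_path)
    removed = 0
    with open_vcf(calls_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith("#"):
                dst.write(line)
            elif all(key in common for key in variant_keys(line)):
                removed += 1
            else:
                dst.write(line)
    return removed
```

A record is dropped only when every one of its ALT alleles is in the common set, so a multi-allelic call with one uncommon allele is kept for further review.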

I don't know if that helps any further. All of this is a relatively novel area and I'm not sure how they can truly gauge pathogenic versus benign versus 'functional' with great confidence, given the current state of knowledge.

I do know that there are tools currently out there that attempt to assist in clinical exome variant filtering (and / or non-coding regulatory variants), such as:

This is an area of interest of mine right now, in fact.

modified 3 months ago • written 3 months ago by Kevin Blighe

Wait, my 13 million figure appears to have stuck in my head from before the release of 1000 Genomes Phase 3. dbSNP has now amassed hundreds of millions of SNPs. Here are some release notes for dbSNP version 150: https://www.ncbi.nlm.nih.gov/projects/SNP/snp_summary.cgi?view+summary=view+summary&build_id=150

That explains your finding.
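
To check a record count like the 38 million mentioned above, or to see whether the same rs ID or the same position appears on more than one line, the file can be streamed once and tallied. This is a rough sketch with a hypothetical function name; it keeps every ID and site in memory, which may be too much for the full 00-All file.

```python
import gzip


def count_records(path):
    """Return (n_records, n_distinct_ids, n_distinct_sites) for a gzipped VCF."""
    n_records = 0
    ids = set()
    sites = set()
    with gzip.open(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and the column header
            chrom, pos, rs_id = line.split("\t", 3)[:3]
            n_records += 1
            ids.add(rs_id)
            sites.add((chrom, pos))
    return n_records, len(ids), len(sites)
```

If the number of records is larger than the number of distinct IDs, some rs IDs are listed more than once; if it is larger than the number of distinct sites, some positions carry more than one record.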

written 3 months ago by Kevin Blighe