Question: NCBI dbSNP compatibility w/ Ensembl whole genome
umn_bist wrote (3.3 years ago):

I downloaded dbSNP build 141 from NCBI and GRCh38.p5 from Gencode. I am using both with GATK BaseRecalibrator, but I get an error caused by the 'chr' prefix in the contig names.

Specifically, the genome sequence uses 'chr' names and also includes unplaced contigs, but the SNP VCF file has neither. I am wondering if I can simply append 'chr' into the SNP file (assuming that unplaced contigs are included) or if there is a SNP file (with indels included) for the Ensembl genome (ideally for both GRCh38 and GRCh37).

EDIT: Upon further inspection, the SNP VCF file with the 'papu' notation does include unplaced contigs, but it still lacks the 'chr' notation. I also found that Ensembl has its own dbSNP (version 144) that corresponds to Ensembl 83 (GRCh38), but I do not see a download link. I also see that UCSC adopted the Gencode/Ensembl format, but their SNP file does not include variants on unplaced contigs. First, I am wondering if this matters for the purpose of running GATK. Second, is it possible to merge common, clinically associated, and multimapped variants into one VCF, and is this advisable?

ERROR   /00-All.vcf contigs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT]

ERROR   reference contigs = [chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19, chr20, chr21, chr22, chrX, chrY, chrM, GL000008.2, GL000009.2, GL000194.1, GL000195.1, GL000205.2, GL000208.1, GL000213.1, GL000214.1, GL000216.2, GL000218.1, GL000219.1, GL000220.1, GL000221.1, GL000224.1, GL000225.1, GL000226.1, KN538364.1, KQ031383.1, KN538369.1, JH159136.1, JH159137.1, KQ031387.1, KN538360.1, KN196484.1, KN196476.1, KN196479.1, KN196473.1, KN196487.1, KN196475.1, KQ090016.1, KN538361.1, KN196474.1, KQ090022.1, KN196478.1, KN196480.1, KQ090028.1, KN196483.1, KN196481.1, KN538363.1, KN538362.1, KQ031385.1, KQ031386.1, KQ031388.1, KN538365.1, KN538366.1, KN538367.1, KN538370.1, KN538373.1, KN538371.1, KQ031384.1, KN538372.1, KQ090021.1, KN196482.1, KQ458386.1, KN196472.1, GL383545.1, GL383546.1, KI270824.1, KI270825.1, KQ090020.1, GL383547.1, KN538368.1, KI270826.1, KI270827.1, KI270829.1, KI270830.1, KI270831.1, KI270832.1, KI270902.1, KI270903.1, KI270927.1, GL877875.1, GL383549.1, GL383550.2, KQ090023.1, GL877876.1, GL383552.1, KI270904.1, GL383553.2, KI270835.1, GL383551.1, KI270837.1, KI270833.1, KI270834.1, KI270836.1, KI270838.1, KI270839.1, KI270840.1, KI270841.1, KI270842.1, KI270843.1, KQ090024.1, KQ090025.1, KI270844.1, KI270845.1, KI270846.1, KI270847.1, KI270852.1, KI270848.1, GL383554.1, KI270906.1, GL383555.2, KI270851.1, KI270849.1, KI270905.1, KI270850.1, KQ031389.1, KI270853.1, GL383556.1, GL383557.1, KI270855.1, KQ031390.1, KI270856.1, KQ090027.1, KQ090026.1, KI270854.1, KI270909.1, GL383563.3, KI270861.1, GL383564.2, GL000258.2, KI270860.1, KI270907.1, KI270862.1, ... ...

(contracted to meet character limit)
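
For anyone hitting the same error, a quick pre-check is to compare the contig names used in the VCF against those declared in the reference dictionary before running GATK. The following is only a minimal sketch, assuming a POSIX shell with grep, cut, sort, awk, and comm available; the function names are made up for illustration, and the file paths are whatever you pass in.

```sh
# Print the distinct contig names used in the data lines of a VCF (stdin).
vcf_contigs() {
    grep -v '^#' | cut -f1 | sort -u
}

# Print the contig names declared in a sequence dictionary (.dict, stdin).
# Each @SQ line carries the contig name in a tab-separated SN: field.
dict_contigs() {
    awk -F'\t' '$1 == "@SQ" {
        for (i = 2; i <= NF; i++) if ($i ~ /^SN:/) print substr($i, 4)
    }' | sort -u
}

# List names that appear in the VCF but not in the dictionary.
# $1 = VCF path, $2 = .dict path
vcf_only_contigs() {
    tmp_vcf=$(mktemp) && tmp_ref=$(mktemp) || return 1
    vcf_contigs < "$1" > "$tmp_vcf"
    dict_contigs < "$2" > "$tmp_ref"
    comm -23 "$tmp_vcf" "$tmp_ref"
    rm -f "$tmp_vcf" "$tmp_ref"
}
```

With the files from this thread, `vcf_only_contigs 00-All.vcf GRCh38.dict` should list 1 through 22, X, Y, and MT, since none of those names exist in the 'chr'-prefixed reference.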


When I printed the first and last 3000 lines of NCBI's 00-All_papu.vcf (which has SNPs for unplaced contigs), the chromosome column contained NT_113889.1 and NW_009646209.1, respectively.

The genome reference that this SNP file corresponds to (GRCh38) from Ensembl does not have any unplaced contigs starting with NT or NW (they only start with GL, KN, KQ, JH, or KI). Do I need to edit the unplaced contig names in my dbSNP file to match those of my genome reference?

(comment written 3.2 years ago by umn_bist)
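
If the contig names do need translating, one possible approach is to build a RefSeq-to-GenBank lookup from the assembly report that NCBI publishes alongside each assembly, then rewrite the CHROM column. This is a rough sketch under assumptions: it assumes the report is tab-separated with '#' comment lines, with the GenBank accession in column 5 and the RefSeq accession in column 7 (check the header of your copy); the function names are hypothetical.

```sh
# Build a two-column "RefSeq<TAB>GenBank" map from an assembly report (stdin).
# ASSUMPTION: column 5 = GenBank accession, column 7 = RefSeq accession,
# with "na" marking a missing value.
refseq_to_genbank_map() {
    awk -F'\t' '!/^#/ && $5 != "na" && $7 != "na" { print $7 "\t" $5 }'
}

# Rewrite the CHROM column of a VCF (stdin) using a map file ($1).
# Header lines and contigs absent from the map pass through unchanged.
rename_vcf_contigs() {
    awk -F'\t' -v OFS='\t' -v mapfile="$1" '
        BEGIN {
            while ((getline line < mapfile) > 0) {
                split(line, f, "\t")
                map[f[1]] = f[2]
            }
            close(mapfile)
        }
        /^#/ { print; next }
        ($1 in map) { $1 = map[$1] }
        { print }
    '
}
```

Because unmapped names are left alone, the 1 to 22, X, Y, and MT records would still need the separate 'chr' handling discussed in this thread.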
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (3.3 years ago):

"I am wondering if I can simply append 'chr' into the SNP file"

No. chrM is the exception: the VCF calls it MT, so it needs its own substitution:

sed -e '/^[^#]/s/^/chr/' -e 's/^chrMT/chrM/'
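
To make the behaviour of that one-liner easier to check, here it is wrapped in a small filter function. This is just a sketch; the function name and the file names in the comment are placeholders, not anything from dbSNP or GATK.

```sh
# Prefix the CHROM field of every data line with "chr", then turn chrMT into chrM.
# Header lines (starting with '#') are passed through untouched.
add_chr_prefix() {
    sed -e '/^[^#]/s/^/chr/' -e 's/^chrMT/chrM/'
}

# Hypothetical usage (placeholder file names):
#   gunzip -c 00-All.vcf.gz | add_chr_prefix > 00-All.chr.vcf
```

Note that the first expression prefixes every non-header line, so on the papu file it would also turn NT_113889.1 into chrNT_113889.1, which does not match any contig in the reference listed above.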

Last time I looked at the NCBI VCF, it had no sequence dictionary (##contig header lines). You could insert one with Picard UpdateVcfSequenceDictionary.
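
A sketch of what that Picard call might look like, using the classic KEY=value argument style. The wrapper function name, the PICARD_JAR variable, and the file names are assumptions for illustration; check the argument names against the Picard version you have installed.

```sh
# Add ##contig header lines to a VCF from a reference sequence dictionary.
# $1 = input VCF, $2 = reference .dict, $3 = output VCF
# ASSUMPTION: PICARD_JAR points at your picard.jar (defaults to ./picard.jar).
update_vcf_dict() {
    java -jar "${PICARD_JAR:-picard.jar}" UpdateVcfSequenceDictionary \
        INPUT="$1" \
        SEQUENCE_DICTIONARY="$2" \
        OUTPUT="$3"
}

# Hypothetical usage (placeholder file names):
#   update_vcf_dict 00-All.chr.vcf GRCh38.dict 00-All.chr.dict.vcf
```

The dictionary should come from the same FASTA that GATK will use, so that the contig names and lengths in the VCF header agree with the reference.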


Thank you for your reply. This is exactly what I needed. Would I use the GRCh38.dict file as my sequence dictionary to update the dbSNP vcf file? Thanks again.

(comment written 3.3 years ago by umn_bist)

Yes, that should work.

(comment written 3.2 years ago by Pierre Lindenbaum)

So I found that Ensembl has a publicly available file corresponding to GRCh38 release 83. If I'm looking at tumor samples, wouldn't I want both germline and somatic variations, and is it advisable to merge the two files? Is the somatic variation file equivalent to Sanger's COSMIC file?

(comment written 3.2 years ago by umn_bist)