Question

dbSNP with RefSeq chromosome notation

2.1 years ago

Tom • 0

Hi everyone,

I'm currently trying to apply the GATK data pre-processing workflow to a large set of Whole Genome Sequencing data (almost 4TB of fastq files). The final step is base quality score recalibration (using the GATK BaseRecalibrator tool). To run this tool, I need a VCF with 'known sites of variation' (i.e., dbSNP).

Now, my problem is that I aligned the sequencing data to the RefSeq hg38 reference genome, which uses the RefSeq chromosomal accession numbers (e.g., NC_000001.11, NC_000002.12,...), while the dbSNP resource uses the normal UCSC-style notation (e.g., chr1, chr2,...). GATK BaseRecalibrator can't handle this discrepancy between the chromosome notation in the aligned BAM file (NC_000001.11,...) and the dbSNP resource (chr1,...). Brute-force replacing the chromosome notation in the 120+GB dbSNP file with a 'sed' command takes ages, and I'm not sure it won't introduce new errors. Does anyone know if there is a dbSNP resource available for download that uses the RefSeq chromosome notation, given that this reference genome is widely used? Or are there any other efficient solutions to this problem?

Tom

germline dbSNP Genbank GATK • 906 views

Updated 2.1 years ago by Pierre Lindenbaum • written 2.1 years ago by Tom

Change the chromosome notation in the VCF; see: Replacing the Chr names and position notions in vcf

2.1 years ago by Pierre Lindenbaum 161k

score 0 · Answer 1 · 2022-04-08

2.1 years ago

Tom • 0

Guess I'll answer my own question. I found this old Biostars question, which seems to contain the solution to my specific problem. I will try it and accept this answer as the solution if it works...

VCF: Replacing RefSeq ID to chr in #CHROM
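
For reference, the renaming step can be sketched as a single streaming pass, so the 120+GB file is never loaded into memory. This is only a minimal sketch, not the method from the linked thread. It assumes a hypothetical two-column mapping file (old name, new name, whitespace-separated, e.g. `chr1 NC_000001.11`) that you build yourself. The function names and file names are made up for illustration.

```python
import gzip


def load_mapping(lines):
    """Parse 'old_name new_name' pairs, skipping blank lines and # comments."""
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        old, new = line.split()[:2]
        mapping[old] = new
    return mapping


def rename_contigs(src, dst, mapping):
    """Stream VCF text from src to dst, renaming contigs.

    Rewrites the ID in '##contig=<ID=...>' header lines and the CHROM
    column of data lines. Names missing from the mapping are left as-is.
    Returns the number of data lines whose CHROM was renamed.
    """
    prefix = "##contig=<ID="
    renamed = 0
    for line in src:
        if line.startswith(prefix):
            rest = line[len(prefix):]
            # The contig ID ends at the first ',' or '>'.
            ends = [i for i in (rest.find(","), rest.find(">")) if i != -1]
            if ends:
                end = min(ends)
                name = rest[:end]
                line = prefix + mapping.get(name, name) + rest[end:]
        elif not line.startswith("#"):
            # Only the first tab-separated column is touched.
            chrom, sep, tail = line.partition("\t")
            if chrom in mapping:
                line = mapping[chrom] + sep + tail
                renamed += 1
        dst.write(line)
    return renamed


def rename_file(map_path, in_path, out_path):
    """Hypothetical wrapper: gzip-compressed VCF in, gzip-compressed VCF out."""
    with open(map_path) as fh:
        mapping = load_mapping(fh)
    with gzip.open(in_path, "rt") as src, gzip.open(out_path, "wt") as dst:
        return rename_contigs(src, dst, mapping)
```

Usage would look like `rename_file("chr_map.txt", "dbsnp.vcf.gz", "dbsnp.refseq.vcf.gz")`, with all three paths as placeholders. Note that Python's `gzip` module writes plain gzip rather than BGZF, so the output would still need to be recompressed with `bgzip` and indexed before GATK can use it. If bcftools is available, `bcftools annotate --rename-chrs` takes the same kind of two-column mapping file and is probably the simpler route.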