dbSNP with RefSeq chromosome notation
2.0 years ago
Tom • 0

Hi everyone,

I'm currently trying to apply the GATK data pre-processing workflow to a large set of whole-genome sequencing data (almost 4 TB of FASTQ files). The final step is base quality score recalibration (BQSR), using the GATK BaseRecalibrator tool. To run this tool, I need a VCF of 'known sites of variation' (i.e., dbSNP).

Now, my problem is that I aligned the sequencing data to the RefSeq hg38 reference genome, which uses RefSeq chromosome accession numbers (e.g., NC_000001.11, NC_000002.12, ...), while the dbSNP resource uses UCSC-style notation (e.g., chr1, chr2, ...). GATK BaseRecalibrator can't handle this mismatch between the chromosome names in the aligned BAM file (NC_000001.11, ...) and those in the dbSNP resource (chr1, ...). Brute-force replacing the chromosome names in the 120+ GB dbSNP file with a 'sed' command takes ages, and I'm not sure it won't introduce new errors. Does anyone know whether a dbSNP resource using RefSeq chromosome notation is available for download, given that this reference genome is widely used? Or is there another efficient solution to this problem?

Tom

germline dbSNP Genbank GATK • 879 views

Change the chromosome notation in the VCF; see: Replacing the Chr names and position notions in vcf

2.0 years ago
Tom • 0

Guess I'll answer my own question. I found this old Biostars question, which seems to contain the solution to my specific problem. I will try it and accept this answer as the solution if it works...

VCF: Replacing RefSeq ID to chr in #CHROM
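
For anyone landing here later, below is a minimal sketch of one way to do the renaming in a single streaming pass, so the file is never loaded into memory. It is not taken from the linked thread. The names `CHROM_MAP`, `rename_chrom`, and `rename_vcf` are hypothetical, and the mapping only contains the two chromosomes mentioned in the question; a real run would need the full chr-to-RefSeq table for the assembly in use. The sketch rewrites both the `##contig` header lines and the first column of each record, and leaves unmapped contigs untouched.

```python
import re

# Hypothetical mapping: only the two pairs named in the question are listed.
# A real run needs every contig of the assembly (assumption: supplied by the user).
CHROM_MAP = {"chr1": "NC_000001.11", "chr2": "NC_000002.12"}

# Captures the prefix and the contig ID of a "##contig=<ID=...>" header line.
_CONTIG_RE = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_chrom(line, mapping):
    """Return one VCF line with its contig name translated via `mapping`."""
    if line.startswith("##contig=<ID="):
        # Header: swap the ID, keep the rest (length, assembly, ...) as is.
        return _CONTIG_RE.sub(
            lambda m: m.group(1) + mapping.get(m.group(2), m.group(2)), line
        )
    if line.startswith("#"):
        # Other meta lines and the #CHROM column header pass through unchanged.
        return line
    # Data record: only the first tab-separated field (CHROM) is touched.
    chrom, sep, rest = line.partition("\t")
    return mapping.get(chrom, chrom) + sep + rest


def rename_vcf(lines, mapping):
    """Lazily translate contig names for an iterable of VCF lines."""
    for line in lines:
        yield rename_chrom(line, mapping)
```

To use it on a compressed file, the lines could be read with `gzip.open(path, "rt")` and the results written out through another text handle. The output then has to be bgzip-compressed and indexed again before GATK will accept it. If bcftools is installed, `bcftools annotate --rename-chrs` with a two-column mapping file (old name, new name) does the same job and is likely faster than `sed` or a Python loop.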
