Hi everyone,
I'm currently trying to apply the GATK data pre-processing workflow to a large set of Whole Genome Sequencing data (almost 4 TB of fastq files). The final step is base quality score recalibration (using the GATK BaseRecalibrator tool). To run this tool, I need a VCF with 'known sites of variation' (i.e., dbSNP).
Now, my problem is that I aligned the sequencing data to the RefSeq hg38 reference genome, which uses RefSeq chromosomal accession numbers (e.g., NC_000001.11, NC_000002.12, ...), while the dbSNP resource uses UCSC-style notation (e.g., chr1, chr2, ...). GATK BaseRecalibrator can't handle this mismatch between the chromosome names in the aligned BAM file (NC_000001.11, ...) and those in the dbSNP resource (chr1, ...). Brute-force replacing the chromosome names in the 120+ GB dbSNP file with a 'sed' command takes ages, and I'm not sure it won't introduce new errors. Does anyone know of a downloadable dbSNP resource that uses the RefSeq chromosome notation, given how widely this reference genome is used? Or of any other efficient solution to this problem?
Tom
Change the chromosome notation in the VCF. See: Replacing the Chr names and position notions in vcf
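As a rough illustration of what changing the notation involves, here is a minimal Python sketch that streams a VCF and rewrites only the `##contig` header IDs and the CHROM column, which is safer than a blanket text substitution over every field. The function name `rename_contigs` and the `CHROM_MAP` table are hypothetical; the table holds just the two pairs mentioned in the question, and a complete mapping for every contig in your reference is assumed to come from elsewhere (for example, the assembly report for your reference build).

```python
import re
import sys

# Illustrative mapping only: the two pairs named in the question.
# A full table (all chromosomes, plus any alt/unplaced contigs) is an
# assumption here and has to be built for your own reference.
CHROM_MAP = {
    "chr1": "NC_000001.11",
    "chr2": "NC_000002.12",
}

# Matches the ID at the start of a header line such as
# ##contig=<ID=chr1,length=...>
CONTIG_ID = re.compile(r"^(##contig=<ID=)([^,>]+)")


def rename_contigs(lines, mapping):
    """Yield VCF lines with contig names translated through `mapping`.

    Names missing from `mapping` are passed through unchanged.
    """
    for line in lines:
        if line.startswith("##contig="):
            m = CONTIG_ID.match(line)
            if m and m.group(2) in mapping:
                line = m.group(1) + mapping[m.group(2)] + line[m.end():]
        elif not line.startswith("#"):
            # Data line: only the first tab-separated field (CHROM) is touched.
            chrom, sep, rest = line.partition("\t")
            if chrom in mapping:
                line = mapping[chrom] + sep + rest
        yield line


if __name__ == "__main__":
    # Stream stdin to stdout so the whole file never sits in memory.
    sys.stdout.writelines(rename_contigs(sys.stdin, CHROM_MAP))
```

With hypothetical file names, this could be run as `zcat dbsnp.vcf.gz | python rename_contigs.py | bgzip > dbsnp.refseq.vcf.gz`, after which the output needs to be re-indexed before GATK can use it. If bcftools is installed, its `annotate --rename-chrs` option performs the same kind of renaming from a two-column mapping file and is likely to be faster than a Python loop on a file of this size.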