Hi everyone,
I am new to bioinformatics and I am struggling with GATK's somatic variant calling pipeline.
I have completed most of the preprocessing steps: CreateSequenceDictionary, bwa index, bwa mem, and MarkDuplicatesSpark.
However, I keep hitting a UserError at the BaseRecalibrator step.
For my known sites file, I have been using a C57BL/6 known sites VCF file I found on the Mouse Genomes Project website.
For the reference genome, I used the latest GRCm39 release.
My initial error with BaseRecalibrator was that the contigs were incompatible between the reference and the VCF file. I tried to solve this by using bcftools annotate --rename-chrs to rename the contigs in the VCF file.
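For anyone following along: bcftools annotate --rename-chrs reads a plain-text map with one "old_name new_name" pair per line. Below is a minimal Python sketch for producing that text. The function name and the example pair are illustrative assumptions (the "1" on the left is a guess at how the VCF names chromosome 1; check the real names in your VCF header and reference index). Note that renaming only changes the labels, not the coordinates.

```python
def format_rename_map(pairs):
    """Return text with one 'old_name<TAB>new_name' line per contig."""
    return "".join(f"{old}\t{new}\n" for old, new in pairs.items())


# Illustrative pair only: VCF contig name -> reference contig name.
example_pairs = {"1": "NC_000067.7"}
print(format_rename_map(example_pairs), end="")
```

The printed text can be saved to a file and passed to --rename-chrs.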
Yet, now I am getting a new error:
A USER ERROR has occurred: Input files reference and features have incompatible contigs: Found contigs with the same name but different lengths: contig reference = NC_000067.7 / 195154279 contig features = NC_000067.7 / 195471971.
At this point, I am not sure if I should just redo the analysis with an older version of the mouse reference genome, or if this error can be fixed. Any pointers?
I'm an idiot... I just checked, and yes, the VCF file was for GRCm38_68 from Sanger, not GRCm39. That makes total sense. I think this was the issue. Thanks a lot!
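For anyone hitting the same error, a quick way to spot a build mismatch is to compare the contig lengths in the reference index with those declared in the VCF header. Here is a rough Python sketch, assuming a samtools-style .fai index and ##contig header lines that carry a length field. The function names are my own, and the example inputs reuse the two lengths from the error message (the last three .fai columns are placeholder values).

```python
import re


def parse_fai(text):
    """Return {contig: length} from the text of a .fai index."""
    lengths = {}
    for line in text.splitlines():
        if line.strip():
            name, length = line.split("\t")[:2]
            lengths[name] = int(length)
    return lengths


def parse_vcf_contigs(header_text):
    """Return {contig: length} from ##contig header lines that have a length."""
    lengths = {}
    for line in header_text.splitlines():
        if not line.startswith("##contig="):
            continue
        name = re.search(r"ID=([^,>]+)", line)
        length = re.search(r"length=(\d+)", line)
        if name and length:
            lengths[name.group(1)] = int(length.group(1))
    return lengths


def length_mismatches(ref, vcf):
    """List (contig, ref_length, vcf_length) for shared names whose lengths differ."""
    shared = sorted(ref.keys() & vcf.keys())
    return [(c, ref[c], vcf[c]) for c in shared if ref[c] != vcf[c]]


# Lengths taken from the error message; other .fai columns are placeholders.
fai_text = "NC_000067.7\t195154279\t0\t80\t81\n"
vcf_header = "##fileformat=VCFv4.2\n##contig=<ID=NC_000067.7,length=195471971>\n"
print(length_mismatches(parse_fai(fai_text), parse_vcf_contigs(vcf_header)))
# prints [('NC_000067.7', 195154279, 195471971)]
```

A same-name, different-length hit like this usually means the two files come from different genome builds, which renaming contigs cannot fix.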
I am running into the same issue, except with a BALB/c reference genome whose contigs don't match those in my known sites VCF file or my sequencing data.
In this case, would you recommend any particular tool to convert the contigs of my reference genome to match my sequencing data? Is this common practice?
If not, what is usually done when only one BALB/c genome build is available and the sequencing data contigs don't match?
I figured it out, for anybody who may be wondering in the future.
Use Picard UpdateVcfSequenceDictionary with INPUT set to your old VCF and -SD set to your reference genome.
Then index the newly generated VCF file and use it as the known sites file for your GATK analysis.
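In case it helps, the two steps might look roughly like the sketch below, which only builds and prints the command lines without running anything. The file names are placeholders, the "picard" wrapper command assumes a conda-style install (otherwise it would be java -jar picard.jar), and using GATK's IndexFeatureFile for the indexing step is my assumption. The input flag for that tool may differ between GATK versions.

```python
import shlex


def build_commands(old_vcf, reference_dict, new_vcf):
    """Return the two command lines as argument lists (nothing is executed)."""
    update_header = [
        "picard", "UpdateVcfSequenceDictionary",
        "-I", old_vcf,          # the original known sites VCF
        "-O", new_vcf,          # VCF with the replaced header dictionary
        "-SD", reference_dict,  # sequence dictionary of the reference genome
    ]
    # Assumed indexing tool; any VCF indexer that GATK accepts should do.
    index_vcf = ["gatk", "IndexFeatureFile", "-I", new_vcf]
    return [update_header, index_vcf]


# Placeholder file names for illustration.
for cmd in build_commands("known_sites.vcf", "reference.dict", "known_sites.updated.vcf"):
    print(shlex.join(cmd))
```

Keep in mind that UpdateVcfSequenceDictionary replaces the sequence dictionary in the VCF header and does not change variant coordinates, so it only makes sense when the VCF and the reference are really the same build.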