I want to compare a WGS vcf callset I have produced using the GATK best practices with the NA12878 WGS gold-standard/high confidence vcf callset but an error related to Differing Sequence Dictionary sizes is preventing me form performing any concordance analysis.
I have downloaded the NA12878 "High Confidence Callset" (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/GIABPedigreev0.2/) (I have tried this specific release as well and the one under "latest") and when I try to compare this .vcf to a .vcf I have produced I get an error that the dictionary sizes are different (code below).
From what I've gathered so far this error likely arises from alignment with different reference genomes. I first got this error when aligning & calling myself when I was using the human_g1k_v37 reference genome (I have been unable to find the reference genome these gold-standard vcf files were developed under). I also then downloaded the RMNISTHS_30xdownsample.bam file (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/) The Readme mentions it was aligned with BWA MEM but not which reference genome, but I got the same error as before. I assumed that the RMNISTHS_30xdownsample.bam file from the NCBI FTP was aligned with the same reference genome as the vcf from the same FTP, but I still get the error.
The GenotypeConcordance code that produces the error is as follows:
/path/to/gatk GenotypeConcordance -CV=/path/to/myinput.vcf.gz -O=/path/to/output.vcf -TV=/path/to/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz`
The error:
htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequence Dictionaries are not the same size (25, 181)
at htsjdk.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:250)at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:333)
at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:319)
at picard.vcf.GenotypeConcordance.doWork(GenotypeConcordance.java:350)
at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
at org.broadinstitute.hellbender.Main.main(Main.java:291)
If anyone has knows whether there is a tool (within GATK or another) that disregards the differing sequence dictionary lengths (An earlier version of GATK had an option for this but I cant find this option in GATK4) that would be awesome.
Thanks in advance for any ideas/help/advice
Thank you Pierre. Exactly what I was looking for.