Question

NA12878 High Confidence Callset Sequence Dictionaries

0

4.7 years ago

ilee66 • 0

I want to compare a WGS VCF callset that I produced following the GATK Best Practices with the NA12878 WGS gold-standard (high-confidence) VCF callset, but an error about differing sequence dictionary sizes is preventing me from performing any concordance analysis.

I have downloaded the NA12878 "High Confidence Callset" (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/GIABPedigreev0.2/). I have tried both this specific release and the one under "latest". When I try to compare this .vcf to a .vcf I have produced, I get an error saying that the dictionary sizes are different (command and error below).

From what I've gathered so far, this error likely arises from alignment against different reference genomes. I first got it after aligning and calling variants myself with the human_g1k_v37 reference genome (I have been unable to find out which reference genome these gold-standard VCF files were built against). I then downloaded the RMNISTHS_30xdownsample.bam file (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/data/NA12878/NIST_NA12878_HG001_HiSeq_300x/). The README mentions that it was aligned with BWA-MEM but not against which reference genome. I assumed that this BAM from the NCBI FTP was aligned to the same reference genome as the VCF from the same FTP, but I still get the same error as before.

The GenotypeConcordance command that produces the error is as follows:

/path/to/gatk GenotypeConcordance -CV=/path/to/myinput.vcf.gz -O=/path/to/output.vcf -TV=/path/to/NIST_RTG_PlatGen_merged_highconfidence_v0.2_Allannotate.vcf.gz

The error:

htsjdk.samtools.util.SequenceUtil$SequenceListsDifferException: Sequence Dictionaries are not the same size (25, 181)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceListsEqual(SequenceUtil.java:250)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:333)
        at htsjdk.samtools.util.SequenceUtil.assertSequenceDictionariesEqual(SequenceUtil.java:319)
        at picard.vcf.GenotypeConcordance.doWork(GenotypeConcordance.java:350)
        at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:295)
        at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:25)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:162)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:205)
        at org.broadinstitute.hellbender.Main.main(Main.java:291)
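
For anyone hitting the same exception: the two numbers in the message are the sizes of the two sequence dictionaries being compared, which for VCF inputs are, as far as I can tell, built from the `##contig` lines of each header. A minimal Python sketch for listing and comparing those entries, assuming plain-text or gzip/bgzip-compressed VCFs. The function names `contig_names` and `compare_dictionaries` are made up for illustration and are not part of GATK or Picard:

```python
import gzip
import re


def contig_names(vcf_path):
    """Return the contig IDs declared in the ##contig header lines of a VCF."""
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    names = []
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if not line.startswith("#"):
                break  # first record reached, the header is over
            match = re.match(r"##contig=<.*?\bID=([^,>]+)", line)
            if match:
                names.append(match.group(1))
    return names


def compare_dictionaries(vcf_a, vcf_b):
    """Summarise how the contig lists of two VCF headers differ."""
    a, b = contig_names(vcf_a), contig_names(vcf_b)
    return {
        "sizes": (len(a), len(b)),
        "only_in_a": sorted(set(a) - set(b)),
        "only_in_b": sorted(set(b) - set(a)),
    }
```

Calling `compare_dictionaries("/path/to/myinput.vcf.gz", "/path/to/truth.vcf.gz")` (placeholder paths) shows whether the headers differ only in naming, such as `chr1` versus `1`, or also in which contigs they declare.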

If anyone knows of a tool (within GATK or elsewhere) that disregards differing sequence dictionary lengths, that would be awesome. (An earlier version of GATK had an option for this, but I can't find it in GATK4.)

Thanks in advance for any ideas/help/advice

vcf genome alignment gatk • 1.8k views

Updated 10 months ago by Ram 43k • written 4.7 years ago by ilee66 • 0

score 1 · Answer 1 · 2019-07-25

1

4.7 years ago

Pierre Lindenbaum 161k

See: changing of chromosome notation in CHROM columns of vcf file
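
A rough sketch of what changing the chromosome notation can look like in Python, for illustration only. It assumes the two files differ just by a `chr` prefix on the autosomes and X/Y, which is not confirmed for the files in this question. `rename_vcf_line` and `CHR_TO_PLAIN` are hypothetical names, not part of any tool:

```python
import re

# Assumed mapping: "chr1" -> "1", ..., "chrX" -> "X", "chrY" -> "Y".
CHR_TO_PLAIN = {
    f"chr{c}": c for c in [*map(str, range(1, 23)), "X", "Y"]
}


def rename_vcf_line(line, mapping):
    """Rename the contig in one VCF line (header or record) using `mapping`."""
    if line.startswith("##contig="):
        # Rewrite only the ID=... field of the contig declaration.
        return re.sub(
            r"(\bID=)([^,>]+)",
            lambda m: m.group(1) + mapping.get(m.group(2), m.group(2)),
            line,
            count=1,
        )
    if line.startswith("#"):
        return line  # other header lines are left untouched
    # Data record: the CHROM value is the first tab-separated column.
    chrom, sep, rest = line.partition("\t")
    return mapping.get(chrom, chrom) + sep + rest
```

Applied line by line to a decompressed VCF, this changes names only; it does not add or remove `##contig` entries, and it is only meaningful when the renamed contigs really are the same sequences in both references. `bcftools annotate --rename-chrs` does the same job from a two-column mapping file.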


0

Thank you Pierre. Exactly what I was looking for.

4.7 years ago by ilee66 • 0