How to use GatherVcfs in GATK 4.1.4.0
12 months ago

Hi all, I used GenomicsDBImport to generate a genomic database from a bunch of gvcf files corresponding to a number of genomic intervals. For each interval, GenomicsDBImport produced a directory with contents that look like this:

[rmarcondes@boslogin02 db_180]$ pwd
/n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_180
[rmarcondes@boslogin02 db_180]$ ls -a
.              111$1$5060538  114$1$7908999  117$1$5787846  callset.json            vidmap.json
..             112$1$3923849  115$1$2315063  118$1$2797076  __tiledb_workspace.tdb
110$1$3996139  113$1$1360579  116$1$1734749  119$1$4990981  vcfheader.vcf


Now I want to use GatherVcfs to merge all my GVCFs into a single VCF file. I assumed the files I needed to pass to GatherVcfs were the "vcfheader.vcf" files in the interval directories, like this:

java -Xmx200g -XX:ParallelGCThreads=20 -jar $GATKPATH GatherVcfs \
-O thevcf.vcf


But that just produced an empty vcf file with no variants, just a header and a list of all contigs.

What gives? Thanks for any pointers!

12 months ago
Ram 32k

The first thing that comes to mind is that vcfheader.vcf, as aptly named, probably contains only the VCF header. That would make sense, given that GenomicsDBImport builds a GenomicsDB workspace from GVCF files rather than writing the variant records out as VCF.

In other words, you're combining header files and expecting data to be populated out of thin air.
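A quick way to test this guess is to count the non-header lines in one of those files. The sketch below assumes a plain-text, uncompressed VCF; `count_vcf_records` is a hypothetical helper name, not a GATK command.

```shell
# Hypothetical helper: count data records (non-header, non-empty lines) in an
# uncompressed VCF. Header lines start with '#'.
count_vcf_records() {
    awk '!/^#/ && NF { n++ } END { print n + 0 }' "$1"
}

# Example call; a result of 0 means the file holds only header lines.
# count_vcf_records /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_180/vcfheader.vcf
```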

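Your data is probably not in the JSON files either (callset.json and vidmap.json hold sample and contig/field mappings) but in the array directories, such as 110$1$3996139.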
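Whichever files hold the records, GATK tools do not read them directly: the workspace directory is passed as a whole with a `gendb://` prefix, for example to GenotypeGVCFs, and GatherVcfs then concatenates the resulting per-interval VCFs, which have to be listed in genomic order. Here is a minimal sketch of that two-step flow. The reference path, the output file names, and the helper function names are assumptions for illustration, not something from the question. With `DRY_RUN=1` (the default here) it only prints the commands.

```shell
# Sketch only. REF, the output names, and the helper names are assumptions.
GATK_JAR="${GATKPATH:-gatk.jar}"   # GATK jar, $GATKPATH in the question
REF="${REF:-reference.fasta}"      # hypothetical reference FASTA
DRY_RUN="${DRY_RUN:-1}"            # 1 = print commands only, 0 = also run them

run() {
    # Print the command line; execute it only when DRY_RUN=0.
    printf '%s\n' "$*"
    if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

genotype_workspace() {
    # $1 = GenomicsDB workspace directory, $2 = output VCF for that interval
    run java -jar "$GATK_JAR" GenotypeGVCFs -R "$REF" -V "gendb://$1" -O "$2"
}

gather_vcfs() {
    # $1 = merged output VCF; remaining args = per-interval VCFs in genomic order
    gather_out="$1"; shift
    # Rewrite the argument list in place: drop each VCF and re-append it as "-I <vcf>".
    for vcf in "$@"; do
        shift
        set -- "$@" -I "$vcf"
    done
    run java -jar "$GATK_JAR" GatherVcfs "$@" -O "$gather_out"
}

# Example calls (output names are hypothetical):
# genotype_workspace /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_180 interval_180.vcf.gz
# gather_vcfs thevcf.vcf interval_179.vcf.gz interval_180.vcf.gz
```

Printing first makes it easy to check the `-I` order before anything runs, since GatherVcfs expects non-overlapping inputs in genomic order.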