Error in GATK GenomicsDBImport
12 months ago


I have a large number of gVCF files that I'm trying to joint-genotype, starting with GenomicsDBImport in GATK. When I say large, I mean 135 samples * 229 genomic intervals = 30,915 files.

Here's what I have:

java -Xmx80g -XX:ParallelGCThreads=20 -jar $GATKPATH GenomicsDBImport -L $LIST \
-V ${SLURM_ARRAY_TASK_ID}.1.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.2.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.3.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.4.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.5.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.6.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.133.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.134.raw.g.vcf \
-V ${SLURM_ARRAY_TASK_ID}.135.raw.g.vcf \
--merge-input-intervals true \
--genomicsdb-workspace-path /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_${SLURM_ARRAY_TASK_ID}

where $LIST points to the location of the scaffold list for each interval, and the task ID identifies the interval.

This runs for a while but then this happens:

13:43:09.139 INFO  NativeLibraryLoader - Loading from jar:file:/n/holyscratch01/edwards_lab/rafa/gatk-package-!/com/intel/gkl/native/
Mar 20, 2020 1:43:13 PM runningOnComputeEngine
INFO: Failed to detect whether we are running on Google Compute Engine.
13:43:13.385 INFO  GenomicsDBImport - ------------------------------------------------------------
13:43:13.385 INFO  GenomicsDBImport - The Genome Analysis Toolkit (GATK) v4.1.4.0
13:43:13.385 INFO  GenomicsDBImport - For support and documentation go to
13:43:14.389 INFO  GenomicsDBImport - Executing as on Linux v3.10.0-957.12.1.el7.x86_64 amd64
13:43:14.389 INFO  GenomicsDBImport - Java runtime: Java HotSpot(TM) 64-Bit Server VM v10.0.1+10
13:43:14.389 INFO  GenomicsDBImport - Start Date/Time: March 20, 2020 at 1:43:09 PM GMT-05:00
13:43:14.389 INFO  GenomicsDBImport - ------------------------------------------------------------
13:43:14.389 INFO  GenomicsDBImport - ------------------------------------------------------------
13:43:14.390 INFO  GenomicsDBImport - HTSJDK Version: 2.20.3
13:43:14.390 INFO  GenomicsDBImport - Picard Version: 2.21.1
13:43:14.390 INFO  GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 2
13:43:14.390 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
13:43:14.390 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : true
13:43:14.390 INFO  GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
13:43:14.390 INFO  GenomicsDBImport - Deflater: IntelDeflater
13:43:14.390 INFO  GenomicsDBImport - Inflater: IntelInflater
13:43:14.390 INFO  GenomicsDBImport - GCS max retries/reopens: 20
13:43:14.390 INFO  GenomicsDBImport - Requester pays: disabled
13:43:14.391 INFO  GenomicsDBImport - Initializing engine
13:44:18.385 INFO  IntervalArgumentCollection - Processing 48059334 bp from intervals
13:44:18.412 INFO  GenomicsDBImport - Done initializing engine
13:44:18.806 INFO  GenomicsDBImport - Shutting down engine
[March 20, 2020 at 1:44:18 PM GMT-05:00] done. Elapsed time: 1.16 minutes.

A USER ERROR has occurred: Error creating GenomicsDB workspace: /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_177 already exists

Thanks for any pointers!!!!

The last line of the log already states the error:

A USER ERROR has occurred: Error creating GenomicsDB workspace: /n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_177 already exists

You should consider a different naming strategy for the DB workspace, so that a rerun does not collide with a directory left over from an earlier run.

Just remove the directory "db_177" and try again. But make sure the parent directory, genomic_DBs, has been created.
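As a rough sketch of that advice, something like the following could go in the SLURM script before the GenomicsDBImport call. The function name prepare_workspace is made up for illustration; it only uses standard shell commands and assumes it is safe to delete whatever is left at the workspace path from an earlier attempt.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of GATK): make sure the parent directory
# exists and the workspace itself does not, since GenomicsDBImport refuses
# to create a workspace that already exists.
prepare_workspace() {
    local workspace="$1"
    # Create the parent directory (e.g. genomic_DBs) if it is missing.
    mkdir -p "$(dirname "$workspace")"
    # Remove a workspace left over from an earlier failed or repeated run.
    if [ -e "$workspace" ]; then
        echo "Removing stale workspace: $workspace" >&2
        rm -rf -- "$workspace"
    fi
}

# Assumed usage, just before the java ... GenomicsDBImport command:
# prepare_workspace "/n/holyscratch01/edwards_lab/rafa/genomic_DBs/db_${SLURM_ARRAY_TASK_ID}"
```

Be careful with this if a workspace from a successful import is still needed, since the helper deletes it without asking.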

6 months ago
yussab ▴ 30

I went through the GATK GenomicsDBImport documentation, and this is the solution. You can find all the useful information at the link below.


IMPORTANT: "The --genomicsdb-workspace-path must point to a non-existent or empty directory."

Remember to mark the post as solved if you've got the correct answer ;)


