Question

GATK - GenomicsDBImport, ValidateVariant, CreateSequenceDictionary Issues

0

Entering edit mode

21 months ago

aheritas • 0

Hi all,

I would like to try GenomicsDB in a sample VCF. This VCF is obtained from an imputation software (Gencove) and it is not a g.VCF. I have tried the following code:

docker run -it broadinstitute/gatk:latest   

gatk GenomicsDBImport -V /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz --genomicsdb-workspace-path my_database --intervals chr20,chr21

This gives the following error:

A USER ERROR has occurred: Failed to create reader from file:///data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz because of the following error:
    Unable to parse header with error: /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz, for input source: file:///data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz

Some comments in GATK forums like here and here, suggest the following:

Double check that there is a .vcf.gz.tbi index file corresponding to each .vcf.gz file and that both the index file and VCF file were compressed.

Run ValidateVariants on the VCF file.

Run HaplotypeCaller

I have checked the first point and the result from ls /data/genomicsdb/results/LUSE03081974 is the following:

human_g1k_v37.fasta      
human_g1k_v37.fasta.fai     
LUSE03081974.vcf.gz
LUSE03081974.vcf.gz.csi
LUSE03081974.vcf.gz.tbi
LUSE03081974.vcf.gz.tbi.gz
LUSE03081974.bai        
LUSE03081974.bam

Therefore, I try the next two points, run ValidateVariants:

gatk ValidateVariants -R /data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta -V /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz --validation-type-to-exclude ALL --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'

Which gives me an error:

    A USER ERROR has occurred: The specified fasta file (file:///data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta) does not exist.
***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException$MissingReference: The specified fasta file (file:///data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta) does not exist.
    at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.checkFastaPath(CachingIndexedFastaSequenceFile.java:173)
        at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:143)
        at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:125)
        at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:110)
        at org.broadinstitute.hellbender.engine.ReferenceFileSource.<init>(ReferenceFileSource.java:35)
        at org.broadinstitute.hellbender.engine.ReferenceDataSource.of(ReferenceDataSource.java:27)
        at org.broadinstitute.hellbender.engine.GATKTool.initializeReference(GATKTool.java:439)
        at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:722)
        at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:45)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

So there appears to be a problem with the fasta file. From comments in this forum and GATK, the fasta files need to have their corresponding .fai index and .dict . So, I try to create an index using:

gatk CreateSequenceDictionary -R /data/genomicsdb/results/LUSE03081974/human_g1k_v37.fa

which produces a new error:

[Mon Jul 11 16:27:36 GMT 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1086849024
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Cannot write file: /data/genomicsdb/results/LUSE03081974/human_g1k_v37.dict. Neither file nor parent directory exist.
    at htsjdk.samtools.util.IOUtil.assertFileIsWritable(IOUtil.java:554)
    at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:223)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

So I am not sure where my problem is. I originally thought it could be that genomicsDB worked only with g.VCF and not VCFs, but then I ended up not being able to use ValidateVariants or CreateSequenceDictionary. It seems to me that I might have some issues with naming/referencing files, but I am not sure what is it.

Some additional information, if its useful: using gatk latest from docker and java version:

openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

I appreciate your help!

ValidateVariants HaplotypeCaller GATK GenomicsDB CreateSequenceDictionary • 1.3k views

ADD COMMENT • link 21 months ago by aheritas • 0

0

Entering edit mode

what is the output of

ls -lah /data/genomicsdb/results/LUSE03081974

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

total 8,1G
drwxrwxr-x 2 server server  4,0K jul 11 13:17 .
drwxrwxr-x 3 server server  4,0K jul 11 10:19 ..
-rw-rw-r-- 1 server server  3,0G jul 11 11:20 human_g1k_v37.fasta
-rw-rw-r-- 1 server server  2,7K jul 11 11:41 human_g1k_v37.fasta.fai
-rw-rw-r-- 1 server server   317 jul 11 10:33 LUSE03081974_ancestry-json.json
-rw-rw-r-- 1 server server  2,2M jul 11 10:46 LUSE03081974.bai
-rw-rw-r-- 1 server server  2,2G jul 11 10:57 LUSE03081974.bam
-rw-rw-r-- 1 server server  1,7M jul 11 10:33 LUSE03081974_cnv-cnr.cnr
-rw-rw-r-- 1 server server  282K jul 11 10:57 LUSE03081974_cnv-cns.cns
-rw-rw-r-- 1 server server   70K jul 11 10:46 LUSE03081974_cnv-png.png
-rw-rw-r-- 1 server server  999M jul 11 10:27 LUSE03081974_fastq-r1.fastq.gz
-rw-rw-r-- 1 server server 1007M jul 11 10:33 LUSE03081974_fastq-r2.fastq.gz
-rw-rw-r-- 1 server server    18 jul 11 10:19 LUSE03081974_metadata.json
-rw-rw-r-- 1 server server  2,4K jul 11 10:19 LUSE03081974_qc.json
-rw-rw-r-- 1 server server   338 jul 11 10:46 LUSE03081974_traits-json.json
-rw-rw-r-- 1 server server  1,1G jul 11 10:45 LUSE03081974.vcf.gz
-rw-rw-r-- 1 server server  1,7M jul 11 10:46 LUSE03081974.vcf.gz.csi
-rw-rw-r-- 1 server server  2,1M jul 11 10:57 LUSE03081974.vcf.gz.tbi
-rw-rw-r-- 1 server server  2,0M jul 11 13:17 LUSE03081974.vcf.gz.tbi.gz

ADD REPLY • link 21 months ago by aheritas • 0

score 1 · Accepted Answer · 2022-07-11

1

Entering edit mode

21 months ago

Pierre Lindenbaum 161k

docker run -it broadinstitute/gatk:latest

is /data/genomicsdb/results/LUSE03081974 visible from docker ?

ADD COMMENT • link 21 months ago by Pierre Lindenbaum 161k

0

Entering edit mode

I suspect this might be the problem. I didn't know that I had to make my data visible to docker. Would you mind suggesting me how? (or pointing to any reference). Thank you.

ADD REPLY • link 21 months ago by aheritas • 0

0

Entering edit mode

Solved:

docker run -v /data/genomicsdb/results/LUSE03081974:/data -it broadinstitute/gatk:latest

Now the files are visible from docker, in the data directory. Thank you, Pierre!

ADD REPLY • link 21 months ago by aheritas • 0