GATK - GenomicsDBImport, ValidateVariant, CreateSequenceDictionary Issues
1
0
Entering edit mode
6 months ago
aheritas • 0

Hi all,

I would like to try GenomicsDB in a sample VCF. This VCF is obtained from an imputation software (Gencove) and it is not a g.VCF. I have tried the following code:

docker run -it broadinstitute/gatk:latest   

gatk GenomicsDBImport -V /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz --genomicsdb-workspace-path my_database --intervals chr20,chr21  

This gives the following error:

A USER ERROR has occurred: Failed to create reader from file:///data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz because of the following error:
    Unable to parse header with error: /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz, for input source: file:///data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz

Some comments in GATK forums like here and here, suggest the following:

  1. Double check that there is a .vcf.gz.tbi index file corresponding to each .vcf.gz file and that both the index file and VCF file were compressed.
    1. Run ValidateVariants on the VCF file.
    2. Run HaplotypeCaller

I have checked the first point and the result from ls /data/genomicsdb/results/LUSE03081974 is the following:

human_g1k_v37.fasta      
human_g1k_v37.fasta.fai     
LUSE03081974.vcf.gz
LUSE03081974.vcf.gz.csi
LUSE03081974.vcf.gz.tbi
LUSE03081974.vcf.gz.tbi.gz
LUSE03081974.bai        
LUSE03081974.bam        

Therefore, I try the next two points, run ValidateVariants:

gatk ValidateVariants -R /data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta -V /data/genomicsdb/results/LUSE03081974/LUSE03081974.vcf.gz --validation-type-to-exclude ALL --java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true'

Which gives me an error:

    A USER ERROR has occurred: The specified fasta file (file:///data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta) does not exist.
***********************************************************************
org.broadinstitute.hellbender.exceptions.UserException$MissingReference: The specified fasta file (file:///data/genomicsdb/results/LUSE03081974/human_g1k_v37.fasta) does not exist.
    at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.checkFastaPath(CachingIndexedFastaSequenceFile.java:173)
        at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:143)
        at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:125)
        at org.broadinstitute.hellbender.utils.fasta.CachingIndexedFastaSequenceFile.<init>(CachingIndexedFastaSequenceFile.java:110)
        at org.broadinstitute.hellbender.engine.ReferenceFileSource.<init>(ReferenceFileSource.java:35)
        at org.broadinstitute.hellbender.engine.ReferenceDataSource.of(ReferenceDataSource.java:27)
        at org.broadinstitute.hellbender.engine.GATKTool.initializeReference(GATKTool.java:439)
        at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:722)
        at org.broadinstitute.hellbender.engine.VariantWalker.onStartup(VariantWalker.java:45)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:138)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

So there appears to be a problem with the fasta file. From comments in this forum and GATK, the fasta files need to have their corresponding .fai index and .dict . So, I try to create an index using:

gatk CreateSequenceDictionary -R /data/genomicsdb/results/LUSE03081974/human_g1k_v37.fa  

which produces a new error:

[Mon Jul 11 16:27:36 GMT 2022] picard.sam.CreateSequenceDictionary done. Elapsed time: 0.00 minutes.
Runtime.totalMemory()=1086849024
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
htsjdk.samtools.SAMException: Cannot write file: /data/genomicsdb/results/LUSE03081974/human_g1k_v37.dict. Neither file nor parent directory exist.
    at htsjdk.samtools.util.IOUtil.assertFileIsWritable(IOUtil.java:554)
    at picard.sam.CreateSequenceDictionary.doWork(CreateSequenceDictionary.java:223)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:308)
    at org.broadinstitute.hellbender.cmdline.PicardCommandLineProgramExecutor.instanceMain(PicardCommandLineProgramExecutor.java:37)
    at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
    at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
    at org.broadinstitute.hellbender.Main.main(Main.java:289)

So I am not sure where my problem is. I originally thought it could be that genomicsDB worked only with g.VCF and not VCFs, but then I ended up not being able to use ValidateVariants or CreateSequenceDictionary. It seems to me that I might have some issues with naming/referencing files, but I am not sure what is it.

Some additional information, if its useful: using gatk latest from docker and java version:

openjdk version "1.8.0_242"
OpenJDK Runtime Environment (build 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08)
OpenJDK 64-Bit Server VM (build 25.242-b08, mixed mode)

I appreciate your help!

ValidateVariants HaplotypeCaller GATK GenomicsDB CreateSequenceDictionary • 520 views
ADD COMMENT
0
Entering edit mode

what is the output of

ls -lah /data/genomicsdb/results/LUSE03081974
ADD REPLY
0
Entering edit mode
total 8,1G
drwxrwxr-x 2 server server  4,0K jul 11 13:17 .
drwxrwxr-x 3 server server  4,0K jul 11 10:19 ..
-rw-rw-r-- 1 server server  3,0G jul 11 11:20 human_g1k_v37.fasta
-rw-rw-r-- 1 server server  2,7K jul 11 11:41 human_g1k_v37.fasta.fai
-rw-rw-r-- 1 server server   317 jul 11 10:33 LUSE03081974_ancestry-json.json
-rw-rw-r-- 1 server server  2,2M jul 11 10:46 LUSE03081974.bai
-rw-rw-r-- 1 server server  2,2G jul 11 10:57 LUSE03081974.bam
-rw-rw-r-- 1 server server  1,7M jul 11 10:33 LUSE03081974_cnv-cnr.cnr
-rw-rw-r-- 1 server server  282K jul 11 10:57 LUSE03081974_cnv-cns.cns
-rw-rw-r-- 1 server server   70K jul 11 10:46 LUSE03081974_cnv-png.png
-rw-rw-r-- 1 server server  999M jul 11 10:27 LUSE03081974_fastq-r1.fastq.gz
-rw-rw-r-- 1 server server 1007M jul 11 10:33 LUSE03081974_fastq-r2.fastq.gz
-rw-rw-r-- 1 server server    18 jul 11 10:19 LUSE03081974_metadata.json
-rw-rw-r-- 1 server server  2,4K jul 11 10:19 LUSE03081974_qc.json
-rw-rw-r-- 1 server server   338 jul 11 10:46 LUSE03081974_traits-json.json
-rw-rw-r-- 1 server server  1,1G jul 11 10:45 LUSE03081974.vcf.gz
-rw-rw-r-- 1 server server  1,7M jul 11 10:46 LUSE03081974.vcf.gz.csi
-rw-rw-r-- 1 server server  2,1M jul 11 10:57 LUSE03081974.vcf.gz.tbi
-rw-rw-r-- 1 server server  2,0M jul 11 13:17 LUSE03081974.vcf.gz.tbi.gz
ADD REPLY
1
Entering edit mode
6 months ago

docker run -it broadinstitute/gatk:latest

is /data/genomicsdb/results/LUSE03081974 visible from docker ?

ADD COMMENT
0
Entering edit mode

I suspect this might be the problem. I didn't know that I had to make my data visible to docker. Would you mind suggesting me how? (or pointing to any reference). Thank you.

ADD REPLY
0
Entering edit mode

Solved:

docker run -v /data/genomicsdb/results/LUSE03081974:/data -it broadinstitute/gatk:latest

Now the files are visible from docker, in the data directory. Thank you, Pierre!

ADD REPLY

Login before adding your answer.

Traffic: 1365 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6