Hello,
I'm trying to combine 6 GVCF files into a single VCF file using GATK4 GenomicsDBImport + GenotypeGVCFs. I'm still using GATK4.beta5, since I'm not able to find the link to GATK4.beta6.
My target.interval file looks like this:
1:8022745-8023035
1:8025283-8025585
1:8029304-8029564
If I run the following command:
gatk-launch GenomicsDBImport \
-V sample_001.gvcf.gz \
-V sample_002.gvcf.gz \
-V sample_003.gvcf.gz \
-V sample_004.gvcf.gz \
-V sample_005.gvcf.gz \
-V sample_006.gvcf.gz \
--genomicsDBWorkspace testDB \
--L 1:8022745-8023035
GATK will work and generate a database with genotypes for this specific interval.
However, if I run the following script:
#!/bin/bash
while read in; do \
gatk-launch GenomicsDBImport \
-V sample_001.gvcf.gz \
-V sample_002.gvcf.gz \
-V sample_003.gvcf.gz \
-V sample_004.gvcf.gz \
-V sample_005.gvcf.gz \
-V sample_006.gvcf.gz \
--genomicsDBWorkspace testDB \
--L "$in"; done < target.interval
GATK runs, but I get the following error for every single interval and nothing is added to the database:
: Problem parsing start/end value in interval string. Value was: 8023035arse Genome Location string: 1:8022745-8023035
: Problem parsing start/end value in interval string. Value was: 8025585arse Genome Location string: 1:8025283-8025585
I understand that the software is recognizing each line of my file correctly but, for some reason, it cannot parse the start and end position of each interval.
I also tried to use a regular bed file as input file with the following format:
1 8022745 8023035
1 8025283 8025585
1 8029304 8029564
In that case the error was the following one:
A USER ERROR has occurred: Badly formed genome unclippedLoc: Contig '1 8022745 8023035' does not match any contig in the GATK sequence dictionary derived from the reference; are you sure you are using the correct reference fasta file?
Can anybody help me with this issue?
Thank you very much,
Here you have the entire error message for the first two intervals:
09:50:28.057 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/opt/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/inte l/gkl/native/libgkl_compression.so
[December 15, 2017 9:50:27 AM PST] GenomicsDBImport --genomicsDBWorkspace SAmerican_panel --variant 17-50618.gatk.gvcf.gz --variant 17-50619.gatk.gvcf.gz --varian t 17-50620.gatk.gvcf.gz --variant 17-50621.gatk.gvcf.gz --variant 17-50622.gatk.gvcf.gz --variant 17-50623.gatk.gvcf.gz --variant 17-50624.gatk.gvcf.gz --variant 1 7-50625.gatk.gvcf.gz --variant 17-50626.gatk.gvcf.gz --variant 17-50627.gatk.gvcf.gz --variant 17-50628.gatk.gvcf.gz --variant 17-50629.gatk.gvcf.gz --variant 17-5 0630.gatk.gvcf.gz --variant 17-50631.gatk.gvcf.gz --variant 17-50632.gatk.gvcf.gz --variant 17-50633.gatk.gvcf.gz --variant 17-50634.gatk.gvcf.gz --variant 17-5063 --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNa meMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVarian tIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPref etchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_infla ter false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 15, 2017 9:50:27 AM PST] Executing as user@server on Linux 3.10.0-693.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_151-b12; Version: 4.beta .5
09:50:28.443 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
09:50:28.443 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:50:28.443 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
09:50:28.444 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:50:28.444 INFO GenomicsDBImport - Deflater: IntelDeflater
09:50:28.444 INFO GenomicsDBImport - Inflater: IntelInflater
09:50:28.444 INFO GenomicsDBImport - GCS max retries/reopens: 20
09:50:28.444 INFO GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree /dr_all_nio_fixes
09:50:28.444 INFO GenomicsDBImport - Initializing engine
09:50:38.228 INFO GenomicsDBImport - Shutting down engine
[December 15, 2017 9:50:38 AM PST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.17 minutes.
Runtime.totalMemory()=1368915968
***********************************************************************
: Problem parsing start/end value in interval string. Value was: 8023035arse Genome Location string: 1:8022745-8023035
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--javaOptions '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.
09:50:47.419 INFO NativeLibraryLoader - Loading libgkl_compression.so from jar:file:/mnt/opt/gatk-4.beta.5/gatk-package-4.beta.5-local.jar!/com/inte l/gkl/native/libgkl_compression.so
[December 15, 2017 9:50:47 AM PST] GenomicsDBImport --genomicsDBWorkspace SAmerican_panel --variant 17-50618.gatk.gvcf.gz --variant 17-50619.gatk.gvcf.gz --varian t 17-50620.gatk.gvcf.gz --variant 17-50621.gatk.gvcf.gz --variant 17-50622.gatk.gvcf.gz --variant 17-50623.gatk.gvcf.gz --variant 17-50624.gatk.gvcf.gz --variant 1 7-50625.gatk.gvcf.gz --variant 17-50626.gatk.gvcf.gz --variant 17-50627.gatk.gvcf.gz --variant 17-50628.gatk.gvcf.gz --variant 17-50629.gatk.gvcf.gz --variant 17-5 0630.gatk.gvcf.gz --variant 17-50631.gatk.gvcf.gz --variant 17-50632.gatk.gvcf.gz --variant 17-50633.gatk.gvcf.gz --variant 17-50634.gatk.gvcf.gz --variant 17-5063 --genomicsDBSegmentSize 1048576 --genomicsDBVCFBufferSize 16384 --overwriteExistingGenomicsDBWorkspace false --batchSize 0 --consolidate false --validateSampleNa meMap false --readerThreads 1 --interval_set_rule UNION --interval_padding 0 --interval_exclusion_padding 0 --interval_merging_rule ALL --readValidationStringency SILENT --secondsBetweenProgressUpdates 10.0 --disableSequenceDictionaryValidation false --createOutputBamIndex true --createOutputBamMD5 false --createOutputVarian tIndex true --createOutputVariantMD5 false --lenient false --addOutputSAMProgramRecord true --addOutputVCFCommandLine true --cloudPrefetchBuffer 0 --cloudIndexPref etchBuffer 0 --disableBamIndexCaching false --help false --version false --showHidden false --verbosity INFO --QUIET false --use_jdk_deflater false --use_jdk_infla ter false --gcs_max_retries 20 --disableToolDefaultReadFilters false
[December 15, 2017 9:50:47 AM PST] Executing as user@server on Linux 3.10.0-693.2.2.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_151-b12; Version: 4.beta .5
09:50:47.976 INFO GenomicsDBImport - HTSJDK Defaults.COMPRESSION_LEVEL : 5
09:50:47.976 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_READ_FOR_SAMTOOLS : false
09:50:47.976 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_SAMTOOLS : false
09:50:47.976 INFO GenomicsDBImport - HTSJDK Defaults.USE_ASYNC_IO_WRITE_FOR_TRIBBLE : false
09:50:47.976 INFO GenomicsDBImport - Deflater: IntelDeflater
09:50:47.976 INFO GenomicsDBImport - Inflater: IntelInflater
09:50:47.976 INFO GenomicsDBImport - GCS max retries/reopens: 20
09:50:47.976 INFO GenomicsDBImport - Using google-cloud-java patch c035098b5e62cb4fe9155eff07ce88449a361f5d from https://github.com/droazen/google-cloud-java/tree /dr_all_nio_fixes
09:50:47.976 INFO GenomicsDBImport - Initializing engine
09:50:53.175 INFO GenomicsDBImport - Shutting down engine
[December 15, 2017 9:50:53 AM PST] org.broadinstitute.hellbender.tools.genomicsdb.GenomicsDBImport done. Elapsed time: 0.10 minutes.
Runtime.totalMemory()=1389887488
***********************************************************************
: Problem parsing start/end value in interval string. Value was: 8025585arse Genome Location string: 1:8025283-8025585
***********************************************************************
Hello,
I have tried to handle this issue with a WDL + CROMWELL script, but I am still facing a problem. I will detail my input files and my script to see if you can help me solve the problem.
I am running the following command:
My
application.conf
looks like this:my.json
file looks like this:my.wdl
script is the following one:The error that I am getting with this script is the following one:
If I specified the 'sep' option I would have several input values for the GenotypeGVCFs -V option and I would get the following error:
My main problem is that the output of the
merge_gvcfs
task is an array of files andGenotypeGVCFs -V
option allows only a single input value. Even if I flattened themerge_gvcfs
array into a string, I would still have several input values for thegenerate_vcf
task.Is there any way to combine into a single folder the multiple output folders from the scatter function that loops through the intervals file list? This way I would be able to provide a single input value for the
GenotypeGVCFs
command.Thank you very much for your help,
Best,
Yatros
what is the output of
?
The output from this grep is the following one: 1#8022745#8023035
I do not think that GenomicsDBImport is recognizing the input file in the right way. And I think that the input file should have the following format 1:8022745-8023035.
When I do a script using with WDL+CROMWELL with the input file using the format 1:8022745-8023035, the script runs without any problem, but it returns a different folder for every single interval and it is tough to merge them afterwards with GenotypeGVCFs, since this option from GATK allows only one input folder at a time, not several ones.
Thanks
don't you use a chromosome notation like "
1
" while the reference is usingchr1
?Dear Pierre-Lindenbaum. I already checked this. My reference file has no 'chr' notation. This is how the first line of my .dict file looks like:
Thanks anyway
By the way, I already switched to GATK4.beta.6 and the problem with this script persists,
Thanks,
Did you try getting in touch with the GATK team? They're super helpful in general, plus they're actively seeking feedback on GATK4.