Question

Help with error in GATK variant calling

0

Entering edit mode

11 months ago

Chris ▴ 260

Hi all,

I try some reference genome such as Homo_sapiens_assembly38.fasta and Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa but I still got the error below. Would you please have a suggestion? Thank you so much. The link in the error message doesn't work.

gatk BaseRecalibrator -I Library_1Aligned.out.sorted.bam -R /home/user/Homo_sapiens.GRCh38.dna_sm.primary_assembly
.fa --known-sites 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz -O recal_data.table

A USER ERROR has occurred: Fasta dict file file:///oak/Homo_sapiens.GRCh38.dna_sm.primary_assembly.dict for reference file:///oak/Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa does not exist. 
Please see http://gatkforums.broadinstitute.org/discussion/1601/how-can-i-prepare-a-fasta-file-to-use-as-reference for help creating it.

GATK • 2.5k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 11 months ago by Chris ▴ 260

1

Entering edit mode

In my book I am writing a chapter on human genome variation calling by reproducing a published paper with different methods.

I found that bcftools with almost default settings, outperforms GATK lengthy and tedious "best practices". Moreover bcftools runs in a fraction of time and needs a small fraction of resources ...

In my opinion, the so-called "GATK best practices", marking duplicates, base recalibration, etc are a bit outdated, it is information that is being cited and referred to a lot on the account that it was the default approach at the Broad Institute. But the method is so complicated and has so many moving parts and as you note so obtuse and tedious to run.

If accuracy is of utmost importance and you have the computational resources then run the Google DeepVariant; it is substantially better than GATK. And simpler to run as well.

ADD REPLY • link 11 months ago by Istvan Albert 100k

0

Entering edit mode

Thank you for the suggestion! I tried nf-core/sarek but I got an error that the dev team still working on so I find tools that get the job done in the mean time.

ADD REPLY • link 11 months ago by Chris ▴ 260

0

Entering edit mode

ah yes, the most important advice is to stay away from nextflow and the like.

These workflow platforms were never designed to teach you how to run anything. As you yourself experienced, all you end up with is endless chasing around nonexisting documentation and fighting the platform instead of learning bioinformatics.

Workflow management platforms are to be used only once you know a bioinformatics process so well you are bored and annoyed that you must retype commands.

To learn a bioinformatics tool, look at the tool documentation, the deepvariant is exquisitely well documented.

ADD REPLY • link 11 months ago by Istvan Albert 100k

0

Entering edit mode

You said you used STAR. Are you working on RNAseq or DNAseq?

ADD REPLY • link 11 months ago by Ram 43k

Ram · Accepted Answer · 2023-05-09

2

Entering edit mode

11 months ago

Istvan Albert 100k

I agree that it is pretty pathetic when a link produced by a tool does not actually work, evidently it tripped up so many people that they added a link to the error message ... only that the link does not work anymore ... geez

what is even more pathetic and absurd is that creating that fasta dictionary is a trivial 15-second job that GATK could easily do itself ... but no OMG ... that would be too much to ask ... it rather complains that it cannot find the file then sends people to wild goose chases ... that is bioinformatics all right

... ok rant over ...

here is what you need to do (FWIW who knows how long this link is valid)

https://gatk.broadinstitute.org/hc/en-us/articles/360036729911-CreateSequenceDictionary-Picard-

just to make things even more confusing, picard may be installed separately and can be invoked as

# Install the picard tools
mamba install -c bioconda picard

# Run the sequence dictionary creation
picard CreateSequenceDictionary -R reference.fa -O reference.dict

ADD COMMENT • link 11 months ago by Istvan Albert 100k

0

Entering edit mode

Interesting reply! Thank you :) I got a new error:

A USER ERROR has occurred: An index is required but was not found for file /user/1000G_phase1.snps.high_confidence.hg38.vcf.gz. Support for unindexed block-compressed files has been temporarily disabled. Try running IndexFeatureFile on the input.

ADD REPLY • link updated 11 months ago by Ram 43k • written 11 months ago by Chris ▴ 260

1

Entering edit mode

Read the error. You need an index for the VCF file. Google that, index the VCF file and try the command again. Always read the freaking error message.

ADD REPLY • link 11 months ago by Ram 43k

0

Entering edit mode

Thank you for your reply. I know fastq file has an index file but surprised that vcf file also has an index file.

ADD REPLY • link 11 months ago by Chris ▴ 260

2

Entering edit mode

the fifth rule of bioinformatics:

5. every file will also have an index file

ADD REPLY • link 11 months ago by Istvan Albert 100k

0

Entering edit mode

Thank you for the advice. I made an index file for vcf recently but I forgot. Eventually, I don't have an error but I don't see the output file as expected:

reads contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, KI270728.1, KI270727.1, KI270442.1, KI270729.1, GL0
00225.1, KI270743.1, GL000008.2, GL000009.2, KI270747.1, KI270722.1, GL000194.1, KI270742.1, GL000205.2, GL000195.1, KI270736.1, KI270733.1, GL000224.1, GL0002
19.1, KI270719.1, GL000216.2, KI270712.1, KI270706.1, KI270725.1, KI270744.1, KI270734.1, GL000213.1, GL000220.1, KI270715.1, GL000218.1, KI270749.1, KI270741.
1, GL000221.1, KI270716.1, KI270731.1, KI270751.1, KI270750.1, KI270519.1, GL000214.1, KI270708.1, KI270730.1, KI270438.1, KI270737.1, KI270721.1, KI270738.1, 
KI270748.1, KI270435.1, GL000208.1, KI270538.1, KI270756.1, KI270739.1, KI270757.1, KI270709.1, KI270746.1, KI270753.1, KI270589.1, KI270726.1, KI270735.1, KI2
70711.1, KI270745.1, KI270714.1, KI270732.1, KI270713.1, KI270754.1, KI270710.1, KI270717.1, KI270724.1, KI270720.1, KI270723.1, KI270718.1, KI270317.1, KI2707
40.1, KI270755.1, KI270707.1, KI270579.1, KI270752.1, KI270512.1, KI270322.1, GL000226.1, KI270311.1, KI270366.1, KI270511.1, KI270448.1, KI270521.1, KI270581.
1, KI270582.1, KI270515.1, KI270588.1, KI270591.1, KI270522.1, KI270507.1, KI270590.1, KI270584.1, KI270320.1, KI270382.1, KI270468.1, KI270467.1, KI270362.1, 
KI270517.1, KI270593.1, KI270528.1, KI270587.1, KI270364.1, KI270371.1, KI270333.1, KI270374.1, KI270411.1, KI270414.1, KI270510.1, KI270390.1, KI270375.1, KI2
70420.1, KI270509.1, KI270315.1, KI270302.1, KI270518.1, KI270530.1, KI270304.1, KI270418.1, KI270424.1, KI270417.1, KI270508.1, KI270303.1, KI270381.1, KI2705
29.1, KI270425.1, KI270396.1, KI270363.1, KI270386.1, KI270465.1, KI270383.1, KI270384.1, KI270330.1, KI270372.1, KI270548.1, KI270580.1, KI270387.1, KI270391.
1, KI270305.1, KI270373.1, KI270422.1, KI270316.1, KI270340.1, KI270338.1, KI270583.1, KI270334.1, KI270429.1, KI270393.1, KI270516.1, KI270389.1, KI270466.1, 
KI270388.1, KI270544.1, KI270310.1, KI270412.1, KI270395.1, KI270376.1, KI270337.1, KI270335.1, KI270378.1, KI270379.1, KI270329.1, KI270419.1, KI270336.1, KI2
70312.1, KI270539.1, KI270385.1, KI270423.1, KI270392.1, KI270394.1]
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

Would you please have a suggestion?

ADD REPLY • link 11 months ago by Chris ▴ 260

1

Entering edit mode

Yes, READ THE ERROR MESSAGE.

Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

ADD REPLY • link 11 months ago by Ram 43k

0

Entering edit mode

Another freaking-out message that I am totally not familiar with.

gatk BaseRecalibrator -I /bam/Library_1Aligned.out.sorted.bam -R Homo_sapiens_assembly38.fasta --known-sit
es 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz -O recal_data.table --java-options -DGATK_STACKTRACE
_ON_USER_EXCEPTION=true

or :

gatk BaseRecalibrator -I /bam/Library_1Aligned.out.sorted.bam -R Homo_sapiens_assembly38.fasta --known-sit
es 1000G_phase1.snps.high_confidence.hg38.vcf.gz --known-sites Homo_sapiens_assembly38.known_indels.vcf.gz -O recal_data.table --java-options -'DGATK_STACKTRACE
_ON_USER_EXCEPTION=true'

The command still doesn't print out the output. I got this:

reads contigs = [1, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 2, 20, 21, 22, 3, 4, 5, 6, 7, 8, 9, MT, X, Y, KI270728.1, KI270727.1, KI270442.1, KI270729.1, GL0
00225.1, KI270743.1, GL000008.2, GL000009.2, KI270747.1, KI270722.1, GL000194.1, KI270742.1, GL000205.2, GL000195.1, KI270736.1, KI270733.1, GL000224.1, GL0002
19.1, KI270719.1, GL000216.2, KI270712.1, KI270706.1, KI270725.1, KI270744.1, KI270734.1, GL000213.1, GL000220.1, KI270715.1, GL000218.1, KI270749.1, KI270741.
1, GL000221.1, KI270716.1, KI270731.1, KI270751.1, KI270750.1, KI270519.1, GL000214.1, KI270708.1, KI270730.1, KI270438.1, KI270737.1, KI270721.1, KI270738.1, 
KI270748.1, KI270435.1, GL000208.1, KI270538.1, KI270756.1, KI270739.1, KI270757.1, KI270709.1, KI270746.1, KI270753.1, KI270589.1, KI270726.1, KI270735.1, KI2
70711.1, KI270745.1, KI270714.1, KI270732.1, KI270713.1, KI270754.1, KI270710.1, KI270717.1, KI270724.1, KI270720.1, KI270723.1, KI270718.1, KI270317.1, KI2707
40.1, KI270755.1, KI270707.1, KI270579.1, KI270752.1, KI270512.1, KI270322.1, GL000226.1, KI270311.1, KI270366.1, KI270511.1, KI270448.1, KI270521.1, KI270581.
1, KI270582.1, KI270515.1, KI270588.1, KI270591.1, KI270522.1, KI270507.1, KI270590.1, KI270584.1, KI270320.1, KI270382.1, KI270468.1, KI270467.1, KI270362.1, 
KI270517.1, KI270593.1, KI270528.1, KI270587.1, KI270364.1, KI270371.1, KI270333.1, KI270374.1, KI270411.1, KI270414.1, KI270510.1, KI270390.1, KI270375.1, KI2
70420.1, KI270509.1, KI270315.1, KI270302.1, KI270518.1, KI270530.1, KI270304.1, KI270418.1, KI270424.1, KI270417.1, KI270508.1, KI270303.1, KI270381.1, KI2705
29.1, KI270425.1, KI270396.1, KI270363.1, KI270386.1, KI270465.1, KI270383.1, KI270384.1, KI270330.1, KI270372.1, KI270548.1, KI270580.1, KI270387.1, KI270391.
1, KI270305.1, KI270373.1, KI270422.1, KI270316.1, KI270340.1, KI270338.1, KI270583.1, KI270334.1, KI270429.1, KI270393.1, KI270516.1, KI270389.1, KI270466.1, 
KI270388.1, KI270544.1, KI270310.1, KI270412.1, KI270395.1, KI270376.1, KI270337.1, KI270335.1, KI270378.1, KI270379.1, KI270329.1, KI270419.1, KI270336.1, KI2
70312.1, KI270539.1, KI270385.1, KI270423.1, KI270392.1, KI270394.1]
        at org.broadinstitute.hellbender.utils.SequenceDictionaryUtils.validateDictionaries(SequenceDictionaryUtils.java:163)
        at org.broadinstitute.hellbender.utils.SequenceDictionaryUtils.validateDictionaries(SequenceDictionaryUtils.java:98)
        at org.broadinstitute.hellbender.engine.GATKTool.validateSequenceDictionaries(GATKTool.java:701)
        at org.broadinstitute.hellbender.engine.GATKTool.onStartup(GATKTool.java:643)
        at org.broadinstitute.hellbender.engine.ReadWalker.onStartup(ReadWalker.java:50)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.runTool(CommandLineProgram.java:137)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMainPostParseArgs(CommandLineProgram.java:192)
        at org.broadinstitute.hellbender.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:211)
        at org.broadinstitute.hellbender.Main.runCommandLineProgram(Main.java:160)
        at org.broadinstitute.hellbender.Main.mainEntry(Main.java:203)
        at org.broadinstitute.hellbender.Main.main(Main.java:289)

Would you please have a suggestion?

ADD REPLY • link 11 months ago by Chris ▴ 260

1

Entering edit mode

Your error message is missing a part - there needs to be something before the reads contigs. Run your command with 2>error.err added to the end and look at the content of error.err once the command ends (successfully or otherwise)

ADD REPLY • link 11 months ago by Ram 43k

0

Entering edit mode

You are right. Because the error message is so long so I just put the last part.

A USER ERROR has occurred: Input files reference and reads have incompatible contigs: No overlapping contigs found.

I try to run nf-core/sarek but I got an error that don't know how to fix so I decided to break down the pipeline to run each step.

ADD REPLY • link 11 months ago by Chris ▴ 260

1

Entering edit mode

Because the error message is so long so I just put the last part.

You left out the relevant part. Always give us the full error message, unless you're 100% sure you know which the relevant part is.

Read the error message, think and use Google, then guess what you need to check to see if the error message is indeed pointing to reality. Make your guess here and we'll steer you in the right direction. Please do not expect us to hand you answers.

ADD REPLY • link 11 months ago by Ram 43k

0

Entering edit mode

Thank you for the suggestion. I think the reason is the bam file I used aligned with different reference genome. The weird thing is the .dict file create from a different genomes very different in size (28kb vs 470kb). Should I align the fastq files by BWA instead of STAR?

ADD REPLY • link 11 months ago by Chris ▴ 260

1

Entering edit mode

Changing the aligner does not make sense as a solution to the problem of having different reference genomes. File size is not an indicator of anything significant. Dig deeper please.

ADD REPLY • link 11 months ago by Ram 43k

0

Entering edit mode

Thank you for the comment. I found these posts with similar error:

GATK BaseRecalibrator known-sites vcf file
GATK Input files reference and features have incompatible contigs: No overlapping contigs found.

So I should align my fastq files with the same reference genome? Maybe I should replace Homo_sapiens_assembly38.known_indels.vcf.gz by 00-All.vcf.gz

ADD REPLY • link 11 months ago by Chris ▴ 260

1

Entering edit mode

Realigning might not be necessary since you're at just the BaseRecalibrator step. Find the reference genome you used in the alignment and get the FA and DICT from that ref genome.

Known indels are used in variant calling, I don't think you're there yet. Plus, the two files you mention are probably not equivalent in terms of information content, so you may not be able to drop one for the other.

ADD REPLY • link 11 months ago by Ram 43k

0

Entering edit mode

Thank you so much for the follow-up! I have the same error.

This is the files I have in the running directory:

Homo_sapiens.GRCh38.dna_sm.primary_assembly.dict
Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa
Library_1Aligned.out.sorted.bam

I am not sure what you mean by getting the FA.

ADD REPLY • link 11 months ago by Chris ▴ 260

1

Entering edit mode

The .fa and .dict is what I meant. What was the fasta you aligned with?

ADD REPLY • link 11 months ago by Ram 43k

0

Entering edit mode

I aligned 4 files using STAR:

NS1_CSFP210000376-2a_HKGJ5DSX2_L1_1.fq.gz
NS1_CSFP210000376-2a_HKGJ5DSX2_L1_2.fq.gz
NS1_CSFP210000376-2a_HKGNMDSX2_L1_1.fq.gz
NS1_CSFP210000376-2a_HKGNMDSX2_L1_2.fq.gz

I created the .dict from Homo_sapiens.GRCh38.dna_sm.primary_assembly.fa and that is the .fa (reference genome I used). We are running out of space for replies.

ADD REPLY • link 11 months ago by Chris ▴ 260