GATK variant calling error
1
0
Entering edit mode
12 months ago
endretoth ▴ 30

Hi,

I have a trouble with variant calling and need help of someone who knows GATK. In a population study, I have mapped, with BWA-MEM, several loci onto a reference, which is one long fasta sequence (not chromosomes). After, I sorted and converted the SAM files so now I have BAM files, for all my samples.

I would like to call variants, only SNPs, from my samples using GATK. I would like to end up with a VCF with SNPs from all samples (I hope that GATK can match SNP positions among multiple samples).

So, I installed GATK 4.1.9.0, and created my command (it is very simple):

./gatk-4.1.9.0/gatk --java-options HaplotypeCaller ‐R ./reference.fasta ‐I ./samples/*.bam ‐O ./snps_output.vcf


I also tried:

./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I ./samples/*.bam ‐o ./snps_output.vcf


with know luck...

Error of the first command:

Error: Could not find or load main class HaplotypeCaller
Caused by: java.lang.ClassNotFoundException: HaplotypeCaller


Error of the second command:

A USER ERROR has occurred: Illegal argument value: Positional arguments were provided ',‐R{./reference.fasta{‐I{./samples/AL1-1_sorted.bam{./samples/BU1-1_sorted.bam{‐o{./snps_output.vcf}' but no positional argument is defined for this tool.


Since this is my first time with GATK please be gentle :D Thanks!

GATK SNP variant calling • 1.1k views
0
Entering edit mode
12 months ago

in

./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I ./samples/*.bam ‐o ./snps_output.vcf


‐I ./samples/*.bam won't work. You need to put a -I before each bam -I in1.bam -I in2.bam -I in3.bam

or put the path to your bams in a file with the suffix .list

find ./samples/ -type f -name "*.bam" > in.list
./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I in.list ‐o ./snps_output.vcf ....

0
Entering edit mode

Dear Pierre, I have tried your second suggestion, because I need the same snp positions in all samples(this is a population genetic study). Unfortunately, it didn't work.

My command was:

find ./samples/ -type f -name "*.bam" > in.list
./gatk-4.1.9.0/gatk --java-options "-Xmx4g" HaplotypeCaller ‐R ./reference.fasta ‐I in.list ‐o ./snps_output.vcf


My result:

***********************************************************************

A USER ERROR has occurred: Illegal argument value: Positional arguments were provided ',‐R{./reference.fasta{‐I{in.list{‐o{./snps_output.vcf}' but no positional argument is defined for this tool.

***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.

0
Entering edit mode

Can HC take multiple inputs? Not a GATK user here but I think it can only take a single sample per run to produce an intermediate gVCF which then has to be postprocess for joint variant calling, no? What Pierre said :)

0
Entering edit mode

Can HC take multiple inputs?

yes. One can generate a VCF (not gvcf) directly from a small set of BAMs

0
Entering edit mode

you want option -O not -o

0
Entering edit mode

Hi Pierre,

I have changed -o to -O but the error still present:

***********************************************************************
A USER ERROR has occurred: Illegal argument value: Positional arguments were provided ',‐R{./reference.fasta{‐I{in.list{‐O{./snps_output.vcf}' but no positional argument is defined for this tool.
***********************************************************************
Set the system property GATK_STACKTRACE_ON_USER_EXCEPTION (--java-options '-DGATK_STACKTRACE_ON_USER_EXCEPTION=true') to print the stack trace.


Also, I have tried Freebayes for the same purpose but that cannot use multiple populations/samples

freebayes -f reference.fasta --bam-list in.list >var.vcf


this creates a vcf, but samples are combined...(I have checked in IGV viewer and there I cannot see multiple samples) I have no idea why :( It seems I stuck completely :(

0
Entering edit mode

it works on my machine:

 gatk --java-options "-Xmx500m" HaplotypeCaller -R rotavirus_rf.fa -I S1.bam -I S2.bam -O ~/jeter.vcf.gz
(...)
16:39:57.296 INFO  SmithWatermanAligner - Total compute time in java Smith-Waterman : 2.00 sec
16:39:57.297 INFO  HaplotypeCaller - Shutting down engine
[October 13, 2020 4:39:57 PM CEST] org.broadinstitute.hellbender.tools.walkers.haplotypecaller.HaplotypeCaller done. Elapsed time: 0.13 minutes.
Runtime.totalMemory()=508559360


check your really have whitespaces between your arguments. For example put a echo in front of the command line and test with file - . MUST be ASCII text.

echo gatk --java-options "-Xmx500m" HaplotypeCaller -R rotavirus_rf.fa -I S1.bam -I S2.bam -O ~/jeter.vcf.gz | file -
/dev/stdin: ASCII text

0
Entering edit mode

Hi Pierre,

Unfortunately, the command still doesn't work. However, I think I found the source of the problem.

My reference is built from a library of EST sequences which were Blasted on reference species genome to recover the "full" corresponding gene. These DNA sequences were filtered to have a proper length, larger than the sequences what we want to map (paired-end reads from RADseq, ~300bp length). Reconstructed gene sequences, used as a reference, were also concatenated to have one long sequence for mapping. After, I used BWA-MEM to map about 300,000 RAD loci on this reference. I did not recognize in time that reference has some duplicated sequences, although when we Blasted, at the beginning, we kept only the very best hits.

This probably the source of error, because I recently found (What Does The Zero Mapping Quality Mean For The Bwa Mapper?) that "region is duplicated then all the reads mapping there will also map elsewhere" and this result in "Zero mapping quality" and "Variant callers ignore reads with mapping quality zero". I have checked my BAM files in IGV and almost all have zero quality...probably GATK cannot call SNPs from zero quality mapping.

Now, probably, I will go back to the Blast results and filter sequences that are duplicated. Long job. May I kindly ask your opinion? The original plan was to filter SNPs from those RAD loci that are positioned in specific genes (drought stress responsive genes: so map RAD loci on genes and after call SNPs, in every individual and populations).

https://freeimage.host/i/2bcMyN

0
Entering edit mode

that wouldn't explain the error message: A USER ERROR has occurred: Illegal argument value: Positional arguments were provided

0
Entering edit mode

You are right but at the moment I have no idea what else can cause the problem. I was thinking maybe, if there are only very few acceptable loci, GATK cannot hadle this exception, but I don't know :( Also, in IGV I see some good quality loci and some, very few, SNPs in it... so I don't know why.

May I kindly ask you to look at my data (it is very small)?