GATK memory error with Java
3
2
9 months ago

Hello, I have genotype-by-sequencing data for 400 samples and am trying to run a SNP calling pipeline with GATK. I managed to get through the HaplotypeCaller step. However, when I proceed to the CombineGVCFs step to combine all 400 g.vcf files into one, GATK fails with the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483639 + 9 is too large

Then I created 8 subset lists with 50 g.vcf files each. I used the following command to combine just the 100 samples from the first two lists:

gatk --java-options "-Xmx120G" CombineGVCFs \
    -R /mnt/tg_server/genotype_data/SNP_calling/NGS1752/Reference/New_wt_CP-MT.fasta \
    -V intermediate_1.g.vcf -V intermediate_2.g.vcf \
    -O combined_1_2.g.vcf

Still, I get the same memory error as above. I tried increasing the -Xmx value up to 500G, but that did not resolve it. I am using a Docker image of GATK and running it on Kubernetes; I also tried allocating more CPUs and memory in the Kubernetes .yml file, but nothing resolved the error.

Can you please suggest how to resolve this issue? I have considered the GenomicsDBImport approach, but my reference genome is scaffolded into 120 scaffolds, so going that way seems more cumbersome.

VCF GATK SNP Java • 1.7k views
ADD COMMENT
0

Even though it is reported as an OutOfMemoryError, this has nothing to do with how much memory you allocate. Java does not support arrays longer than 2147483647 elements (and the practical limit is slightly lower, depending on the JVM implementation), so changing -Xmx will not fix it; you simply have too many variants.

ADD REPLY
2
9 months ago

with CombineGVCFs step to combine all the 400

Combine 20 g.vcf files at a time, 20 times: this will produce 20 intermediate g.vcf files with 20 samples each.

Then combine those 20 intermediate g.vcf files to generate the final g.vcf.
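For example, something like this (an untested sketch: the batch list names, the $REF path, and the -Xmx values are placeholders, not taken from this thread):

    ### Stage 1: each batch_NN.list holds the paths of 20 per-sample g.vcf files
    for BATCH in batch_*.list ; do
        V=$(sed 's/^/-V /' $BATCH | tr '\n' ' ')
        gatk --java-options "-Xmx64G" CombineGVCFs \
             -R $REF $V \
             -O ${BATCH%.list}.g.vcf.gz
    done

    ### Stage 2: combine the 20 intermediate g.vcf files into the final one
    V=$(ls batch_*.g.vcf.gz | sed 's/^/-V /' | tr '\n' ' ')
    gatk --java-options "-Xmx64G" CombineGVCFs -R $REF $V -O combined_all.g.vcf.gz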

ADD COMMENT
0

Thank you for the suggestion.

ADD REPLY
0

combine those 20 intermediate g.vcf files to generate the final g.vcf

Will this avoid running into the memory ceiling?

ADD REPLY
0

I am not sure about this, because in the end I would still have to combine those smaller combined g.vcf files into the final one, which also requires GATK CombineGVCFs.

ADD REPLY
0

Fewer variants per run means fewer VCF indexes loaded in memory, fewer variant iterators, etc.

ADD REPLY
0

GATK 3.8: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2019-02-11-2018-08-12/23355-Combine-multisample-GVCFs

The general recommendation is to group the samples into multi-sample GVCFs of equal size; GATK gives a rule of thumb of ~200 samples per group.

ADD REPLY
1
8 months ago
Michael 54k

Looking at the other answers, I highly recommend trying a few optimizations first:

  1. Try to install and run GATK "natively", without containerization; there could be additional memory limits imposed by the container layers. Containers are great for compatibility and reproducibility, but they can also have implicit side effects. I am not saying they do here, but you should rule out the possibility. GATK can be installed via Bioconda on many platforms.
  2. When doing this, also install a more recent version (>= 4); it may contain further optimizations.

  3. If none of these are an option, or you still do not succeed, consider the GenomicsDBImport approach. The only complicated part is building the command-line interval options for the contigs. Compared with the repeated joining approach presented above, I believe the GenomicsDBImport route is simpler, more efficient, and less error-prone; the other option overcomplicates the task without much benefit.

Define a variable INTERVALS that looks like this:

INTERVALS="-L contig001 -L contig002 -L contigN"

And use it in your gatk call wrapper:

JAVAOPT="-XX:ConcGCThreads=10 -XX:ParallelGCThreads=10 -Xmx200G -Djava.io.tmpdir=$TMPDIR"

REF=/path/to/scaffolds.fasta
### The following should work for most FASTA files
INTERVALS=$(grep -e "^>" $REF | cut -f1 -d " " | sed "s/>/ -L /" | tr -d "\n")

V=""
for VCF in "$@" ; do
  ### Create an index for each feature file
  gatk --java-options "$JAVAOPT" IndexFeatureFile --tmp-dir $TMPDIR -I $VCF
  V="$V -V $VCF"
done

### Create the GenomicsDB workspace
gatk --java-options "$JAVAOPT" GenomicsDBImport \
     --tmp-dir $TMPDIR \
     $V \
     --genomicsdb-workspace-path $DB \
     $INTERVALS

As shown in the script, the INTERVALS variable can also be constructed automatically from the reference FASTA file instead of being written out by hand.
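For completeness, the resulting workspace is then typically consumed directly by GenotypeGVCFs via the gendb:// prefix (the output file name below is just an example):

    ### Joint genotyping from the GenomicsDB workspace
    ### (use gendb:///absolute/path if $DB is an absolute path)
    gatk --java-options "$JAVAOPT" GenotypeGVCFs \
         -R $REF \
         -V gendb://$DB \
         -O joint_genotyped.vcf.gz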

ADD COMMENT
0

Thank you for this valuable suggestion. Is this necessary?

JAVAOPT="-XX:ConcGCThreads=10 -XX:ParallelGCThreads=10 -Xmx200G -Djava.io.tmpdir=$TMPDIR"
ADD REPLY
0

Not really, it is just what I use; I like to keep the Java options in one place. The GC settings should speed up the process slightly, and I like to point TMPDIR at something bigger than our /tmp partition. You can simply set JAVAOPT="-Xmx500G" (or whatever fits your RAM), or leave it empty.

ADD REPLY
0

Ok, thank you very much. I will try.

ADD REPLY
0
8 months ago
raphael.B ▴ 520

You can also split your genome and work on smaller files. To get comparable complexity from one interval to another, you can generate these regions with GATK SplitIntervals.
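A rough sketch of what that could look like (the scatter count, directory name, and the per-interval CombineGVCFs call are assumptions, not part of this answer):

    ### Split the genome into ~20 interval lists of comparable size
    gatk SplitIntervals \
         -R /path/to/scaffolds.fasta \
         --scatter-count 20 \
         -O scattered_intervals/

    ### Each resulting file (typically named like 0000-scattered.interval_list)
    ### can then be passed with -L to process one chunk at a time, e.g.:
    gatk CombineGVCFs \
         -R /path/to/scaffolds.fasta \
         -L scattered_intervals/0000-scattered.interval_list \
         -V intermediate_1.g.vcf -V intermediate_2.g.vcf \
         -O combined_0000.g.vcf.gz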

ADD COMMENT
0

I see. Thank you. I will definitely give it a try.

ADD REPLY
