GATK memory error with Java
3
2
9 months ago

Hello, I have genotype-by-sequencing data for 400 samples and am trying to run a SNP calling pipeline with GATK. I managed to get through the HaplotypeCaller step. However, when I proceed to the CombineGVCFs step to combine all 400 g.vcf files into one, GATK fails with the following error:

Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483639 + 9 is too large

Then I created 8 subset lists with 50 g.vcf files each. I used the following command to combine just the 100 samples from the first two lists:

gatk --java-options "-Xmx120G" CombineGVCFs \
    -R /mnt/tg_server/genotype_data/SNP_calling/NGS1752/Reference/New_wt_CP-MT.fasta \
    -V intermediate_1.g.vcf -V intermediate_2.g.vcf \
    -O combined_1_2.g.vcf

Still, I get the same memory error as above. I tried increasing the -Xmx value up to 500G, but that did not resolve it. I am using a Docker image of GATK and running it on Kubernetes; I also tried allocating more CPUs and memory in the Kubernetes .yml file, but nothing resolved the error.

Can you please suggest how to resolve this issue? I have considered the GenomicsDBImport approach, but my reference genome is scaffolded into 120 scaffolds, so going that way seems more cumbersome.

VCF GATK SNP Java • 1.7k views
ADD COMMENT
0

Even though it is reported as an OutOfMemoryError, this has nothing to do with how much memory you allocate. Java does not support arrays longer than 2147483647 elements (and the practical limit is slightly lower, depending on the JVM implementation), so changing -Xmx will not fix it; you simply have too many variants.

ADD REPLY
2
9 months ago

with CombineGVCFs step to combine all the 400

Combine 20 g.vcf files at a time, 20 times: this will produce 20 intermediate g.vcf files with 20 samples each.

Then combine those 20 intermediate g.vcf files to generate the final g.vcf.
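For example, something like this (an untested sketch: the batch list names, the $REF path, and the -Xmx values are placeholders, not taken from this thread):

    ### Stage 1: each batch_NN.list holds the paths of 20 per-sample g.vcf files
    for BATCH in batch_*.list ; do
        V=$(sed 's/^/-V /' $BATCH | tr '\n' ' ')
        gatk --java-options "-Xmx64G" CombineGVCFs \
             -R $REF $V \
             -O ${BATCH%.list}.g.vcf.gz
    done

    ### Stage 2: combine the 20 intermediate g.vcf files into the final one
    V=$(ls batch_*.g.vcf.gz | sed 's/^/-V /' | tr '\n' ' ')
    gatk --java-options "-Xmx64G" CombineGVCFs -R $REF $V -O combined_all.g.vcf.gz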

ADD COMMENT
0

Thank you for the suggestion.

ADD REPLY
0

combine those 20 intermediate g.vcf files to generate the final g.vcf

Will this avoid running into the memory ceiling?

ADD REPLY
0

I am not sure about this, because in the end I would still have to combine those smaller combined g.vcf files into the final one, which also requires GATK CombineGVCFs.

ADD REPLY
0

Fewer variants per run means fewer VCF indexes loaded in memory, fewer variant iterators, etc.

ADD REPLY
0

GATK 3.8: https://sites.google.com/a/broadinstitute.org/legacy-gatk-forum-discussions/2019-02-11-2018-08-12/23355-Combine-multisample-GVCFs

The general recommendation is to group the samples into multi-sample GVCFs of equal size; GATK gives a rule of thumb of ~200 samples per group.

ADD REPLY
1
8 months ago
Michael 54k

Looking at the other answers, I highly recommend trying a few optimizations first:

  1. Try to install and run GATK "natively", without containerization; there could be additional memory limits imposed by the container layers. Containers are great for compatibility and reproducibility, but they can also have implicit side effects. I am not saying they do here, but you should rule out the possibility. GATK can be installed via Bioconda on many platforms.
  2. When doing this, also install a more recent version (>= 4); it may contain further optimizations.

  3. If none of these are an option, or you still do not succeed, consider the GenomicsDBImport approach. The only complicated part is building the command-line interval options for the contigs. Compared with the repeated joining approach presented above, I believe the GenomicsDBImport route is simpler, more efficient, and less error-prone; the other option overcomplicates the task without much benefit.

Define a variable INTERVALS that looks like this:

INTERVALS="-L contig001 -L contig002 -L contigN"

And use it in your gatk call wrapper:

JAVAOPT="-XX:ConcGCThreads=10 -XX:ParallelGCThreads=10 -Xmx200G -Djava.io.tmpdir=$TMPDIR"

REF=/path/to/scaffolds.fasta
### The following should work for most FASTA files
INTERVALS=$(grep -e "^>" $REF | cut -f1 -d " " | sed "s/>/ -L /" | tr -d "\n")

V=""
for VCF in "$@" ; do
  ### Create an index for each feature file
  gatk --java-options "$JAVAOPT" IndexFeatureFile --tmp-dir $TMPDIR -I $VCF
  V="$V -V $VCF"
done

### Create the GenomicsDB workspace
gatk --java-options "$JAVAOPT" GenomicsDBImport \
     --tmp-dir $TMPDIR \
     $V \
     --genomicsdb-workspace-path $DB \
     $INTERVALS

As shown in the script, the INTERVALS variable can also be constructed automatically from the reference FASTA file instead of being written out by hand.
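For completeness, the resulting workspace is then typically consumed directly by GenotypeGVCFs via the gendb:// prefix (the output file name below is just an example):

    ### Joint genotyping from the GenomicsDB workspace
    ### (use gendb:///absolute/path if $DB is an absolute path)
    gatk --java-options "$JAVAOPT" GenotypeGVCFs \
         -R $REF \
         -V gendb://$DB \
         -O joint_genotyped.vcf.gz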

ADD COMMENT
0

Thank you for this valuable suggestion. Is this necessary?

JAVAOPT="-XX:ConcGCThreads=10 -XX:ParallelGCThreads=10 -Xmx200G -Djava.io.tmpdir=$TMPDIR"
ADD REPLY
0

Not really, it is just what I use; I like to keep the Java options in one place. The GC settings should speed up the process slightly, and I like to point TMPDIR at something bigger than our /tmp partition. You can simply set JAVAOPT="-Xmx500G" (or whatever fits your RAM), or leave it empty.

ADD REPLY
0

Ok, thank you very much. I will try.

ADD REPLY
0
8 months ago
raphael.B ▴ 520

You can also split your genome and work on smaller files. To get comparable complexity from one interval to another, you can generate these regions with GATK SplitIntervals.
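A rough sketch of what that could look like (the scatter count, directory name, and the per-interval CombineGVCFs call are assumptions, not part of this answer):

    ### Split the genome into ~20 interval lists of comparable size
    gatk SplitIntervals \
         -R /path/to/scaffolds.fasta \
         --scatter-count 20 \
         -O scattered_intervals/

    ### Each resulting file (typically named like 0000-scattered.interval_list)
    ### can then be passed with -L to process one chunk at a time, e.g.:
    gatk CombineGVCFs \
         -R /path/to/scaffolds.fasta \
         -L scattered_intervals/0000-scattered.interval_list \
         -V intermediate_1.g.vcf -V intermediate_2.g.vcf \
         -O combined_0000.g.vcf.gz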

ADD COMMENT
0

I see. Thank you. I will definitely give it a try.

ADD REPLY
