Hello,
I have genotype-by-sequencing data for 400 samples and am trying to run a SNP calling pipeline with GATK. Everything worked up to and including the HaplotypeCaller step. However, when I proceed to the CombineGVCFs step to combine all 400 g.vcf files into one, GATK fails with the following error:
Exception in thread "main" java.lang.OutOfMemoryError: Required array length 2147483639 + 9 is too large
I then created 8 subset lists with 50 g.vcf files each, and used the following command just to combine the 100 samples from two of the lists:
gatk --java-options "-Xmx120G" CombineGVCFs -R /mnt/SNP_calling/Reference/genome.fasta -V intermediate_1.g.vcf -V intermediate_2.g.vcf -O combined_1_2.g.vcf
Still, I get the same memory error shown above. I increased the -Xmx value up to 500G, but that did not resolve it.
I am using a Docker image of GATK.
Can you please suggest a way to resolve this issue? I considered the GenomicsDBImport approach, but I have a scaffolded reference genome with 120 scaffolds, so going that way is more cumbersome.
Even though it is reported as an OutOfMemoryError, this has nothing to do with how much memory you give the JVM. Java arrays are indexed with 32-bit ints, so no array can hold more than 2147483647 elements, and the practical ceiling is slightly lower depending on the JVM implementation (the 2147483639 in your error is Integer.MAX_VALUE minus 8, the largest array the JDK library code will try to allocate, and it needed 9 more bytes than that). So changing -Xmx won't fix it; you simply had too many variants.
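If it helps to see the limit in isolation, here is a minimal plain-Java sketch (not GATK code; the class name and the allocation are purely illustrative) of why raising -Xmx cannot help: the failure is triggered by the requested array length itself, not by the amount of heap available.

public class ArrayLimitDemo {
    public static void main(String[] args) {
        // The length CombineGVCFs needed: a buffer already at the JDK's soft
        // maximum (Integer.MAX_VALUE - 8 = 2147483639) plus 9 more bytes.
        long requested = 2_147_483_639L + 9;
        System.out.println("Requested length: " + requested);         // 2147483648
        System.out.println("Largest possible: " + Integer.MAX_VALUE); // 2147483647

        // On typical JVMs this line throws an OutOfMemoryError regardless of
        // -Xmx, because the request is above the per-array ceiling and is
        // rejected before any heap is actually allocated.
        byte[] buffer = new byte[Integer.MAX_VALUE];
    }
}

The program prints that the requested length (2147483648) is already one past the largest length a Java array can have, and the allocation attempt then fails the same way whether the heap is 120G or 500G.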