I am having major problems with getting MarkDuplicates to run to completion.
- Java 1.8.0 64-bit
- picard-tools 2.0.1
I was under the impression that Picard's MarkDuplicates was relatively memory inexpensive (compared to other tools used in a variant calling pipeline). However, MarkDuplicates consistently tries to allocate arrays that exceed the VM limit. What is going on?
Sample Script (automatically generated):
echo '#!/usr/bin/env bash
java -Xms12G -Xmx14G -jar /path/to/picard-tools-2.0.1/picard.jar MarkDuplicates \
    INPUT=[filepath].bwamem.sorted.bam \
    OUTPUT=[filepath].bwamem.sorted.dedup.bam \
    REMOVE_DUPLICATES=true \
    MAX_RECORDS_IN_RAM=350000 \
    ASSUME_SORTED=true \
    METRICS_FILE=[filepath].dedup_metrics.txt' \
  | sbatch --job-name=[job name] --time=0 --mem=24G --partition=bigmemh
Okay, so here is my reasoning with these parameters:
- Set the initial heap space a little lower than the maximum so that ~12G is the target working memory for the process (the JVM can then adjust its consumption as needed)
- Set the maximum heap space to 14G, which should be plenty (see the sanity check sketched after this list)
- Set MAX_RECORDS_IN_RAM to 1/10 of the value recommended in the GATK documentation
- Set the memory allocated by the dispatcher to be much greater than the maximum heap space for the process
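For reference, here is how I can sanity-check what the JVM actually resolves those heap flags to. This is only a sketch, and it assumes the same java binary the job uses:

java -Xms12G -Xmx14G -XX:+PrintFlagsFinal -version 2>/dev/null | grep -iE 'initialheapsize|maxheapsize'
# should report InitialHeapSize around 12G and MaxHeapSize around 14G (values are printed in bytes)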
Still, this doesn't work: the same error appears after a long run, usually just after the process has "completed" and before the output files are written (as far as I can tell).
I have tried heap sizes that are on the low end (old forum posts suggest this helps): 2GB, 4GB, 6GB, 8GB, 12GB, 16GB, 20GB
... and I have also tried a variety of higher values: 30GB, 50GB, 80GB, 100GB, 200GB, 300GB, 400GB, 480GB (the maximum RAM for a node on our cluster). All give the same result.
Here is a sample stack trace:
Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
	at java.util.Arrays.copyOf(Arrays.java:3332)
	at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
	at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
	at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
	at java.lang.StringBuilder.append(StringBuilder.java:136)
	at htsjdk.samtools.SAMTextHeaderCodec.advanceLine(SAMTextHeaderCodec.java:131)
	at htsjdk.samtools.SAMTextHeaderCodec.decode(SAMTextHeaderCodec.java:86)
	at htsjdk.samtools.BAMFileReader.readHeader(BAMFileReader.java:503)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:166)
	at htsjdk.samtools.BAMFileReader.<init>(BAMFileReader.java:125)
	at htsjdk.samtools.SamReaderFactory$SamReaderFactoryImpl.open(SamReaderFactory.java:261)
	at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:204)
	at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:291)
	at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:139)
	at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209)
	at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
	at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
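What puzzles me is that the trace points at header decoding (SAMTextHeaderCodec appending to a StringBuilder) rather than the duplicate-marking itself. A rough check of the header I could run, sketched here on the assumption that samtools is available (file path is the usual placeholder):

samtools view -H [filepath].bwamem.sorted.bam | wc -c                                          # total header size in bytes
samtools view -H [filepath].bwamem.sorted.bam | awk '{ print length, $1 }' | sort -rn | head   # longest header lines and their tags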
So, what parameters should I be using to get MarkDuplicates to run to completion? What am I doing wrong?
If not, what should I use instead of MarkDuplicates?
Some more information: it seems that my files accumulate unmatched pairs very quickly, which I guess is caused by many records whose mates map far away on the same chromosome. I don't know how to rectify this, but I am going to try filtering out low-quality reads before marking duplicates and see whether that helps. The problem does not appear to be the amount of memory available, but poor memory usage in this particular case. I will also try an even smaller MAX_RECORDS_IN_RAM value, since disk space is no concern.
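To make that plan concrete, the pre-filtering I have in mind is roughly the following. This is only a sketch, assuming samtools is available; the MAPQ cutoff and the properly-paired requirement are placeholders I have not settled on:

# keep only properly paired reads (-f 0x2) with MAPQ >= 20 (-q 20) before marking duplicates;
# reads whose mates map far away are usually not flagged as properly paired, so this should also
# reduce the accumulation of unmatched pairs
samtools view -b -q 20 -f 0x2 [filepath].bwamem.sorted.bam > [filepath].bwamem.sorted.filtered.bam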