MarkDuplicates memory issue
0
0
Entering edit mode
6.9 years ago
cacampbell ▴ 50

Hello,

I am having major problems with getting MarkDuplicates to run to completion.

Relevant Programs:

- Java 1.8.0 64-bit

- picard-tools 2.0.1

I was under the impression that Picard's MarkDuplicates was relatively memory inexpensive (compared to other tools used in variant calling pipeline). However, MarkDuplicates consistently allocates arrays that exceed the VM limit. What is going on?

Sample Script (automatically generated):

echo '#!/usr/bin/env bash
java -Xms12G -Xmx14G -jar /path/to/picard-tools-2.0.1/picard.jar MarkDuplicates INPUT=[filepath].bwamem.sorted.bam OUTPUT=[filepath].bwamem.sorted.dedup.bam             REMOVE_DUPLICATES=true MAX_RECORDS_IN_RAM=350000 ASSUME_SORTED=true             METRICS_FILE=[filepath].dedup_metrics.txt' | sbatch --job-name=[job name] --time=0 --mem=24G --partition=bigmemh

Okay, so here is my reasoning with these parameters:

- Set the initial heap space a little lower than the maximum heap space to target 12G as the optimal memory to be used by the process (constantly adjusts memory consumption)

- Set maximum heap space to 14G, this should be plenty

- Set the max records in RAM to 1/10 the recommended amount by GATK, according to the documentation

- Set the memory allocated by the dispatcher to be much greater than the maximum heap space for the process

Still, this doesn't work, with the same error arising after a long time, usually just after the process has "completed" and before files are written (I guess).

I have used Heap sizes that are generally very low (this helps according to old forum posts): 2GB, 4GB, 6GB, 8GB, 12GB, 16GB, 20GB

... and I have also tried a variety of higher values: 30GB, 50GB, 80GB, 100GB, 200GB, 300GB, 400GB, 480GB (max RAM for a node on our cluster). All have the same result.

Here is a sample stack trace:

Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.util.Arrays.copyOf(Arrays.java:3332)
at java.lang.AbstractStringBuilder.expandCapacity(AbstractStringBuilder.java:137)
at java.lang.AbstractStringBuilder.ensureCapacityInternal(AbstractStringBuilder.java:121)
at java.lang.AbstractStringBuilder.append(AbstractStringBuilder.java:421)
at java.lang.StringBuilder.append(StringBuilder.java:136)
at htsjdk.samtools.SamReaderFactorySamReaderFactoryImpl.open(SamReaderFactory.java:261) at picard.sam.markduplicates.util.AbstractMarkDuplicatesCommandLineProgram.openInputs(AbstractMarkDuplicatesCommandLineProgram.java:204) at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:291) at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:139) at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:209) at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95) at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105) So, what parameters should I be using to get MarkDuplicates to run to completion? What am I doing wrong? Don't know? Then what should I use instead of MarkDuplicates? Thanks __________________ EDIT __________________ Some more information. It seems that my files have the problem of accumulating unmatched pairs very quickly. I guess this is caused by many records with mates that are far away on the same chromosome. I have no idea how to rectify this issue, but I am going to try to filter low quality reads before marking duplicates and see if that works. The problem is not actually memory, but poor memory usage in this particular case. I will try an even smaller MAX_RECORDS_IN_RAM value as well, as disk space is no concern. alignment Picard MarkDuplicates • 9.7k views ADD COMMENT 0 Entering edit mode You might test (e.g., with /usr/bin/time -v) how much memory picard is actually using. We recently had an issue where the scheduler was killing things incorrectly, regardless of what sort of memory we requested. ADD REPLY 0 Entering edit mode Exception in thread "main" java.lang.OutOfMemoryError: Requested array size exceeds VM limit  This is definitely an issue of Picard running out of memory, rather than the scheduler killing it. Picard is very inefficient, but it shouldn't need that much memory. How big is the bam file? Sometime people set the JAVA_OPTS environment variable, which screws everything up. It overrides your -Xmx argument. Try echoJAVA_OPTS to see if it is set. If so, clear it and you should be fine.

0
Entering edit mode

Thanks, will try this and report back.

EDIT: JAVA_OPTS has not been set :/

These BAM files are 80GB at maximum (a single lane each)

0
Entering edit mode

hi,

One another optimization you could add is COMPRESSION_LEVEL=0

as one of the parameters. By default, Picard tries to make a compressed BAM (smaller size). Using that parameter makes the BAM larger in size but Picard then runs much faster. I am not sure if this would help in memory issue but give it a try.

0
Entering edit mode

Er, if it's running out of memory, how would increasing the file size solve the problem?

1
Entering edit mode

Compression takes memory. Having said that, if it crashes with 480GB of RAM then the small amount required for compression isn't going to be the issue.

0
Entering edit mode

I understand, but as you said, the overhead for compression << your average BAM file. So the rationale for proposing this solution escapes me, but I thought I'd ask.

0
Entering edit mode

Thanks for taking the time to reply, I will try this.

0
Entering edit mode

If most of your reads are not properly paired, that will use much more memory. Perhaps you should verify that the files are ordered correctly. Some preprocessing steps (like using fastx toolkit for anything) will corrupt your files. What was the proper pairing rate, which would have been output by the mapping program? And what preprocessing steps did you take?

0
Entering edit mode

verify files are ordered correctly

Okay, sounds like a good plan.

We used bwa mem for alignment, which I believe should have around an 85% mapping rate. For preprocessing, we used FASTQC to inspect the reads and cutadapt to remove our adapters. The quality 'looked good', according to FASTQC.

So, not sure what the issue is -- but I am currently applying a filtering step to the Sorted BAM files to see if that alleviates the memory problem.

Sort I am using:

samtools view -F 0x04 -f 0x02 -bq 1 {in} > {out}


As I understand it (and, I am just a novice), this should remove all alignments with flag 4 and keep only alignments with flag 2 (redundant, but that's okay for peace of mind), and then filter quality scores of 1 (P_error ~ .8). 0x02, 0x2, 2 is PROPER_PAIR and 0x004, 0x4, 4 is UNMAP, according to samtools docs.

When this is done (actually taking a long time), then I will attempt to mark and remove duplicates again.

0
Entering edit mode

But specifically -

1. When you ran Cutadapt, did you do each file individually, or both at once?
2. When you mapped with bwa-mem, what did it say was the rate of properly-paired reads?

Also, if you still have the fastq input to the mapping program, you can download BBMap and run this:

reformat.sh in1=read1.fq in2=read2.fq vpair


This will verify that, according to the read names, the reads are correctly paired (ordered correctly).

Lastly - what kind of reference are you mapping against? If it's a metagenome with millions of tiny contigs, this kind of thing (a very low proper pair mapping rate) is expected.

0
Entering edit mode
1. Each file individually
2. Unfortunately, I wasn't the one who did these alignments, just the one who is responsible for getting SNPs out of them, so those output files are long gone, on some other cluster, at some other institution.

I do still have the original fastq files, so it is conceivable that I could redo this whole process if necessary. I will check that the pairs are ordered correctly according to BBMap. What should I be expecting? What can I do about the possible outputs?

Okay, so I think you are right on, the reference is from a large conifer species, and the reads that we got (from collaborators in China) for alignment were very quick and cheap. If I don't get MarkDuplicates to work here (and I haven't had issue with it on different projects), then I may just abandon this step and proceed with the filtered, sorted, reads.

Thanks for the help, I appreciate the donation of your time.

0
Entering edit mode

You're welcome. Also, BBMap has a utility [repair.sh] that can re-pair the reads in the fastq files if they were processed incorrectly in a prior step. You can also, of course, simply skip duplicate removal - if the data was not amplified, there's no reason to do it (which is important to know).

Still, the best approach is to start with the raw reads and process them correctly as pairs at every stage, and then map them as pairs; incorrectly-ordered reads sent to an aligner as pairs will give you inferior results due to the "rescue" operation of mapping programs and heuristics that evaluate potential mapping locations, as well as variant-callers which give different weights to variants seen in properly-paired versus unpaired reads.