Question: Error in Picard tool MarkDuplicates
jcanitz wrote, 2.5 years ago:

Hey,

I have a problem with the Picard tool MarkDuplicates.

To give an overview: I have a sorted *.bam input file and want to run the MarkDuplicates tool from picard.jar. First I tried it with default settings, but I got an error, which I solved by adding "READ_NAME_REGEX=null". Now my command is:

java -jar /home/jules/programs/picard/build/libs/picard.jar MarkDuplicates \
I=CCO_bowtie2Mapping.sorted.bam \
O=CCO_picard_DM.bam \
M=CCO_marked_dup_metrics.txt \
READ_NAME_REGEX=null

But I got a new error. Unfortunately, I cannot spot the actual error in the wall of output the terminal shows me.

[Mon Feb 20 13:53:03 CET 2017] picard.sam.markduplicates.MarkDuplicates SORTING_COLLECTION_SIZE_RATIO=0.1 INPUT=[CCO_bowtie2Mapping.sorted.bam] OUTPUT=CCO_picard_DM.bam METRICS_FILE=CCO_marked_dup_metrics.txt READ_NAME_REGEX=null TMP_DIR=[/home/jules/programs/picard/build/tmp/compileJava/emptySourcePathRef/working_tmp]    MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 REMOVE_SEQUENCING_DUPLICATES=false TAGGING_POLICY=DontTag REMOVE_DUPLICATES=false ASSUME_SORTED=false DUPLICATE_SCORING_STRATEGY=SUM_OF_BASE_QUALITIES PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false VALIDATION_STRINGENCY=STRICT COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false GA4GH_CLIENT_SECRETS=client_secrets.json
[Mon Feb 20 13:53:03 CET 2017] Executing as jules@jules-TERRA-PC on Linux 4.4.0-62-generic amd64; Java HotSpot(TM) 64-Bit Server VM 1.8.0_111-b14; Picard version: 2.7.1-24-gc8bbc7a-SNAPSHOT
INFO    2017-02-20 13:53:03 MarkDuplicates  Start of doWork freeMemory: 245340416; totalMemory: 251658240; maxMemory: 3726639104
INFO    2017-02-20 13:53:03 MarkDuplicates  Reading input file and constructing read end information.
INFO    2017-02-20 13:53:03 MarkDuplicates  Will retain up to 5733290 data points before spilling to disk.
[Mon Feb 20 13:53:08 CET 2017] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 0.09 minutes.
Runtime.totalMemory()=1812987904
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" htsjdk.samtools.SAMException: /home/jules/programs/picard/build/tmp/compileJava/emptySourcePathRef/working_tmp/CSPI.6417316537446830880.tmp/1567.tmpnot found
    at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:63)
    at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:49)
    at htsjdk.samtools.util.ResourceLimitedMap.get(ResourceLimitedMap.java:76)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.getOutputStreamForSequence(CoordinateSortedPairInfoMap.java:180)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:102)
    at htsjdk.samtools.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:86)
    at picard.sam.markduplicates.util.DiskBasedReadEndsForMarkDuplicatesMap.remove(DiskBasedReadEndsForMarkDuplicatesMap.java:61)
    at picard.sam.markduplicates.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:471)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:222)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:208)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:95)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:105)
Caused by: java.io.FileNotFoundException: /home/jules/programs/picard/build/tmp/compileJava/emptySourcePathRef/working_tmp/CSPI.6417316537446830880.tmp/1567.tmp (Too many open files)
    at java.io.FileOutputStream.open0(Native Method)
    at java.io.FileOutputStream.open(FileOutputStream.java:270)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:213)
    at htsjdk.samtools.util.FileAppendStreamLRUCache$Functor.makeValue(FileAppendStreamLRUCache.java:60)
    ... 11 more

I tried to increase the limit on open files, but even setting it to unlimited did not help. Hopefully someone can help me. Thanks a lot. Julia
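(For context: Linux distinguishes a soft and a hard open-file limit, and an ordinary user can raise the soft limit only up to the hard ceiling, which is one reason an "unlimited" setting may not take effect. A quick way to inspect both, assuming a Bourne-compatible shell:)

```shell
# Soft limit: what processes launched from this shell actually get
ulimit -Sn
# Hard ceiling: the most a non-root user can raise the soft limit to
ulimit -Hn
# Raising the soft limit only sticks up to the hard ceiling:
ulimit -Sn 4096 2>/dev/null || echo "could not raise soft limit to 4096"
```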

dariober (WCIP | Glasgow | UK) wrote, 2.5 years ago:

MarkDuplicates takes an option MAX_FILE_HANDLES_FOR_READ_ENDS_MAP (Integer):

Maximum number of file handles to keep open when spilling read ends to disk. Set this number a little lower than the per-process maximum number of files that may be open. This number can be found by executing the 'ulimit -n' command on a Unix system. Default value: 8000. This option can be set to 'null' to clear the default value.

Try reducing this number maybe?
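A minimal sketch of putting that into practice: derive a value from `ulimit -n` and pass it through. The factor of two is just an arbitrary safety margin, and the file names simply echo the ones in the question.

```shell
# Derive a safe handle count from the shell's per-process open-file limit.
LIMIT=$(ulimit -n)
[ "$LIMIT" = "unlimited" ] && LIMIT=8192   # fall back to a finite number
MAX_HANDLES=$(( LIMIT / 2 ))               # factor of 2 = arbitrary headroom
echo "MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=$MAX_HANDLES (ulimit -n: $LIMIT)"

# Rerun MarkDuplicates with the lower cap (guarded so the snippet is
# copy-paste safe; substitute your actual picard.jar path):
if [ -f picard.jar ]; then
    java -jar picard.jar MarkDuplicates \
        I=CCO_bowtie2Mapping.sorted.bam \
        O=CCO_picard_DM.bam \
        M=CCO_marked_dup_metrics.txt \
        READ_NAME_REGEX=null \
        MAX_FILE_HANDLES_FOR_READ_ENDS_MAP="$MAX_HANDLES"
fi
```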

I've never seen this problem, though. Are your files particularly big?

As an aside, I've been impressed by the speed of processing and ease of use of sambamba markdup.


Additionally, you're running with very little memory, roughly 2GB:

Runtime.totalMemory()=1812987904

Running on a system with more memory will greatly improve stability. 2 GB causes odd behavior in many bioinformatics applications that deal with large files (particularly files too big to fit into memory). If you have more than 2 GB, make sure you are using a 64-bit version of Java. You can also specify "-Xmx16g", for example, to raise Java's memory limit, provided you actually have more than 16 GB of RAM installed. More memory will, in this case, allow fewer temp files to be written.
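To see what heap the local JVM would grant by default, and then rerun with a larger one, something like this works (16g is only an example figure; the jar path is the one from the question):

```shell
# Show the default maximum heap size the local JVM would use (if installed):
if command -v java >/dev/null 2>&1; then
    java -XX:+PrintFlagsFinal -version 2>/dev/null | grep -w MaxHeapSize
fi

# Then rerun with an explicit, larger heap cap, e.g.:
#   java -Xmx16g -jar /home/jules/programs/picard/build/libs/picard.jar \
#       MarkDuplicates I=... O=... M=...
```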

For rapidly marking or removing duplicates, you can also look at Clumpify, which works directly on fastq files and does not require alignment.

— Brian Bushnell, 2.5 years ago
Powered by Biostar version 2.3.0