Question: Piping Markduplicates
gravatar for Sam
5.5 years ago by
New York
Sam2.3k wrote:

Currently our lab are facing a great difficulty when we suddenly obtain 700+ exome sequencing data. Our server starts to fail due to the crazy amount of io involved when performing the GATK. The GATK pipeline, together with the picard tools create a large amount of intermediate files and that add on a huge burden to our server. I have recently discovered the bcbio-nextgen and are still testing its performance. However, I also wonder whether if there are simpler way to to solve this problem (e.g. piping as many steps as possible). For example, this useful tutorial has taught how to perform the analysis up till markduplication. However, that was using the rmdup from samtools instead of the recommended picards MarkDuplicates. I have checked the documentation of picards, and it was stated that

Some Picard programs, e.g. MarkDuplicates, cannot read its input from stdin, because it makes multiple passes over the input file.

So is there any recommended alternatives to the MarkDuplicates than the one in picards such that piping is supported? Currently I am trying to do this:

bwa mem -M <Ref> <Read1.fq> <Read2.fq> | samtools sort - - | samtools view -bSh - | tee <Sample>.sorted.bam | samtools index -
exome ngs picard • 5.6k views
ADD COMMENTlink modified 17 months ago by rmf570 • written 5.5 years ago by Sam2.3k
gravatar for Brad Chapman
5.5 years ago by
Brad Chapman9.4k
Boston, MA
Brad Chapman9.4k wrote:

Two alternatives to MarkDuplicates that support streaming are:

  • samtools rmdup -- This is generally considered slightly inferior to the algorithm used in MarkDuplicates due to not supporting some edge cases (see this Samtools + Picard Markduplicates). It does support streaming so the performance tradeoff is often worthwhile for large datasets. bcbio-nextgen currently supports this as a standard option.

  • biobambam's markduplicates2 -- This implements the MarkDuplicates algorithm but with improved efficiency. See the paper for more details. This does still require a two pass algorithm but makes use of temporary space so you can redirect the IO to a local high speed disk instead of networked disk. The man page has more details on the algorithm.

The goal is to support biobambam in bcbio-nextgen and I hope this will available in the short term. More generally, I'd love to hear about scaling issues you find. The goal of bcbio-nextgen is to be a community resource and pipeline handling exactly these kind of scaling issues, so feedback would be incredibly useful.

ADD COMMENTlink written 5.5 years ago by Brad Chapman9.4k

I am really looking forward to the updates in bcbio-nextgen, currently I am timing the runtime for bcbio-nextgen and the traditional pipeline (maximizing the pipeline and using GATK). Hopefully I can comeback with the results soon

ADD REPLYlink written 5.5 years ago by Sam2.3k
gravatar for Pierre Lindenbaum
5.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum119k wrote:

from the FAQ:

Q: Can Picard programs read from stdin and write to stdout?

A: Most Picard programs can do this. To read from stdin, specify /dev/stdin as the input file. To write to stdout, specify /dev/stdout as the output file, and also add the option QUIET=true to suppress other messages that otherwise would be written to stdout. Not that Picard determines whether to write in SAM or BAM format by examining the file extension of the output file. Since /dev/stdout ends in neither .sam nor .bam, Picard defaults to writing in BAM format. Some Picard programs, e.g. MarkDuplicates, cannot read its input from stdin, because it makes multiple passes over the input file. When writing a BAM to stdout so that it can be read by another program, passing the argument COMPRESSION_LEVEL=0 to the program writing the BAM to stdout can reduce unnecessary computation.

ADD COMMENTlink written 5.5 years ago by Pierre Lindenbaum119k

Hi Pierre, thank you for your answer. I have checked the picard manual, however, I would really like to know if there are any other alternatives of Markduplicates allowing piping.

ADD REPLYlink written 5.5 years ago by Sam2.3k

no if you want to use MarkDuplicates. if piping is preferable above the algorithm selection maybe you could consider using samtools' rmdup, although I would suggest sticking to MarkDuplicates and live with that intermediate temporal file.

ADD REPLYlink written 5.5 years ago by Jorge Amigo11k
gravatar for rmf
17 months ago by
rmf570 wrote:

How about samblaster?

ADD COMMENTlink written 17 months ago by rmf570
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1643 users visited in the last hour