Piping Markduplicates
3
2
Entering edit mode
10.5 years ago
Sam ★ 4.7k

Currently our lab are facing a great difficulty when we suddenly obtain 700+ exome sequencing data. Our server starts to fail due to the crazy amount of io involved when performing the GATK. The GATK pipeline, together with the picard tools create a large amount of intermediate files and that add on a huge burden to our server. I have recently discovered the bcbio-nextgen and are still testing its performance. However, I also wonder whether if there are simpler way to to solve this problem (e.g. piping as many steps as possible). For example, this useful tutorial has taught how to perform the analysis up till markduplication. However, that was using the rmdup from samtools instead of the recommended picards MarkDuplicates. I have checked the documentation of picards, and it was stated that

Some Picard programs, e.g. MarkDuplicates, cannot read its input from stdin, because it makes multiple passes over the input file.

So is there any recommended alternatives to the MarkDuplicates than the one in picards such that piping is supported? Currently I am trying to do this:

bwa mem -M <Ref> <Read1.fq> <Read2.fq> | samtools sort - - | samtools view -bSh - | tee <Sample>.sorted.bam | samtools index -
picard ngs exome • 9.0k views
ADD COMMENT
6
Entering edit mode
10.5 years ago

Two alternatives to MarkDuplicates that support streaming are:

  • samtools rmdup -- This is generally considered slightly inferior to the algorithm used in MarkDuplicates due to not supporting some edge cases (see this Samtools + Picard Markduplicates). It does support streaming so the performance tradeoff is often worthwhile for large datasets. bcbio-nextgen currently supports this as a standard option.

  • biobambam's markduplicates2 -- This implements the MarkDuplicates algorithm but with improved efficiency. See the paper for more details. This does still require a two pass algorithm but makes use of temporary space so you can redirect the IO to a local high speed disk instead of networked disk. The man page has more details on the algorithm.

The goal is to support biobambam in bcbio-nextgen and I hope this will available in the short term. More generally, I'd love to hear about scaling issues you find. The goal of bcbio-nextgen is to be a community resource and pipeline handling exactly these kind of scaling issues, so feedback would be incredibly useful.

ADD COMMENT
0
Entering edit mode

I am really looking forward to the updates in bcbio-nextgen, currently I am timing the runtime for bcbio-nextgen and the traditional pipeline (maximizing the pipeline and using GATK). Hopefully I can comeback with the results soon

ADD REPLY
2
Entering edit mode
10.5 years ago

from the FAQ: http://sourceforge.net/apps/mediawiki/picard/index.php?title=Main_Page#Q:_Can_Picard_programs_read_from_stdin_and_write_to_stdout.3F

Q: Can Picard programs read from stdin and write to stdout?

A: Most Picard programs can do this. To read from stdin, specify /dev/stdin as the input file. To write to stdout, specify /dev/stdout as the output file, and also add the option QUIET=true to suppress other messages that otherwise would be written to stdout. Not that Picard determines whether to write in SAM or BAM format by examining the file extension of the output file. Since /dev/stdout ends in neither .sam nor .bam, Picard defaults to writing in BAM format. Some Picard programs, e.g. MarkDuplicates, cannot read its input from stdin, because it makes multiple passes over the input file. When writing a BAM to stdout so that it can be read by another program, passing the argument COMPRESSION_LEVEL=0 to the program writing the BAM to stdout can reduce unnecessary computation.

ADD COMMENT
0
Entering edit mode

Hi Pierre, thank you for your answer. I have checked the picard manual, however, I would really like to know if there are any other alternatives of Markduplicates allowing piping.

ADD REPLY
1
Entering edit mode

no if you want to use MarkDuplicates. if piping is preferable above the algorithm selection maybe you could consider using samtools' rmdup, although I would suggest sticking to MarkDuplicates and live with that intermediate temporal file.

ADD REPLY
1
Entering edit mode
6.4 years ago
firestar ★ 1.6k

How about samblaster?

ADD COMMENT

Login before adding your answer.

Traffic: 2452 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6