Our lab is currently facing a serious problem: we have suddenly received 700+ exome sequencing samples. Our server has started to fail under the amount of I/O involved in running GATK. The GATK pipeline, together with the Picard tools, creates a large number of intermediate files, which puts a huge burden on the server. I recently discovered bcbio-nextgen and am still testing its performance. However, I also wonder whether there is a simpler way to solve this problem (e.g. piping as many steps as possible). For example, this useful tutorial shows how to perform the analysis up to the duplicate-marking step. However, it uses rmdup from samtools instead of the recommended Picard MarkDuplicates. I have checked the Picard documentation, which states:
Some Picard programs, e.g. MarkDuplicates, cannot read its input from stdin, because it makes multiple passes over the input file.
So is there a recommended alternative to Picard's MarkDuplicates that supports piping? Currently I am trying to do this:
bwa mem -M <Ref> <Read1.fq> <Read2.fq> | samtools sort - - | samtools view -bSh - | tee <Sample>.sorted.bam | samtools index -
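For comparison, here is a minimal sketch of the same align, sort and index chain, assuming samtools 1.x, where `samtools sort` accepts SAM on stdin and writes its output with `-o`. The function name `align_sort_index` and the `<sample>.sorted.bam` naming are illustrative choices, not part of any tool. Since `samtools index` needs a BAM file on disk rather than a stream, indexing runs as a separate step after the sort has finished, and no duplicate marking is attempted here.

```shell
#!/bin/sh
# Enable pipefail where the shell supports it, so a failing bwa is not
# masked by a successful samtools sort at the end of the pipe.
( set -o pipefail ) 2>/dev/null && set -o pipefail

# Hypothetical helper: align paired-end reads, coordinate-sort, then index.
# Arguments: reference FASTA, read 1 FASTQ, read 2 FASTQ, sample name.
align_sort_index() {
    ref=$1
    read1=$2
    read2=$3
    sample=$4

    # bwa mem streams SAM straight into samtools sort; the only file
    # written to disk by this step is the final sorted BAM.
    bwa mem -M "$ref" "$read1" "$read2" \
        | samtools sort -o "${sample}.sorted.bam" - \
        && samtools index "${sample}.sorted.bam"
}
```

It would be called as `align_sort_index <Ref> <Read1.fq> <Read2.fq> <Sample>`. Note that `samtools sort` may still spill temporary chunks to disk for large inputs, so this reduces the number of intermediate files rather than eliminating disk I/O.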