I'm part of a team involved in a project where we will be running a stable analysis pipeline over a large number of samples. The stages are:
QC (custom scripts) / Mapping (bwa mem) / Variant Calling (GATK Best Practices).
We would like to avoid reinventing the wheel and build the pipeline on an established framework. Ideally this framework is not too focused on this particular pipeline, in case we need something else in the future.
I got good information from this previous Biostars post. This is a summary of options from that post:
I would love to get the community's opinion on this subject. I'm particularly fond right now of Snakemake, gkno and Invoke. I love Snakemake's simplicity and how close it is to regular make. It seems like Invoke is the current winner in the Python community at large.
gkno seems like exactly what we need, but I'm worried it could get too complex and hard to maintain.
High-throughput bioinformatic analyses increasingly rely on pipeline frameworks to process sequence and metadata. Modern implementations of these frameworks differ on three key dimensions: using an implicit or explicit syntax, using a configuration, convention or class-based design paradigm and offering a command line or workbench interface. Here I survey and compare the design philosophies of several current pipeline frameworks. I provide practical recommendations based on analysis requirements and the user base.
I wrote this review paper in order to bring some organization to the discussion of pipeline frameworks.
I've been working with Queue for about a year and a half now, and have it deployed in production at our core facility. I find that it strikes a good balance between expressiveness and simplicity. It has good cluster support, will of course play really nicely with all the GATK tools, and is easy to extend to any command line program you might want to run. If you're interested, here is the "fork" that we run, including some pipelines: https://github.com/johandahlberg/piper
Add BigDataScript to the list. It's another scripting language to learn, but then it allows you to seamlessly run pipelines locally or on a cluster, manage jobs, make checkpoints during execution, etc. Open sourced and published (2014).
Bpipe is the tool of choice here. It has excellent support for threading, easy restarting of jobs that failed at a certain step in the workflow, easy stitching together of different steps, and management of input and output naming.
The Broad recently announced their replacement for Queue, Cromwell/WDL. We just started checking it out and it looks promising. When we did the initial search 2 years ago, we ended up choosing Queue. It worked for us and it was nice to get free advanced scatter-and-gather for GATK tools. However, maintaining Queue scripts in Scala was painful, particularly for non-GATK tools. We recently decided to migrate to Snakemake, our initial runner-up.
With the announcement of Cromwell and the upcoming release of a WDL implementation of the GATK Best Practices, we are now considering migrating to Cromwell instead.
I've been trying a few different approaches over the last year or so. Currently my production pipeline is implemented as a makefile, per sample. All of my analysis is being run on a local workstation and not a cluster so it works well for that. I have been developing a data management system (hopefully soon to be written up and published) and am trying out a few more complex approaches there to make it more scalable. For relatively straightforward pipelines I do really like make or snakemake, particularly if this doesn't need to be run on a cluster.
I highly recommend versioning your makefile templates. Any time you make changes, it should be a new version. For all projects/samples, always store the makefile that was used alongside the data. This means you can always reproduce your data exactly. You should also record which versions of the binaries (BWA, GATK, Picard, etc.) were used.
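To make that concrete, here is a minimal Python sketch of one way to keep the makefile and tool versions with the results. It is an illustration only, not what I actually run: the function names, the manifest.json file name and the layout of the manifest are all assumptions, and the version strings are whatever you pass in.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path):
    """Return the hex SHA-256 digest of a file's contents."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(makefile, tool_versions, out_dir):
    """Copy the makefile into out_dir and write a JSON manifest beside it.

    tool_versions is a plain dict supplied by the caller, e.g. collected
    by hand or from each tool's own version output (hypothetical layout).
    """
    makefile = Path(makefile)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Keep the exact makefile that produced the data.
    (out_dir / makefile.name).write_bytes(makefile.read_bytes())

    manifest = {
        "makefile": makefile.name,
        "makefile_sha256": sha256_of(makefile),
        "tools": dict(tool_versions),
    }
    # sort_keys keeps the file stable, so it diffs cleanly between runs.
    (out_dir / "manifest.json").write_text(
        json.dumps(manifest, indent=2, sort_keys=True)
    )
    return manifest
```

Calling write_manifest once per sample directory, at the end of a run, gives you a checksum to compare against the versioned template and a record of the tool versions you passed in.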
Of all these pipeline infrastructures, which ones allow you to distribute some parts of the pipeline across compute nodes and run other parts on a single node, as in the GATK Exome Pipeline? You can map the samples on different nodes, but when doing indel realignment or recalibration, it's best to have all the samples on a single node. After that, you can continue processing each sample on the compute nodes. I've only seen BDS and Queue handle this.
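For anyone unsure what pattern is being asked about, here is a rough Python sketch of the scatter / gather / scatter shape. It is only a toy: threads on one machine stand in for compute nodes, the three step functions are hypothetical placeholders that return strings instead of running real tools, and this is not how BDS or Queue implement it.

```python
from concurrent.futures import ThreadPoolExecutor


def map_sample(sample):
    """Per-sample step (placeholder for mapping one sample)."""
    return f"{sample}.bam"


def joint_step(bams):
    """Step that needs every sample at once (placeholder for the single-node stage)."""
    return {bam: f"cohort_of_{len(bams)}" for bam in bams}


def finish_sample(bam, cohort_info):
    """Per-sample step that runs after the joint step."""
    return f"{bam}:{cohort_info}"


def run_pipeline(samples, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Scatter: samples are independent, so process them in parallel.
        bams = list(pool.map(map_sample, samples))
        # Gather: a single call sees all samples together.
        shared = joint_step(bams)
        # Scatter again for the remaining per-sample work.
        return list(pool.map(lambda bam: finish_sample(bam, shared[bam]), bams))
```

The point is that the middle step is a barrier: nothing downstream starts until every sample has been mapped, and only that one step needs to see all the data in one place.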