Question: Which Bioinformatic Friendly Pipeline Building Framework?
55
3.7 years ago by
Carlos Borroto1.5k
Washington Metropolitan Area
Carlos Borroto1.5k wrote:

Hi,

I'm part of a team involved in a project where we will be running a stable analysis pipeline over a large number of samples.

QC(custom scripts) / Mapping(bwa mem) / Variant Calling(GATK Best Practices).

We would like not to reinvent the wheel and to build the pipeline using an established framework. Ideally this framework is not too focused on this particular pipeline, in case we need something else in the future.

I got good information from this previous Biostars post. This is a summary of options from that post:

Not mentioned in that post but that I'm also looking into:

New options after this post was initially written:

I would love to get the community's opinion on this subject. Right now I'm particularly fond of Snakemake, gkno and Invoke. I love Snakemake's simplicity and how close it is to regular make. It seems like Invoke is the current winner in the Python community at large. gkno seems like exactly what we need, but I'm worried it could get too complex and hard to maintain.

Latest edit: Added Toil.
scripting • 12k views
ADD COMMENTlink modified 3 months ago by pwwang0 • written 3.7 years ago by Carlos Borroto1.5k
5

I've been very happy with snakemake. The cluster support is pretty robust and works on our rather odd PBS system just fine. The author is extremely responsive (bug fixes in minutes to hours, typically).

ADD REPLYlink written 3.7 years ago by Sean Davis23k
3

I can second snakemake. It is very readable and intuitive, works with clusters and is pretty robust.

ADD REPLYlink written 2.2 years ago by Tom190
1

I like snakemake because I can whip something up quickly, but it gets very slow when running on hundreds of tasks.

ADD REPLYlink modified 9 months ago • written 9 months ago by Lynxoid200
2

Look at all these Python pipeline frameworks!
Wrappers for subprocess, to wrap Popen, to wrap os.execvp, to finally, and inevitably, run somescript.sh
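To make that layering concrete, here is a minimal sketch of what most Python frameworks end up doing for each step: handing a command line to `subprocess`. The helper name `run_step` is hypothetical, and the Python interpreter itself stands in for `somescript.sh` so the example runs anywhere.

```python
import subprocess
import sys


def run_step(command):
    """Run one pipeline step and return its stdout; raise if it exits non-zero."""
    # subprocess.run is a convenience wrapper around Popen, which on POSIX
    # eventually forks and calls an exec* function to start the program.
    result = subprocess.run(command, check=True, capture_output=True, text=True)
    return result.stdout


if __name__ == "__main__":
    # Stand-in for somescript.sh: a child Python process that prints one line.
    print(run_step([sys.executable, "-c", "print('step finished')"]), end="")
```

Everything a framework adds (dependency tracking, cluster submission, restarts) sits on top of a call like this one.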

ADD REPLYlink written 2.4 years ago by John11k
2

Might add nextflow...

http://nextflow.io/

ADD REPLYlink modified 23 months ago • written 23 months ago by Sean Davis23k

I think you are in as good a position as anyone to review these

ADD REPLYlink written 3.7 years ago by Jeremy Leipzig17k

Nice list of pipeline tools/frameworks. It can be useful for a lot of people.

Would be nice to add Bpipe as mentioned below. (https://code.google.com/p/bpipe/).

It is a nice one, well documented and well maintained. (Here is the publication in Bioinformatics: http://www.ncbi.nlm.nih.gov/pubmed/22500002)

ADD REPLYlink written 20 months ago by Juke-341.0k

Just a correction: the documentation for Bpipe is now here. I am not affiliated with the tool in any way, other than being a user :)

ADD REPLYlink modified 18 months ago • written 20 months ago by A. Domingues1.4k

We've recently developed NextflowWorkbench, which builds on Nextflow, but adds a user interface, modular workflow with libraries of processes and a docker IDE. Your workflows can be developed on a laptop/desktop and then run on a cluster or in the cloud. See this recent preprint: http://biorxiv.org/content/early/2016/03/28/041236

ADD REPLYlink written 17 months ago by fac2003160
25
18 months ago by
Philadelphia, PA
Jeremy Leipzig17k wrote:

A review of bioinformatic pipeline frameworks

http://bib.oxfordjournals.org/content/early/2016/03/23/bib.bbw020.full

High-throughput bioinformatic analyses increasingly rely on pipeline frameworks to process sequence and metadata. Modern implementations of these frameworks differ on three key dimensions: using an implicit or explicit syntax, using a configuration, convention or class-based design paradigm and offering a command line or workbench interface. Here I survey and compare the design philosophies of several current pipeline frameworks. I provide practical recommendations based on analysis requirements and the user base.

I wrote this review paper in order to bring some organization to the discussion of pipeline frameworks.

[Image: table from the review comparing pipeline frameworks by category, with star ratings]

ADD COMMENTlink modified 18 months ago • written 18 months ago by Jeremy Leipzig17k
7

As I pointed out to the author on Twitter, the information in this table seems very arbitrary. If you read the review carefully, it never defines how the number of stars is determined for each category. In essence, these visuals are the opinion of the author, and I think they are misleading. In my experience, the category that contains Snakemake, BigDataScript and Nextflow has much better performance than Galaxy/Taverna, but is likely more difficult for beginners to use, which is pretty much the opposite of what the table shows.

ADD REPLYlink written 17 months ago by fac2003160
1

Great article, thanks a lot. Glad to see the work behind Toil getting mentioned as well. I stumbled onto Toil a little by accident this summer and have switched over completely. I've coded up a Python library of wrappers for different tools and specific configurations, and it all sits on top of Toil for handling task processing and job allocation/execution. It's quite powerful. Just getting my scripts up and running with Mesos now.

ADD REPLYlink written 18 months ago by Dan Gaston6.8k
1

thanks. all this sounds very complicated - i'm subtracting a half-star

ADD REPLYlink modified 18 months ago • written 18 months ago by Jeremy Leipzig17k
1

For Toil? It can be very straightforward. The Toil aspect of writing any code is actually itself quite simple (although the documentation is currently a little sparse). It's quite easy to write a script of Toil tasks and just submit it. In my case I wanted a system a bit more like bcbio-nextgen in some respects, so that's all of the additional code I've been working on.

ADD REPLYlink written 18 months ago by Dan Gaston6.8k

Any links to code? What do you do to get a good Mesos environment up-and-running? I've been using snakemake quite happily, but the common workflow language folks seem quite interested in toil. In addition, toil seems to be a bit more platform agnostic.

ADD REPLYlink written 18 months ago by Sean Davis23k

Getting mesos itself up and running is pretty straightforward, although I'm no expert and haven't yet done a lot of job submissions with it. I'm also currently debugging any tweaks I may need to make to my toil script as it doesn't seem to be cleanly submitting a job to the whole mesos cluster. But I think that is a configuration issue on my part. Just haven't had a chance to do it yet. I'll post something when I have it up. For getting a mesos cluster going I recommend the mesosphere tutorial: here. I'm working on a physical cluster with no other HPC software running on it, so it is set up like independent machines. The tutorial would also work for cloud instances.

ADD REPLYlink written 17 months ago by Dan Gaston6.8k

Nice!! Thank you Jeremy :)

ADD REPLYlink modified 18 months ago • written 18 months ago by John11k

Jeremy, it's a nice review article, thank you for posting it here.

ADD REPLYlink modified 17 months ago • written 17 months ago by gsr999960
5
3.7 years ago by
Christian2.5k
Vienna
Christian2.5k wrote:

If you don't need cluster support, my vote goes to the good old make. Powerful and bug free.
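For readers who have not used make, its core decision can be sketched in a few lines: a target is rebuilt when it is missing or older than any of its prerequisites. This is an illustration of the idea only, not make's actual implementation, and the function name `needs_rebuild` is made up for the example.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # A missing prerequisite raises FileNotFoundError here, which roughly
    # mirrors make stopping when it has no rule to create an input.
    return any(os.path.getmtime(path) > target_mtime for path in prerequisites)
```

For example, `needs_rebuild("sample.bam", ["sample.fastq"])` would be true right after the FASTQ file is regenerated, which is why re-running make only redoes the stale steps.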

ADD COMMENTlink written 3.7 years ago by Christian2.5k
1

Make is good but not scalable in any way. Cluster support for shared and distributed filesystems (such as Amazon) is pretty much not possible with make.

ADD REPLYlink written 23 months ago by ngsbioinformatics30

This is absolutely not true. Electric Make can let you build a cluster of literally any size of nodes, physical hardware or cloud, and parallelize not just Makefile builds but also any product that divides work by spawning processes. I used to be one of their pre-sales solutions engineers. It's not free, but you get what you pay for. There is also a free community product but it's limited to your local developer network, max 8 machines, 8 cores per machine. But you'd be surprised how much performance you can get out of a small cluster like that.

ADD REPLYlink modified 5 months ago • written 5 months ago by flybd50
1

Appreciate your addition but the "This is absolutely not true" well, isn't true. @ngsbioinformatics was referring to plain old make, and not Electric Make. But good to know that that product exists and is capable.

ADD REPLYlink written 5 months ago by Dan Gaston6.8k
5
3.7 years ago by
Johan770
Sweden
Johan770 wrote:

I've been working with Queue for about a year and a half now, and have it deployed in production at our core facility. I find that it strikes a good balance between expressiveness and simplicity. It has good cluster support, will of course play really nicely with all the GATK tools and is easy to extend to any command line program you might want to run. If you're interested here is the "fork" that we run: https://github.com/johandahlberg/piper including some pipelines.

ADD COMMENTlink written 3.7 years ago by Johan770
1

i'm kind of in awe of how much patience you have for this framework. Outside of the Broad itself you are pretty much the only one with a real working Queue pipeline on Github.

Have you thought about abstracting Queue into a DSL for mere mortals?

ADD REPLYlink written 2.4 years ago by Jeremy Leipzig17k
4
2.4 years ago by
Norway
Endre Bakken Stovner800 wrote:

I use snakemake; it is a make implemented in python allowing you to use python in your rules. It is made especially for bioinformatics pipelines. Read all about it here: https://bitbucket.org/johanneskoester/snakemake/wiki/Documentation
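To give a feel for the make-like model, here is a plain-Python sketch of how wildcard rules can be chained backwards from a requested target. It is not Snakemake syntax and not how Snakemake is implemented; the `RULES` table and the function names are invented for illustration.

```python
import re

# Hypothetical rules: output pattern -> input patterns, sharing a {sample} wildcard.
RULES = {
    "{sample}.vcf": ["{sample}.bam"],
    "{sample}.bam": ["{sample}.fastq"],
}


def match_rule(target):
    """Return (output_pattern, sample) for the rule that produces target, or None."""
    for pattern in RULES:
        # Turn "{sample}.bam" into the regex "^(?P<sample>.+)\.bam$".
        regex = re.escape(pattern).replace(re.escape("{sample}"), "(?P<sample>.+)")
        found = re.match("^" + regex + "$", target)
        if found:
            return pattern, found.group("sample")
    return None


def build_order(target):
    """List the files to produce, dependencies first, ending with target."""
    hit = match_rule(target)
    if hit is None:
        return []  # no rule matches: treat the file as an existing source
    pattern, sample = hit
    order = []
    for dependency in RULES[pattern]:
        order.extend(build_order(dependency.format(sample=sample)))
    order.append(target)
    return order


if __name__ == "__main__":
    print(build_order("NA12878.vcf"))
```

Asking for `NA12878.vcf` yields the BAM first and then the VCF, which is the same "name the output you want and let the tool work out the steps" behaviour that make and Snakemake provide.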

If you know and like python, this might be the best choice for you.

It is robust, actively developed and open source.

Paper from bioinformatics: Snakemake—a scalable bioinformatics workflow engine

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Endre Bakken Stovner800
1
3.7 years ago by
Neilfws47k
Sydney, Australia
Neilfws47k wrote:

Another option: NGSANE. Soon to be published.

ADD COMMENTlink written 3.7 years ago by Neilfws47k
1
2.9 years ago by
Switzerland/Zürich
Milan Simonovic20 wrote:

Add BigDataScript to the list. It's another scripting language to learn, but then it allows you to seamlessly run pipelines locally or on a cluster, manage jobs, make checkpoints during execution, etc. Open sourced and published (2014).

ADD COMMENTlink written 2.9 years ago by Milan Simonovic20

Added. Looks pretty good. Love well documented projects from the beginning. If I have to write a new pipeline, I will make sure to consider BigDataScript.

ADD REPLYlink written 2.9 years ago by Carlos Borroto1.5k

Yeah, I added it to my own list when I came across the paper. 

ADD REPLYlink written 2.9 years ago by Dan Gaston6.8k
1
2.4 years ago by
Yahan360
Belgium
Yahan360 wrote:

Bpipe is the tool of choice here. Excellent support for threading, easy restarting of jobs that failed at a certain step in the workflow, easy stitching together of different steps, and management of input and output naming.
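The restart behaviour is the kind of thing that is tedious to hand-roll. As a rough illustration of the idea (this is not Bpipe code and says nothing about how Bpipe tracks state), a runner can record each finished step in a checkpoint file and skip those steps on the next run. The name `run_with_checkpoints` and the JSON checkpoint format are assumptions made for this sketch.

```python
import json
import os


def run_with_checkpoints(steps, checkpoint_path):
    """Run (name, function) steps in order, skipping names already recorded as done.

    Returns the names of the steps executed during this call.
    """
    done = []
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            done = json.load(handle)
    executed = []
    for name, func in steps:
        if name in done:
            continue  # finished in an earlier run
        func()  # an exception here leaves the checkpoint at the last success
        done.append(name)
        with open(checkpoint_path, "w") as handle:
            json.dump(done, handle)
        executed.append(name)
    return executed
```

If the second of three steps fails, fixing the problem and calling the function again resumes at that step instead of repeating the first one.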

https://code.google.com/p/bpipe/

Amazing it's not in here already

ADD COMMENTlink written 2.4 years ago by Yahan360
1
19 months ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith15k wrote:

In the paper Genome Modeling System: A Knowledge Management Platform for Genomics we talk about some of the principles one might think about when creating pipeline management infrastructure for genomics.  In that publication we included a list of relevant resources: Genome Analysis Platforms.

Some additional options that perhaps could be included in your very nice list above:

  • Arvados
  • DNA Nexus
  • BaseSpace

 

ADD COMMENTlink written 19 months ago by Malachi Griffith15k
1
17 months ago by
Carlos Borroto1.5k
Washington Metropolitan Area
Carlos Borroto1.5k wrote:

The Broad recently announced their replacement for Queue, Cromwell/WDL. We just started checking it out and it looks promising. When we did the initial search 2 years ago, we ended up choosing Queue. It worked for us and it was nice to get free advanced scatter-and-gather for GATK tools. However, maintaining Queue scripts in Scala was painful, particularly for non-GATK tools. We recently decided to migrate to Snakemake, our initial runner-up.
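For anyone unfamiliar with the term, scatter-and-gather means splitting a region into intervals, processing each interval independently, and concatenating the results in order. The sketch below shows the pattern only; it does not reflect Queue's or Cromwell's implementation, `call_interval` is a placeholder for a real per-interval tool run, and threads stand in for cluster jobs.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(start, end, n_chunks):
    """Split the half-open range [start, end) into n_chunks contiguous intervals."""
    size, extra = divmod(end - start, n_chunks)
    intervals = []
    position = start
    for index in range(n_chunks):
        length = size + (1 if index < extra else 0)
        intervals.append((position, position + length))
        position += length
    return intervals


def call_interval(interval):
    """Placeholder for a per-interval tool run; returns made-up 'variant' positions."""
    low, high = interval
    return [pos for pos in range(low, high) if pos % 250 == 0]


def scatter_gather(start, end, n_chunks):
    intervals = scatter(start, end, n_chunks)
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        # map() preserves input order, so gathering is a simple concatenation.
        parts = list(pool.map(call_interval, intervals))
    return [pos for part in parts for pos in part]
```

The appeal of getting this from the framework is that the interval bookkeeping and the ordered merge are written once rather than in every pipeline script.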

With the announcement of Cromwell and the near future release of WDL GATK Best Practices implementation, we are reconsidering migrating to Cromwell.

ADD COMMENTlink written 17 months ago by Carlos Borroto1.5k
1
8 months ago by
Juke-341.0k
Sweden
Juke-341.0k wrote:

A really interesting list is available here => https://github.com/pditommaso/awesome-pipeline

ADD COMMENTlink written 8 months ago by Juke-341.0k
0
3.7 years ago by
Dan Gaston6.8k
Canada
Dan Gaston6.8k wrote:

I've been trying a few different approaches over the last year or so. Currently my production pipeline is implemented as a makefile, per sample. All of my analysis is being run on a local workstation and not a cluster so it works well for that. I have been developing a data management system (hopefully soon to be written up and published) and am trying out a few more complex approaches there to make it more scalable. For relatively straightforward pipelines I do really like make or snakemake, particularly if this doesn't need to be run on a cluster.

I highly recommend versioning your make file templates. Anytime you make changes it should be a new version. For all projects/samples always store the make file that was used with the data. This means you can always reproduce your data exactly. You should also version and indicate what versions of bin files (BWA, GATK, Picard, etc) were used.
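One possible way to automate this, sketched under assumptions: copy the makefile next to the sample's output and write a small manifest with its checksum and the tool versions. The file names `Makefile.used` and `provenance.json`, the function name, and the idea of passing versions in as a dictionary are all choices made for this example; the version strings themselves have to come from wherever you already track them.

```python
import hashlib
import json
import os
import shutil


def record_provenance(makefile_path, tool_versions, sample_dir):
    """Copy the makefile into sample_dir and write a provenance manifest beside it."""
    os.makedirs(sample_dir, exist_ok=True)
    stored = os.path.join(sample_dir, "Makefile.used")
    shutil.copyfile(makefile_path, stored)
    with open(stored, "rb") as handle:
        digest = hashlib.sha256(handle.read()).hexdigest()
    manifest = {"makefile_sha256": digest, "tool_versions": tool_versions}
    with open(os.path.join(sample_dir, "provenance.json"), "w") as handle:
        json.dump(manifest, handle, indent=2, sort_keys=True)
    return manifest
```

The checksum makes it easy to tell later whether two samples were processed with an identical makefile, even if the template's version label was not bumped.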

ADD COMMENTlink written 3.7 years ago by Dan Gaston6.8k

you have one makefile per sample? so if you change the pipeline ...

ADD REPLYlink written 3.7 years ago by brentp22k

Well, I also have helper scripts, and the pipeline is stored as a template. So if I change the pipeline I just generate new makefiles from the new template and re-run it on whatever samples I want to re-run it on. I'm currently experimenting with some alternatives, though, in a more robust management system.
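A minimal sketch of that kind of helper, using Python's `string.Template`. The template contents, the `./align.sh` script and the function name are hypothetical and only illustrate the mechanism, not my actual pipeline.

```python
from string import Template

# Hypothetical template. "$$" produces a literal "$" so that make variables
# such as $(SAMPLE), $< and $@ survive substitution. Recipe lines start with a tab.
MAKEFILE_TEMPLATE = Template(
    "# pipeline template version $version\n"
    "SAMPLE = $sample\n"
    "$$(SAMPLE).bam: $$(SAMPLE).fastq\n"
    "\t./align.sh $$< > $$@\n"
)


def render_makefile(sample, version):
    """Fill in the sample name and template version for one sample's makefile."""
    return MAKEFILE_TEMPLATE.substitute(sample=sample, version=version)


if __name__ == "__main__":
    print(render_makefile("sample01", "1.2"), end="")
```

Writing the template version into the generated file's first line ties each per-sample makefile back to the template it came from.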

ADD REPLYlink written 3.7 years ago by Dan Gaston6.8k
0
3.7 years ago by
Michele Busby1.6k
United States
Michele Busby1.6k wrote:

I have been meaning to look into GenePattern http://www.broadinstitute.org/cancer/software/genepattern/ They have some nice stuff set up though I don't know how cluster submission works outside Broad.

ADD COMMENTlink written 3.7 years ago by Michele Busby1.6k
0
2.4 years ago by
A. Domingues1.4k
Mainz, Germany
A. Domingues1.4k wrote:

I am looking at Omics Pipe and Bpipe at the moment. The former appears to be relatively easy to implement and the latter is used by our core facility. Decisions, decisions.

 

Does anyone here have experience with Omics Pipe?

ADD COMMENTlink written 2.4 years ago by A. Domingues1.4k
0
23 months ago by
United States
ngsbioinformatics30 wrote:

Of all these pipeline infrastructures, which allow you to distribute parts of the pipeline to compute nodes and run other parts on a single node, as in the GATK Exome Pipeline? You can map the samples on different nodes, but when doing indel realignment or recalibration, it's best to have all the samples on a single node. After that, you can continue processing each sample on the compute nodes. I've only seen BDS and Queue be able to handle this.
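To be clear about the shape I mean, here is a toy sketch: a per-sample stage that can fan out, one joint stage that needs every sample in one place, then a second per-sample stage. Threads stand in for compute nodes, and all three stage functions are placeholders rather than real tool invocations.

```python
from concurrent.futures import ThreadPoolExecutor


def map_sample(sample):
    """Stand-in for per-sample mapping that could run on any compute node."""
    return f"{sample}.bam"


def joint_step(bams):
    """Stand-in for a step that needs all samples together on a single node."""
    return {bam: f"cohort{len(bams)}" for bam in bams}


def finish_sample(bam, cohort_tag):
    """Stand-in for per-sample processing that uses the joint step's result."""
    return f"{bam}:{cohort_tag}:done"


def run_pipeline(samples, workers=4):
    with ThreadPoolExecutor(max_workers=workers) as pool:
        bams = list(pool.map(map_sample, samples))  # scatter across workers
        shared = joint_step(bams)  # barrier: runs once, after all mapping is done
        # Scatter again, now that every sample has its joint-step result.
        return list(pool.map(lambda bam: finish_sample(bam, shared[bam]), bams))
```

The part a framework has to support is the barrier in the middle: nothing in the final stage may start until the single-node step has seen every sample.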

ADD COMMENTlink written 23 months ago by ngsbioinformatics30

Snakemake allows you to specify rules that are to be run locally (localrules). It would be more difficult to script that a specific rule get run on a specific node, but it's possible depending on your scheduler.

ADD REPLYlink written 23 months ago by Jeremy Leipzig17k
0
3 months ago by
kaixian1100
kaixian1100 wrote:

How about Toil? It's not suitable for bioinformatics, I think.

ADD COMMENTlink modified 3 months ago • written 3 months ago by kaixian1100

Toil is explicitly written as a bioinformatics pipeline. It is developed by a genomics group after all. I've been using Toil for over a year now in Development and Production environments.

ADD REPLYlink written 3 months ago by Dan Gaston6.8k
0
3 months ago by
pwwang0
pwwang0 wrote:

Another option: pyppl

ADD COMMENTlink written 3 months ago by pwwang0