Question

Workflow management software for pipeline development in NGS

31

11.1 years ago

michaelberinski ▴ 310

There was an interesting thread about experience with automating NGS pipelines: "How To Decide Which Software To Use For Building Automated Ngs Analysis Pipeline".

Galaxy seemed to be the most prominent tool 3 years ago, but a few drawbacks were mentioned.

I was wondering whether there is any new software for workflow automation that people are using, and whether you have experienced any problems or limitations in achieving commonly required tasks like:

Allow jobs to be run concurrently
Allow jobs to be restarted from where they left off
Support for Sun Grid Engine to launch tasks (out of the box)
Allow to run any Shell command
Reporting at the end of the run (with timings)

and optional tasks like:

Reporting during the run
Graphical representation of pipeline
Well maintained and support from a community

Obviously there are plenty of tools, but I am especially interested in any that you can recommend or have particular experience with. To mention a few that I had in mind:

Snakemake    https://code.google.com/p/snakemake/        Python/DSL
Ruffus       https://code.google.com/p/ruffus/           Python
Bpipe        https://code.google.com/p/bpipe/            Groovy/DSL
Galaxy       http://galaxyproject.org/                   Python/REST API
BioLite      https://bitbucket.org/caseywdunn/biolite    Python/C++

pipeline workflow • 20k views

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by michaelberinski ▴ 310

3

Not all bioinformatics-related, but a nice list:

https://github.com/pditommaso/awesome-pipeline

Developers here might consider adding their wares to the list.

ADD REPLY • link 9.8 years ago by Sean Davis 27k

0

Another python library ready for use in a production environment is Cosmos. Disclosure, I'm the author.

ADD REPLY • link 8.7 years ago by egafni ▴ 30

Ram · Answer 1 · 2014-10-16

17

11.0 years ago

pditommaso ▴ 230

I ended up engineering my own pipeline tool, called Nextflow, because I was not happy with any of the frameworks out there.

Briefly:

It can execute any shell script, command or any mix of them.
It uses a declarative parallelisation model based on the dataflow paradigm. Task parallelisation, dependencies and synchronisation are implicitly defined by the input/output declarations of each task.
On task errors it stops gently, showing the cause of the task error. The user can then reproduce the problem in order to fix it.
The pipeline execution can be resumed from the last successfully executed step.
The same pipeline script can be executed on multiple platforms: single workstation, cluster of computers (SGE, SLURM, LSF, PBS/Torque, DRMAA) and cloud (DNAnexus).
It can produce an execution tracing report with useful task runtime information (execution time, mem / cpu used, etc).
Notably, it integrates support for Docker. This feature is extremely useful for shipping complex binary dependencies in one (or more) Docker images. Nextflow takes care of running each task in its own container transparently.
Graphical representation of the pipeline? No (at least for now)
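The dataflow idea in the list above can be illustrated outside Nextflow. The following is a minimal Python sketch, not Nextflow's API or syntax: the task and file names are made up for illustration, and it only shows how execution order and parallelism can be derived from declared inputs and outputs instead of being written out explicitly. It assumes Python 3.9 or later for `graphlib`.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical tasks: each one declares only the files it reads and writes.
TASKS = {
    "align":    {"inputs": ["reads.fastq", "ref.fa"], "outputs": ["aln.bam"]},
    "sort":     {"inputs": ["aln.bam"],               "outputs": ["sorted.bam"]},
    "call":     {"inputs": ["sorted.bam", "ref.fa"],  "outputs": ["calls.vcf"]},
    "coverage": {"inputs": ["sorted.bam"],            "outputs": ["cov.txt"]},
}


def build_graph(tasks):
    """Map each task to the set of tasks that produce one of its inputs."""
    producer = {out: name for name, t in tasks.items() for out in t["outputs"]}
    return {
        name: {producer[i] for i in t["inputs"] if i in producer}
        for name, t in tasks.items()
    }


def execution_waves(tasks):
    """Return lists of tasks; tasks within one list could run concurrently."""
    sorter = TopologicalSorter(build_graph(tasks))
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # all tasks whose inputs exist
        waves.append(ready)
        sorter.done(*ready)
    return waves


if __name__ == "__main__":
    for number, wave in enumerate(execution_waves(TASKS), start=1):
        print(number, wave)
```

Nothing in `TASKS` says that "sort" follows "align"; that ordering falls out of the fact that one task's output is another task's input.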

You can find more here http://www.nextflow.io

Paolo

ADD COMMENT • link updated 3.9 years ago by Ram 45k • written 11.0 years ago by pditommaso ▴ 230

1

Written in groovy

ADD REPLY • link 8.8 years ago by ostrokach ▴ 350

0

What's the problem with that?

ADD REPLY • link 9.6 years ago by pditommaso ▴ 230

2

Not so much a 'problem', but a lot of bioinformatics is done in R or Python or even just shell scripting, so coming from that background Groovy seems very alien. I've spent a lot of time over the past weeks trying to figure out how Nextflow works, and the combination of Groovy + Domain Specific Language has been a major hurdle for me. Your docs are really great, but it's been difficult to figure out what is actually happening when I run the pipeline scripts, inspect objects, etc.

For anyone interested, this has been incredibly helpful:

https://github.com/nextflow-io/hack17-tutorial

And the docs are here:

https://www.nextflow.io/docs/latest/index.html

I would be very interested in workshops held in the USA for this, I saw there was one in Europe for it.

ADD REPLY • link 7.8 years ago by steve ★ 3.5k

1

Nothing really... just thought I'd point it out.

I spent a bit of time looking through the nextflow website before realizing that it's written in groovy. Most of the code I run is written in C / C++ / Python, so I would rather use a workflow manager written in one of those languages. It makes it easier to track down problems and minimizes heavy dependencies (i.e. Java VM).

For someone who mostly works with Java VM languages, groovy might be a plus.

ADD REPLY • link 9.5 years ago by ostrokach ▴ 350

Ram · Answer 2 · 2014-10-16

12

11.1 years ago

Sean Davis 27k

I use snakemake pretty heavily. I have not yet found many limitations and the author is quite responsive and active in development. Support is via a google group and questions are answered pretty quickly in my experience. For folks familiar with make and python, the learning curve will not be too steep.

ADD COMMENT • link updated 3.9 years ago by Ram 45k • written 11.1 years ago by Sean Davis 27k

2

I second Snakemake. It addresses the weaknesses of Make without abandoning the incredibly powerful implicit wildcard rules.

ADD REPLY • link 11.1 years ago by Jeremy Leipzig 23k

2

Another vote for Snakemake. I used to use Ruffus but once I gave Snakemake a try I never looked back. It still amazes me how even complicated workflows can be represented cleanly in Snakemake. It satisfies all the OP's requirements without any problems or limitations that I've found.

ADD REPLY • link updated 3.8 years ago by Ram 45k • written 11.1 years ago by Ryan Dale 5.0k

2

Snakemake should be the accepted answer! Found it through this thread and it's awesome.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 9.8 years ago by ostrokach ▴ 350

Ram · Answer 3 · 2014-10-16

9

11.1 years ago

Pierre Lindenbaum 166k

I use GNU make:

Allow jobs to be run concurrently: yes https://www.gnu.org/software/make/manual/html_node/Parallel.html

Allow jobs to be restarted from where they left off : yes if you re-run make https://www.gnu.org/software/make/manual/html_node/Prerequisite-Types.html . "if a target's prerequisite is updated, then the target should also be updated."

Support for Sun Grid Engine to launch tasks (out of the box) : yes http://gridscheduler.sourceforge.net/htmlman/htmlman1/qmake.html

Allow to run any Shell command : yes and even http://www.gnu.org/software/make/manual/make.html#Choosing-the-Shell

Reporting at the end of the run (with timings): no but easy to implement in the directives

Reporting during the run : no

Graphical representation of pipeline : yes

Well maintained and support from a community : Initial release 1977 , v4.0 released in 2014.
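The "restart from where they left off" behaviour comes from make comparing file modification times. As a rough illustration only (this is a Python sketch of the rule, not how GNU make is implemented, and the function names are made up), the check amounts to:

```python
import os


def needs_rebuild(target, prerequisites):
    """True if the target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def run_if_stale(target, prerequisites, action):
    """Call action() only when the target is out of date.

    Returns True if the action ran, False if the step was skipped.
    """
    if needs_rebuild(target, prerequisites):
        action()
        return True
    return False
```

Re-running a pipeline built from such steps skips everything whose outputs are already newer than their inputs, which is why simply calling make again resumes after a failure.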

ADD COMMENT • link updated 6.1 years ago by Ram 45k • written 11.1 years ago by Pierre Lindenbaum 166k

5

I was using make. I really love it. I am, just like Pierre, a great fan of the command line ;). But I must confess that I have been weak, and was charmed by the elegance of snakemake and by how it can follow the same philosophy as make while being much simpler and more readable.

Jokes apart, in my case multiple-output rules were the key to my transition, although I know make can handle them by using patterns. My makefiles were far uglier than my current Snakefiles. Just to compensate for my treason, I am still using vim. :)

ADD REPLY • link 10.6 years ago by gbayon ▴ 170

0

This question is not particularly related to NGS pipelines, but how cluster-friendly is Make? Is there any interaction with Torque?

ADD REPLY • link 11.0 years ago by David Westergaard ★ 1.5k

3

I use qmake (make + SGE). There is nothing more to do than 'make -j N'; the jobs are dispatched on the cluster.

ADD REPLY • link 11.0 years ago by Pierre Lindenbaum 166k

Ram · Answer 4 · 2014-10-16

I did this search not so long ago and ended up choosing Queue.

Allow jobs to be run concurrently

Yes.

Allow jobs to be restarted from where they left off

Yes. Not only that, it has a retry-failed option, which is quite useful when the node running the job crashes.

Support for Sun Grid Engine to launch tasks (out of the box)

Yes. Direct support out of the box. It also supports any scheduler with a DRMAA library implementation, e.g. Slurm.

Allow to run any Shell command

Kind of. You need to write a Scala wrapper class for your tool, which is quite easy to do. I had zero experience with Scala and very little with Java when I started using Queue, and I was able to write my own wrappers for bwa mem and a few other tools without much trouble.

Reporting at the end of the run (with timings)

Yes.

I do have some issues with Queue. First, the licensing is still a mess. The GATK team uses Appistry for commercial licensing, but Appistry doesn't support Queue. So if you are a commercial user you have to pay for the license, but you will be using a version of GATK embedded in Queue that is not supported by Appistry. Also, if like me your Scala/Java experience is limited, things can get complicated fast, even though it is easy to write simple tool wrappers. For example, reusing your wrappers between multiple qscripts isn't as easy as it should be. I still haven't found a way to do this properly, so there is a lot of copied and pasted code in my scripts at the moment.

Having said that, Queue has a killer feature for me: Scatter & Gather. Almost every GATK tool has a partitioning type defined for it, meaning Queue can automatically detect what kind of partitioning of the input data can be done and launch multiple processes for the same file. This makes it amazingly easy to process large input files in parallel, even when partitioning the data is more complicated than what GNU parallel would be able to do.
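For readers unfamiliar with the pattern, here is a generic scatter/gather sketch in Python. It is not Queue's or GATK's API: the chunking rule, the function names and the toy GC-counting task are all assumptions chosen to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter(items, n_parts):
    """Split items into at most n_parts contiguous chunks."""
    size = max(1, -(-len(items) // n_parts))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def count_gc(chunk):
    """Toy per-chunk task: count G and C bases in a sequence fragment."""
    return sum(base in "GC" for base in chunk)


def scatter_gather(sequence, n_parts=4):
    """Scatter the input, process the chunks concurrently, gather by summing."""
    chunks = scatter(sequence, n_parts)
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)


if __name__ == "__main__":
    print(scatter_gather("ACGTGGCCAT", n_parts=3))
```

The hard part in practice is the `scatter` step: splitting a BAM or VCF by interval or by contig is tool-specific, which is the knowledge Queue gets from the partitioning type declared for each GATK tool.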

--Carlos

Ram · Answer 5 · 2014-10-17

2

11.0 years ago

Samuel Lampa ★ 1.3k

We're still researching options and possibilities, but wanted to chime in with the extended requirements list that we collected, from our own use cases, and what seemed to be things that many others ask for too, in the hope it might help tool makers to not miss any important requirements (in no particular order):

Atomic writes (don't write half-baked data, at least not when using file existence as flag for completion of task)
Integration with HPC resource managers such as SLURM, PBS etc. (Possibly via DRMAAv2)
Hadoop integration?
Stage temporary files to local disk (or separate folder in general)
Streaming mode
Batch mode
Streaming / Batch mode chosen with configurable switch
Don't start too many (OS level) processes (eg. max 1024 on UPPMAX)
Declarative workflow ("dependency graph") specification
Workflow / dependency graph definition separate from processes / task definitions.
Support an explorative usage pattern, by the use of "per request" jobs, that run a specified set of in-data through a specified part of the workflow, up to a specified point in the workflow graph, where it is persisted.
Specify data dependencies (not just task dependencies, as there can be more than one input/output to tasks!)
Be able to restart from existing persisted output from previous task
Be able to run on multiple nodes, with common task scheduler keeping track of dependencies (so no two processes run the same task)
Strategy for file naming (dependent upon task ids, and what makes each separate run unique such as parameters and run ids)
Support workflow execution and triggering based on availability of data
Should support automatic reporting of parameters, runtimes and tool and data versions.
Idempotency: Don't overwrite existing data, and running it twice should not be different than running it once.
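Two of the items above, atomic writes and idempotency, are easy to get wrong, so here is a minimal Python sketch of one common approach: write to a temporary file in the same directory, then rename it into place. The function names are made up, and the sketch assumes the temporary file and the final path are on the same filesystem (otherwise the rename is not atomic).

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data so that path either does not exist or is complete."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)     # atomic on the same filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)        # never leave half-baked files behind
        raise


def run_once(path, produce):
    """Idempotent step: skip if the output exists, never overwrite it.

    Returns True if the output was produced, False if it was already there.
    """
    if os.path.exists(path):
        return False
    atomic_write(path, produce())
    return True
```

With this in place, file existence is a safe completion flag: a crash mid-write leaves no file at `path`, so the task is simply re-run.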

Bonus points (would make a serious killer system):

Optimally support a flexible query language, that translates on demand into a dynamically generated data flow network.
Would be nice with a "self-learning" rule engine for deciding the job running time when scheduling, that could, based upon past running times, give an estimate based upon the file size of the current file. [Idea by Ino de Bruijn]
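As a starting point for the run-time estimate, a plain least-squares line through past (file size, run time) pairs is probably the simplest thing that could work. The sketch below is only that: the function names and the linear model are assumptions, and real tools may scale non-linearly with input size.

```python
def fit_line(sizes, runtimes):
    """Ordinary least-squares fit of runtime = slope * size + intercept."""
    n = len(sizes)
    if n < 2 or n != len(runtimes):
        raise ValueError("need at least two (size, runtime) pairs")
    mean_x = sum(sizes) / n
    mean_y = sum(runtimes) / n
    sxx = sum((x - mean_x) ** 2 for x in sizes)
    if sxx == 0:
        raise ValueError("all file sizes are identical; cannot fit a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, runtimes))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def estimate_runtime(history, size):
    """Predict a run time for `size` from past (size, runtime) pairs."""
    slope, intercept = fit_line([s for s, _ in history], [t for _, t in history])
    return slope * size + intercept
```

A scheduler could call `estimate_runtime` with the history of one tool and the size of the next input, then pad the result before requesting a wall-time limit.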

ADD COMMENT • link updated 3.8 years ago by Ram 45k • written 11.0 years ago by Samuel Lampa ★ 1.3k

1

As a matter of fact, after (possibly prematurely) settling on Spotify's luigi workflow system, we've done quite some work in the direction of the points above, although we're still far from the goal. See e.g. these posts, documenting our findings:

ADD REPLY • link 11.0 years ago by Samuel Lampa ★ 1.3k

0

For proper reference I need to mention that our experience resulted in the SciLuigi helper library on top of luigi: https://github.com/pharmbio/sciluigi (Though nowadays we are taking the ideas even further with http://scipipe.org).

ADD REPLY • link 8.4 years ago by Samuel Lampa ★ 1.3k

1

Have you looked into Anduril? Sounds like a fit for most requirements. The new version 2.0 will have Scala as workflow scripting language.

ADD REPLY • link 10.5 years ago by Christian ★ 3.1k

1

I second two of these items in particular: file naming strategy and workflow/dependency declaration. The file naming strategy should also be a directory structure strategy. For example, a multi-regional sequencing pipeline might arrange directories like this:

DATA
  person1
     N
     T1
     T2
  person2
     N
     T1
     T2
     T3
     etc.

Other structures would be used depending on the project. An RNA-seq project might have multiple biological replicates per sample. A complex project might do DNA sequencing, RNA-seq, methylation sequencing, etc. Each of these might require variations on the directory structure. The pipeline user should be able to specify what that structure will look like.

Pipeline commands should be easily specified without having to list filenames. Say you use joint variant calling across all samples of an individual. You want to specify the basic command and let the system take care of determining which filenames to use, these being a variable number depending on individual, and input filenames at "sample level" (N, T1, etc.), output filename at "person" level.
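To make the idea concrete, here is a small Python sketch of deriving filenames from the directory layout shown earlier. Everything specific in it is an assumption for illustration: the per-sample input name `aligned.bam`, the per-person output name `joint.vcf`, and the function name.

```python
from pathlib import Path


def plan_joint_calls(data_dir, input_name="aligned.bam", output_name="joint.vcf"):
    """Collect sample-level inputs and a person-level output for each person.

    Expects the layout DATA/<person>/<sample>/<input_name>.
    """
    plan = {}
    for person in sorted(p for p in Path(data_dir).iterdir() if p.is_dir()):
        inputs = sorted(
            sample / input_name
            for sample in person.iterdir()
            if sample.is_dir() and (sample / input_name).is_file()
        )
        if inputs:  # skip people with no usable samples
            plan[person.name] = {
                "inputs": inputs,                  # variable number per person
                "output": person / output_name,    # one file at person level
            }
    return plan


if __name__ == "__main__":
    for person, files in plan_joint_calls("DATA").items():
        print(person, [str(f) for f in files["inputs"]], "->", files["output"])
```

The user writes the command once; the number of input files (N, T1, T2, ...) is discovered per person rather than listed by hand.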

And here is another desired feature:

Parameters for the different pipeline elements should be easily viewed and edited by a non-bioinformatics expert. They should be organized like the pipeline itself is.

And two more "Bonus points" items:

Provide a lab database system that organizes data from multiple projects. Able to query for mutations found in particular genes across all projects, for example. Or able to extract somatic variants and clinical data on persons in WES projects, for downstream analysis. Interface the database to the pipeline system.
Canned pipeline modules that can be plugged in and used without much work. E.g. a FASTQ trimming module, or a module that analyzes samples from one individual to see if sample mix-up occurred. Beyond the pipeline structure, whose features have been listed, there is the actual use of the pipeline, setting up its different components, and there are many common analyses that should be easy to use by grabbing a pre-developed pipeline module and plugging it into the pipeline for a given project. The pipeline workflow language should allow modules to be connected together, so that the output files from one module would provide the input files (or one type of input files) to another module.

ADD REPLY • link 8.5 years ago by twtoal ▴ 50

0

Nice points! I think your bonus point number 2 is what used to be referred to as "sub-workflow" support, or sometimes "sub-network" support.

I really like point #1 and bonus point #1 too ... they make me think ... :)

ADD REPLY • link 8.4 years ago by Samuel Lampa ★ 1.3k

Ram · Answer 6 · 2014-10-17

As with another poster above, I wrote my own tool to handle pipelines, in case it's of any interest to anyone. It's called Cluster Flow.

It doesn't tick all of your boxes, but is designed to be easy to customise and get running. It uses default cluster queue management systems to handle job hierarchy.

Yes - Allow jobs to be run concurrently
Yes - Support for Sun Grid Engine to launch tasks (out of the box)
Yes - Allow to run any Shell command (with a simple wrapper, example provided)
Yes - Reporting at the end of the run (with timings)
Yes - Reporting during the run
Sort of - Graphical representation of pipeline (text based on command line)
Sort of - Well maintained and support from a community (supported by me)
No (not yet anyway) - Allow jobs to be restarted from where they left off

Cluster Flow best suits small groups / low throughput usage where flexibility is key. It has a shallow learning curve so is good for the less technically minded amongst us :) The core code is written in Perl, but the modules can be any language.

Phil

score 1 · Answer 7 · 2017-02-17

Another option is Cosmos, which has all of the features you mentioned. It is very stable and various groups have used it to process many thousands of genomes. The author works at a large clinical sequencing laboratory.

Features include:

Written in Python, which is easy to learn, powerful, and popular. A researcher or programmer with limited experience can begin writing Cosmos workflows right away.
Powerful syntax for the creation of complex and highly parallelized workflows.
Reusable recipes and definitions of tools and sub-workflows allow for DRY code.
Keeps track of workflows, job information, and resource utilization and provenance in an SQL database.
The ability to visualize all jobs and job dependencies as a convenient image.
Monitor and debug running workflows, and a history of all workflows via a web dashboard.
Alter and resume failed workflows.

Disclosure: I'm the author!