Question: Workflow management software for pipeline development in NGS
michaelberinski270 wrote (2.6 years ago, United Kingdom):

There was an interesting thread a while ago about experiences with automating NGS pipelines: How To Decide Which Software To Use For Building Automated Ngs Analysis Pipeline.

Galaxy seemed to be the most prominent tool three years ago, but a few drawbacks were mentioned.

I was wondering whether there is any new workflow automation software that people are using, and whether you have experienced any problems or limitations in achieving commonly required tasks like:

  • Allow jobs to be run concurrently
  • Allow jobs to be restarted from where they left off 
  • Support for Sun Grid Engine to launch tasks (out of the box)
  • Allow running any shell command
  • Reporting at the end of the run (with timings)

 

and optional features like:

  • Reporting during the run
  • Graphical representation of pipeline
  • Well maintained, with support from a community

Obviously there are plenty of tools, but I'd especially appreciate recommendations from anyone with a particular preference for, or experience with, any such software. To mention a few that I had in mind:

  • Snakemake: https://code.google.com/p/snakemake/ (Python/DSL)
  • Ruffus: https://code.google.com/p/ruffus/ (Python)
  • Bpipe: https://code.google.com/p/bpipe/ (Groovy/DSL)
  • Galaxy: http://galaxyproject.org/ (Python/REST API)
  • BioLite: https://bitbucket.org/caseywdunn/biolite (Python/C++)
 

 


Not all bioinformatics-related, but a nice list:

https://github.com/pditommaso/awesome-pipeline

Developers here might consider adding their wares to the list.

written 16 months ago by Sean Davis23k

Another Python library ready for use in a production environment is Cosmos. Disclosure: I'm the author.

written 3 months ago by egafni30
paolo.ditommaso120 wrote (2.6 years ago, European Union):

I ended up engineering my own pipeline tool, called Nextflow, because I was not happy with any of the frameworks out there.

Briefly: 

  • It can execute any shell script or command, or any mix of them.
  • It uses a declarative parallelisation model based on the dataflow paradigm. Task parallelisation, dependencies and synchronisation are implicitly defined by the tasks' input/output declarations.
  • On task errors it stops gracefully, showing the cause of the error. The user then has the chance to reproduce the problem in order to fix it.
  • The pipeline execution can be resumed from the last successfully executed step.
  • The same pipeline script can be executed on multiple platforms: a single workstation, a compute cluster (SGE, SLURM, LSF, PBS/Torque, DRMAA) and the cloud (DNAnexus).
  • It can produce an execution tracing report with useful task runtime information (execution time, memory/CPU used, etc.).
  • Notably, it integrates support for Docker. This feature is extremely useful for shipping complex binary dependencies using one (or more) Docker images. Nextflow takes care of running each task in its own container *transparently*.
  • Graphical representation of the pipeline? No (at least for now).

 

You can find more at http://www.nextflow.io

 

Paolo


Written in Groovy.

written 16 months ago by ostrokach230

What's the problem with that?

written 14 months ago by paolo.ditommaso120

Nothing really... just thought I'd point it out.

I spent a bit of time looking through the Nextflow website before realizing that it's written in Groovy. Most of the code I run is written in C / C++ / Python, so I would rather use a workflow manager written in one of those languages. It makes it easier to track down problems and minimizes heavy dependencies (i.e. the Java VM).

For someone who mostly works with JVM languages, Groovy might be a plus.

written 12 months ago by ostrokach230
Sean Davis23k wrote (2.6 years ago, National Institutes of Health, Bethesda, MD):

I use Snakemake pretty heavily. I have not yet found many limitations, and the author is quite responsive and active in development. Support is via a Google group, and questions are answered pretty quickly in my experience. For folks familiar with make and Python, the learning curve will not be too steep.
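For anyone who hasn't seen the syntax, a minimal Snakefile looks roughly like this (just a sketch; the reference, FASTQ paths and the bwa/samtools command lines are placeholders rather than a real pipeline):

SAMPLES = ["A", "B"]

rule all:
    input:
        expand("sorted/{sample}.bam", sample=SAMPLES)

# {sample} is an implicit wildcard: Snakemake decides which jobs to run
# by matching the files requested above against the output patterns below.
rule bwa_map:
    input:
        ref="ref/genome.fa",
        fq="fastq/{sample}.fastq.gz"
    output:
        "mapped/{sample}.bam"
    threads: 4
    shell:
        "bwa mem -t {threads} {input.ref} {input.fq} | samtools view -b - > {output}"

rule sort_bam:
    input:
        "mapped/{sample}.bam"
    output:
        "sorted/{sample}.bam"
    shell:
        "samtools sort -o {output} {input}"

Running 'snakemake -j 8' executes independent jobs concurrently, adding '--cluster "qsub -cwd -V"' hands each job to SGE, and re-running after an interruption only redoes targets that are missing or out of date.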


I second Snakemake. It addresses the weaknesses of Make without abandoning the incredibly powerful implicit wildcard rules.

written 2.6 years ago by Jeremy Leipzig16k

Another vote for Snakemake.  I used to use Ruffus but once I gave Snakemake a try I never looked back.  It still amazes me how even complicated workflows can be represented cleanly in Snakemake. It satisfies all the OP's requirements without any problems or limitations that I've found.

written 2.6 years ago by Ryan Dale4.5k

Snakemake should be the accepted answer! Found it through this thread and it's awesome. 

written 16 months ago by ostrokach230
Pierre Lindenbaum94k wrote (2.6 years ago, France/Nantes/Institut du Thorax - INSERM UMR1087):

I use GNU make:

  • Allow jobs to be run concurrently: yes (https://www.gnu.org/software/make/manual/html_node/Parallel.html).
  • Allow jobs to be restarted from where they left off: yes, if you re-run make (https://www.gnu.org/software/make/manual/html_node/Prerequisite-Types.html): "if a target's prerequisite is updated, then the target should also be updated."
  • Support for Sun Grid Engine to launch tasks (out of the box): yes (http://gridscheduler.sourceforge.net/htmlman/htmlman1/qmake.html).
  • Allow running any shell command: yes, and you can even choose the shell (http://www.gnu.org/software/make/manual/make.html#Choosing-the-Shell).
  • Reporting at the end of the run (with timings): no, but easy to implement in the recipes.
  • Reporting during the run: no.
  • Graphical representation of pipeline: yes.
  • Well maintained, with support from a community: initial release 1977, v4.0 released in 2014.

 


I was using make. I really love it. I am, just like Pierre, a great fan of the command line ;). But I must confess that I have been weak, and felt charmed by the elegance of Snakemake and by how, while following the same philosophy as make, it can be much simpler and more readable.

Joking aside, in my case multiple-output rules were the key to my transition, although I know make can handle them using patterns. But my makefiles were far uglier than the current ones. Just to compensate for my treason, I am still using vim. :)

written 2.1 years ago by gbayon150

This question is not particularly related to NGS pipelines, but how cluster-friendly is Make? Is there any interaction with Torque?

written 2.6 years ago by David Westergaard1.3k

I use qmake (make + SGE). There is nothing more to do than 'make -j N'; the jobs are dispatched on the cluster.

written 2.6 years ago by Pierre Lindenbaum94k
Carlos Borroto1.4k wrote (2.6 years ago, Washington Metropolitan Area):

I did this search[1] not so long ago and ended up choosing Queue.

[1] Which Bioinformatic Friendly Pipeline Building Framework?

 

  • Allow jobs to be run concurrently: yes.
  • Allow jobs to be restarted from where they left off: yes. Not only that, it has a retry-failed option, which is quite useful for cases where the node running a job crashes.
  • Support for Sun Grid Engine to launch tasks (out of the box): yes, direct support out of the box. It also supports any scheduler with a DRMAA library implementation, e.g. SLURM.
  • Allow running any shell command: kind of. You need to write a Scala wrapper class for your tool, which is quite easy to do. I had zero experience with Scala and very little with Java when I started using Queue, and I was able to write my own wrappers for bwa mem and a few other tools without much trouble.
  • Reporting at the end of the run (with timings): yes.

I do have some issues with Queue. First, the licensing is still a mess. The GATK team uses Appistry for commercial licensing, but Appistry doesn't support Queue. So if you are a commercial user you have to pay for the license, but you will be using a version of GATK embedded in Queue that is not supported by Appistry. Also, if like me your Scala/Java experience is limited, things can get complicated fast even though writing simple tool wrappers is easy. For example, reusing your wrappers between multiple QScripts isn't as easy as it should be; I still haven't found a way to do this properly, so there is a lot of copied and pasted code in my scripts at the moment.

Having said that, Queue has a killer feature for me: scatter-gather. Almost every GATK tool has a partitioning type defined for it, meaning Queue can automatically detect what kind of partitioning of the input data can be done and launch multiple processes for the same file. This makes it amazingly easy to do parallel processing of large input files, even when partitioning the data is more complicated than what GNU parallel would be able to do.
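For anyone unfamiliar with the pattern, scatter-gather just means splitting the input into independent pieces, processing them concurrently, and merging the per-piece results. A rough, generic Python sketch of the idea (not Queue's actual interface; the per-chromosome split, the 'variant_caller' command and the naive merge are all placeholders):

import subprocess
from concurrent.futures import ProcessPoolExecutor

# Hypothetical partitioning of the input by chromosome.
REGIONS = ["chr1", "chr2", "chr3"]

def call_region(region):
    """Scatter step: run the (placeholder) caller on one independent chunk."""
    out = "calls.{}.vcf".format(region)
    subprocess.check_call(
        ["variant_caller", "--region", region, "-o", out, "sample.bam"])
    return out

if __name__ == "__main__":
    # Scatter: process the chunks concurrently.
    with ProcessPoolExecutor() as pool:
        parts = list(pool.map(call_region, REGIONS))
    # Gather: merge the per-region outputs. Naive concatenation for
    # illustration only; real VCF merging has to handle headers properly.
    with open("calls.merged.vcf", "w") as merged:
        for part in parts:
            with open(part) as fh:
                merged.write(fh.read())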

 

--Carlos

 

 

Samuel Lampa1.1k wrote (2.6 years ago, Stockholm):

We're still researching options and possibilities, but I wanted to chime in with the extended requirements list that we collected from our own use cases and from what many others seem to ask for too, in the hope that it might help tool makers not to miss any important requirements (in no particular order):
 

  1. Atomic writes (don't write half-baked data, at least not when using file existence as flag for completion of task)

  2. Integration with HPC resource managers such as SLURM, PBS etc. (Possibly via DRMAAv2)

  3. Hadoop integration?

  4. Stage temporary files to local disk (or separate folder in general)

  5. Streaming mode

  6. Batch mode

  7. Streaming / Batch mode chosen with configurable switch

  8. Don't start too many (OS-level) processes (e.g. max 1024 on UPPMAX)

  9. Declarative workflow ("dependency graph") specification

  10. Workflow / dependency graph definition separate from processes / task definitions.

  11. Support an explorative usage pattern through "per request" jobs that run a specified set of input data through a specified part of the workflow, up to a specified point in the workflow graph, where the result is persisted.

  12. Specify data dependencies (not just task dependencies, as there can be more than one input/output to tasks!)

  13. Be able to restart from existing persisted output from previous task

  14. Be able to run on multiple nodes, with common task scheduler keeping track of dependencies (so no two processes run the same task)

  15. Strategy for file naming (dependent upon task ids, and what makes each separate run unique such as parameters and run ids)

  16. Support workflow execution and triggering based on availability of data

  17. Should support automatic reporting of parameters, runtimes and tool and data versions.

  18. Idempotency: don't overwrite existing data, and running a workflow twice should not be different from running it once.

 

Bonus points (would make a serious killer system):

  1. Optimally, support a flexible query language that translates on demand into a dynamically generated dataflow network.

  2. It would be nice to have a "self-learning" rule engine for estimating job running time when scheduling, which could, based upon past running times, give an estimate from the size of the current input file. [Idea by Ino de Bruijn]


As a matter of fact, after (possibly prematurely) settling on Spotify's luigi workflow system, we've done quite some work in the direction of the points above, although we're still far from the goal. See e.g. these posts documenting our findings:

written 2.6 years ago by Samuel Lampa1.1k

For proper reference I need to mention that our experience resulted in the SciLuigi helper library on top of luigi: https://github.com/pharmbio/sciluigi (Though nowadays we are taking the ideas even further with http://scipipe.org).
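To make a couple of the points above concrete, here is a minimal plain-luigi sketch (not SciLuigi; the file layout and the read-counting command are placeholders). The requires()/output() pair expresses the data dependency (points 12 and 13), and LocalTarget.open('w') writes through a temporary file that is only moved into place on success, which gives the atomic writes of point 1:

import subprocess
import luigi

class RawReads(luigi.ExternalTask):
    """Input data assumed to already exist on disk (placeholder path)."""
    sample = luigi.Parameter()

    def output(self):
        return luigi.LocalTarget("fastq/{}.fastq.gz".format(self.sample))

class CountReads(luigi.Task):
    sample = luigi.Parameter()

    def requires(self):
        # Data dependency: this task consumes the output of RawReads.
        return RawReads(sample=self.sample)

    def output(self):
        return luigi.LocalTarget("counts/{}.txt".format(self.sample))

    def run(self):
        lines = int(subprocess.check_output(
            "zcat {} | wc -l".format(self.input().path), shell=True))
        # Atomic write: luigi writes to a temporary file and renames it on
        # close, so an interrupted run never leaves a half-baked output behind.
        with self.output().open("w") as out:
            out.write("{}\n".format(lines // 4))

if __name__ == "__main__":
    luigi.build([CountReads(sample="sampleA")], local_scheduler=True)

Re-running the same command skips any task whose output already exists, which covers restarting from previously persisted output (point 13).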

written 3 days ago by Samuel Lampa1.1k
Have you looked into Anduril? It sounds like a fit for most of the requirements. The new version 2.0 will have Scala as its workflow scripting language.

written 2.1 years ago by Christian2.5k

I second two of these items in particular: the file naming strategy and the workflow/dependency declaration. The file naming strategy should also be a directory structure strategy. For example, a multi-regional sequencing pipeline might arrange directories like this:

DATA
  person1
     N
     T1
     T2
  person2
     N
     T1
     T2
     T3
     etc.

Other structures would be used depending on the project. An RNA-seq project might have multiple biological replicates per sample. A complex project might do DNA sequencing, RNA-seq, methylation sequencing, etc. Each of these might require variations on the directory structure. The pipeline user should be able to specify what that structure will look like.

Pipeline commands should be easily specified without having to list filenames. Say you use joint variant calling across all samples of an individual: you want to specify the basic command and let the system work out which filenames to use, since the number of inputs varies by individual, with input filenames at the "sample" level (N, T1, etc.) and the output filename at the "person" level.
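As a sketch of how a workflow tool can handle this (Snakemake syntax here; the sample sheet, paths and the 'joint_caller' command are made up), an input function can expand a per-person list of samples into the concrete input files, so the rule itself never lists filenames:

# Hypothetical sample sheet: which sequencing samples each person has.
PERSONS = {
    "person1": ["N", "T1", "T2"],
    "person2": ["N", "T1", "T2", "T3"],
}

rule joint_call:
    # The input function resolves to a variable number of BAMs per person.
    input:
        bams=lambda wc: expand("DATA/{person}/{sample}/final.bam",
                               person=wc.person, sample=PERSONS[wc.person])
    output:
        "DATA/{person}/joint_calls.vcf"
    shell:
        "joint_caller {input.bams} > {output}"

Requesting DATA/person2/joint_calls.vcf then pulls in all four of person2's BAM files automatically.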

And here is another desired feature:

  1. Parameters for the different pipeline elements should be easily viewed and edited by a non-bioinformatics expert. They should be organized like the pipeline itself is.

And two more "Bonus points" items:

  1. Provide a lab database system that organizes data from multiple projects. Able to query for mutations found in particular genes across all projects, for example. Or able to extract somatic variants and clinical data on persons in WES projects, for downstream analysis. Interface the database to the pipeline system.

  2. Canned pipeline modules that can be plugged in and used without much work, e.g. a FASTQ trimming module, or a module that analyzes samples from one individual to see whether a sample mix-up occurred. Beyond the pipeline structure, whose features have been listed above, there is the actual use of the pipeline and the setup of its different components, and many common analyses should be usable simply by grabbing a pre-developed pipeline module and plugging it into the pipeline for a given project. The pipeline workflow language should allow modules to be connected together, so that the output files from one module provide the input files (or one type of input files) to another module.

written 22 days ago by twtoal40

Nice points! I think your bonus point number 2 is what used to be referred to as "sub-workflow" support, or sometimes "sub-network" support.

I really like point #1 and bonus point #1 too ... they make me think ... :)

written 3 days ago by Samuel Lampa1.1k
Phil Ewels30 wrote (2.6 years ago, Sweden / Stockholm / SciLifeLab):

As with another poster above, I wrote my own tool to handle pipelines in case it's of any interest to anyone. It's called Cluster Flow: http://ewels.github.io/clusterflow/

It doesn't tick all of your boxes, but it is designed to be easy to customise and get running. It uses standard cluster queue management systems to handle the job hierarchy.

  • Yes - Allow jobs to be run concurrently
  • Yes - Support for Sun Grid Engine to launch tasks (out of the box)
  • Yes - Allow running any shell command (with a simple wrapper, example provided)
  • Yes - Reporting at the end of the run (with timings)
  • Yes - Reporting during the run
  • Sort of - Graphical representation of pipeline (text based on command line)
  • Sort of - Well maintained, with support from a community (supported by me)
  • No (not yet anyway) - Allow jobs to be restarted from where they left off

Cluster Flow best suits small groups / low-throughput usage where flexibility is key. It has a shallow learning curve, so it is good for the less technically minded amongst us :) The core code is written in Perl, but the modules can be in any language.

Phil

egafni30 wrote (3 months ago):

Another option is Cosmos, which has all of the features you mentioned. It is very stable and various groups have used it to process many thousands of genomes. The author works at a large clinical sequencing laboratory.

Features include:

  • Written in Python, which is easy to learn, powerful, and popular. A researcher or programmer with limited experience can begin writing Cosmos workflows right away.
  • Powerful syntax for the creation of complex and highly parallelized workflows.
  • Reusable recipes and definitions of tools and sub-workflows allow for DRY code.
  • Keeps track of workflows, job information, resource utilization and provenance in an SQL database.
  • The ability to visualize all jobs and job dependencies as a convenient image.
  • Monitor and debug running workflows, and view a history of all workflows, via a web dashboard.
  • Alter and resume failed workflows.

Disclosure: I'm the author!
