I ended up engineering my own pipeline tool, called Nextflow, because I was not happy with any of the frameworks out there.
It can execute any shell script or command, or any mix of them.
It uses a declarative parallelisation model based on the dataflow paradigm. Task parallelisation, dependencies and synchronisation are implicitly defined by the tasks' input/output declarations.
On task errors it stops gracefully, reporting the cause of the failure, so the user has the chance to reproduce the problem in order to fix it.
Pipeline execution can be resumed from the last successfully executed step.
The same pipeline script can be executed on multiple platforms: a single workstation, a compute cluster (SGE, SLURM, LSF, PBS/Torque, DRMAA) or the cloud (DNAnexus).
It can produce an execution trace report with useful per-task runtime information (execution time, memory/CPU used, etc.).
Notably, it integrates support for Docker. This is extremely useful for shipping complex binary dependencies as one (or more) Docker images; Nextflow takes care of running each task in its own container transparently.
Graphical representation of the pipeline? No (at least for now)
I use Snakemake pretty heavily. I have not yet found many limitations, and the author is quite responsive and active in development. Support is via a Google group and questions are answered pretty quickly in my experience. For folks familiar with make and Python, the learning curve will not be too steep.
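In case it helps, here is a minimal Snakefile sketch in the style of the Snakemake tutorial; the file names and the bwa/samtools commands are placeholders, not from any real project. Dependencies and parallelism fall out of the input/output declarations:

```
# Minimal Snakefile sketch (placeholder paths and commands).
rule all:
    input:
        expand("mapped/{sample}.bam", sample=["A", "B"])

rule bwa_map:
    input:
        ref="data/genome.fa",
        reads="data/samples/{sample}.fastq"
    output:
        "mapped/{sample}.bam"
    shell:
        "bwa mem {input.ref} {input.reads} | samtools view -Sb - > {output}"
```

Running on a cluster is then, if I remember correctly, something like `snakemake --cluster qsub --jobs 100` (or the `--drmaa` option), but check the docs for your scheduler.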
Allow jobs to be restarted from where they left off
Yes. Not only that, it has a retry-failed option, which is quite useful for cases where the node running the job crashes.
Support for Sun Grid Engine to launch tasks (out of the box)
Yes, direct support out of the box. It also supports any scheduler with a DRMAA library implementation, e.g. SLURM.
Allow to run any Shell command
Kind of. You need to write a Scala wrapper class for your tool, which is quite easy to do. I had zero experience with Scala and very little with Java when I started using Queue, and I was able to write my own wrappers for bwa mem and a few other tools without much trouble.
Reporting at the end of the run (with timings)
I do have some issues with Queue. First, the licensing is still a mess. The GATK team uses Appistry for commercial licensing, but Appistry doesn't support Queue. So if you are a commercial user you have to pay for the license, but you will be using a version of GATK embedded into Queue that is not supported by Appistry. Also, if like me your Scala/Java experience is limited, while it is easy to write simple tool wrappers, things can get complicated fast. For example, reusing your wrappers between multiple QScripts isn't as easy as it should be. I still haven't found a way to do this properly, so there is a lot of copied/pasted code in my scripts at the moment.
Having said that, Queue has a killer feature for me: scatter & gather. Almost every GATK tool has a partitioning type defined for it, meaning Queue can automatically detect what kind of partitioning of the input data is possible and launch multiple processes for the same file. This makes it amazingly easy to do parallel processing of large input files, even when partitioning the data is more complicated than what GNU parallel could handle.
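For anyone who hasn't met the pattern before, here is a tool-agnostic sketch of scatter/gather in plain Python. This is just to illustrate the idea, not how Queue implements it, and `process_chunk` is a stand-in for whatever per-region tool you would actually run:

```
# Tool-agnostic scatter/gather sketch; process_chunk stands in for the real
# per-region work (this is not Queue's internal code).
from concurrent.futures import ProcessPoolExecutor

def scatter(items, n_chunks):
    """Split the input into roughly equal chunks."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]

def process_chunk(chunk):
    """Stand-in for the real per-chunk work (e.g. calling variants on some intervals)."""
    return [f"processed {item}" for item in chunk]

def gather(partial_results):
    """Merge the per-chunk outputs back into one result."""
    return [line for part in partial_results for line in part]

if __name__ == "__main__":
    intervals = [f"chr1:{s}-{s + 999_999}" for s in range(1, 10_000_000, 1_000_000)]
    chunks = scatter(intervals, n_chunks=4)
    with ProcessPoolExecutor(max_workers=4) as pool:  # scatter: one worker per chunk
        parts = list(pool.map(process_chunk, chunks))
    merged = gather(parts)                            # gather: merge partial results
    print(len(merged), "intervals processed")
```

The point of Queue is that it already knows, per GATK tool, which partitioning is valid and handles this bookkeeping (plus the cluster submission) for you.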
We're still researching options and possibilities, but wanted to chime in with the extended requirements list we collected from our own use cases, plus things that many others seem to ask for too, in the hope it might help tool makers not to miss any important requirements (in no particular order):
Atomic writes (don't write half-baked data, at least not when using file existence as the flag for task completion); see the sketch after this list.
Integration with HPC resource managers such as SLURM, PBS, etc. (possibly via DRMAA v2)
Stage temporary files to local disk (or a separate folder in general)
Streaming / batch mode chosen with a configurable switch
Don't start too many (OS-level) processes (e.g. max 1024 on UPPMAX)
Workflow / dependency graph definition separate from processes / task definitions.
Support an exploratory usage pattern, by the use of "per request" jobs that run a specified set of input data through a specified part of the workflow, up to a specified point in the workflow graph, where it is persisted.
Specify data dependencies (not just task dependencies, as there can be more than one input/output to tasks!)
Be able to restart from existing persisted output from a previous task
Be able to run on multiple nodes, with a common task scheduler keeping track of dependencies (so no two processes run the same task)
Strategy for file naming (dependent on task IDs and whatever makes each separate run unique, such as parameters and run IDs)
Support workflow execution and triggering based on availability of data
Should support automatic reporting of parameters, runtimes and tool and data versions.
Idempotency: don't overwrite existing data, and running it twice should not be different from running it once.
Bonus points (would make a serious killer system):
Optimally, support a flexible query language that translates on demand into a dynamically generated dataflow network.
It would be nice to have a "self-learning" rule engine for deciding job running times when scheduling: based on past running times, it could give an estimate from the size of the current input file. [Idea by Ino de Bruijn]
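To make the atomic-writes and idempotency points above concrete, here is a small sketch of how a tool could get both with nothing more than a temporary file, an atomic rename and an existence check; the helper names are ours, not from any particular framework:

```
# Sketch of atomic, idempotent output writing (helper names are illustrative only).
import os
import tempfile

def write_atomically(path, data):
    """Write to a temp file in the same directory, then rename it into place.

    os.replace() is atomic on POSIX filesystems, so a half-written file is never
    visible under the final name -- file existence can safely signal completion.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as handle:
            handle.write(data)
        os.replace(tmp_path, path)
    except BaseException:
        os.unlink(tmp_path)
        raise

def run_task(out_path, produce):
    """Idempotent task wrapper: skip the work if the output already exists."""
    if os.path.exists(out_path):
        return out_path            # running twice is the same as running once
    write_atomically(out_path, produce())
    return out_path

if __name__ == "__main__":
    run_task("result.txt", lambda: "hello\n")
    run_task("result.txt", lambda: "hello\n")  # second call is a no-op
```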
Sort of - Graphical representation of pipeline (text-based, on the command line)
Sort of - Well maintained and supported by a community (supported by me)
No (not yet anyway) - Allow jobs to be restarted from where they left off
Cluster Flow best suits small groups / low-throughput usage where flexibility is key. It has a shallow learning curve, so it is good for the less technically minded amongst us :) The core code is written in Perl, but the modules can be in any language.
Another option is Cosmos, which has all of the features you mentioned. It is very stable and various groups have used it to process many thousands of genomes. The author works at a large clinical sequencing laboratory.
Written in Python, which is easy to learn, powerful, and popular. A researcher or programmer with limited experience can begin writing Cosmos workflows right away.
Powerful syntax for the creation of complex and highly parallelized workflows.
Reusable recipes and definitions of tools and sub-workflows allow for DRY code (see the sketch after this list).
Keeps track of workflows, job information, and resource utilization and provenance in an SQL database.
The ability to visualize all jobs and job dependencies as a convenient image.
Monitor and debug running workflows, and browse a history of all workflows, via a web dashboard.
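To give a feel for what reusable, DRY tool definitions buy you, here is a hypothetical plain-Python sketch of the idea. This is not the actual Cosmos API (the class and function names here are made up for illustration); check the Cosmos documentation for the real syntax:

```
# Hypothetical illustration of DRY, reusable tool definitions -- not the Cosmos API.
from dataclasses import dataclass

@dataclass
class Tool:
    """A command template defined once and reused across workflows."""
    name: str
    template: str

    def command(self, **params):
        return self.template.format(**params)

# Tools are declared once...
BWA_MEM = Tool("bwa_mem", "bwa mem {ref} {fastq} > {out}")
SORT_BAM = Tool("sort", "samtools sort {bam} -o {out}")

# ...and reused in any number of workflows.
def align_workflow(samples, ref):
    for sample in samples:
        yield BWA_MEM.command(ref=ref, fastq=f"{sample}.fastq", out=f"{sample}.sam")
        yield SORT_BAM.command(bam=f"{sample}.sam", out=f"{sample}.sorted.bam")

if __name__ == "__main__":
    for cmd in align_workflow(["A", "B"], ref="genome.fa"):
        print(cmd)
```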
Not all bioinformatics-related, but a nice list:
Developers here might consider adding their wares to the list.
Another Python library ready for use in a production environment is Cosmos. Disclosure: I'm the author.