Question: workflow management systems: WDL, CWL, Ruffus, Snakemake, etc.
Bogdan wrote, 12 months ago (Palo Alto, CA, USA):

Dear all,

Any comments or suggestions on which pipeline "language"/workflow management system to choose for wrapping up some data analysis pipelines on SGE/SLURM clusters? Among the possible choices:

-- Snakemake: https://snakemake.readthedocs.io/en/stable/

-- Ruffus: http://www.ruffus.org.uk/

-- WDL: https://software.broadinstitute.org/wdl/

-- CWL: https://www.commonwl.org/

thanks a lot,

-- bogdan


Personally, I use Snakemake. There is a thread here discussing this:

Snakemake vs. Nextflow: strengths and weaknesses

-- written 12 months ago by Medhat

My vote goes for snakemake. But I haven't tried the rest, so there is that.

Nevertheless, snakemake is quite broadly adopted, and many pipelines that you can adjust to your needs already exist. It's quite readable, easy to get started with, and allows throwing in some Python code, which is convenient. Its interaction with the conda package manager is also a plus.
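To illustrate those last two points, here is a minimal, hypothetical Snakefile fragment (the rule names, file paths, and the conda environment file are all invented for illustration):

```
# First rule: runs inside a conda environment when invoked with --use-conda.
rule fastqc:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "qc/{sample}_fastqc.html"
    conda:
        "envs/qc.yaml"            # hypothetical environment definition
    shell:
        "fastqc {input} --outdir qc"

# Second rule: arbitrary Python is allowed in a run: block.
rule count_reads:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "counts/{sample}.txt"
    run:
        import gzip
        with gzip.open(input[0], "rt") as fh:
            n_reads = sum(1 for _ in fh) // 4   # 4 FASTQ lines per read
        with open(output[0], "w") as out:
            out.write(f"{n_reads}\n")
```

A concrete target would then be requested as, for example, `snakemake --use-conda --cores 1 qc/sampleA_fastqc.html`.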

-- written 12 months ago by WouterDeCoster

I tested snakemake and nextflow. I prefer the latter because there is no 'final' target: the workflow is evaluated at runtime after each step.

-- written 12 months ago by Pierre Lindenbaum

Granted that I haven't tried nextflow at all...

I prefer [nextflow] because there is no 'final' target

I like the idea of a final target since it makes explicit what the pipeline is going to produce.

The workflow [of nextflow] is evaluated at runtime after each step

I would say this is a disadvantage, since it makes the pipeline less predictable. If a snakemake pipeline runs smoothly in --dryrun mode, you know that all dependencies are satisfied without actually executing anything.
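To make the 'final target' idea concrete, a minimal, hypothetical Snakefile could look like this (sample names, paths, and the variant-calling command are invented):

```
SAMPLES = ["sampleA", "sampleB"]

# The first rule declares the final targets the pipeline is expected to produce.
rule all:
    input:
        expand("results/{sample}.vcf", sample=SAMPLES)

rule call_variants:
    input:
        bam="mapped/{sample}.bam",
        ref="ref/genome.fa"
    output:
        "results/{sample}.vcf"
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv -o {output}"
```

Running `snakemake --dryrun` (or `-n`) then only resolves the DAG down from `rule all`, without executing any job.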

(@Bogdan, I'm very happy with snakemake, but I haven't tried the other options you mention.)

-- written 12 months ago by dariober

I would say this is a disadvantage since it makes the pipeline less predictable

Yes, that's what I thought at first glance. However, imagine the following simple workflow:

1) get the distinct chromosomes in a VCF file
2) for each chromosome, extract the transcripts containing one or more non-synonymous mutations from an annotated VCF file
3) for each transcript, extract the FASTA sequence from a reference genome

Steps 2 and 3 are highly parallelizable; however, you don't know the list of contigs and transcripts BEFORE you have scanned the VCF file in step 1...
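In a pull-based system like snakemake, this pattern needs the checkpoint mechanism, so that the DAG can be re-evaluated once step 1 has produced its output. A rough sketch of steps 1 and 2, with invented file names and placeholder shell commands (step 3 would follow the same pattern):

```
def step2_targets(wildcards):
    # Read the chromosome list once the checkpoint (step 1) has run;
    # snakemake defers evaluation of this function until then.
    chrom_file = checkpoints.list_chromosomes.get().output[0]
    with open(chrom_file) as fh:
        chroms = [line.strip() for line in fh if line.strip()]
    return expand("transcripts/{chrom}.txt", chrom=chroms)

rule all:
    input:
        step2_targets

# Step 1: distinct chromosomes in the VCF. Must be a checkpoint, not a rule,
# because downstream targets depend on its content.
checkpoint list_chromosomes:
    input:
        "annotated.vcf"
    output:
        "chromosomes.txt"
    shell:
        "grep -v '^#' {input} | cut -f1 | sort -u > {output}"

# Step 2: one job per chromosome (the real non-synonymous filter is left
# as a placeholder).
rule transcripts_per_chromosome:
    input:
        "annotated.vcf"
    output:
        "transcripts/{chrom}.txt"
    shell:
        "awk -v c='{wildcards.chrom}' '$1 == c' {input} > {output}"
```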

-- written 12 months ago by Pierre Lindenbaum

That is one of the disadvantages of pull-based systems, but they also have advantages over push-based systems. I have workflows where I need to download and process large amounts of data, and there snakemake's temp-file capabilities are a godsend: to safely delete files, you need to know the DAG.
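For example, marking an intermediate file with temp() makes snakemake delete it as soon as every job that consumes it has finished. A hypothetical sketch (URL and file names are invented):

```
rule download:
    output:
        temp("downloads/{sample}.bam")   # deleted automatically once no longer needed
    shell:
        "wget -O {output} https://example.org/data/{wildcards.sample}.bam"

rule summarise:
    input:
        "downloads/{sample}.bam"
    output:
        "summaries/{sample}.txt"
    shell:
        "samtools flagstat {input} > {output}"
```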

-- written 11 months ago by endrebak