Question: workflow management systems: WDL, CWL, Ruffus, Snakemake, etc.
Bogdan wrote, 12 months ago (Palo Alto, CA, USA):

Dear all,

Any comments or suggestions on which pipeline "language"/workflow management system to choose for wrapping up some data analysis pipelines on SGE/SLURM clusters? Among the possible choices:

-- Snakemake: https://snakemake.readthedocs.io/en/stable/

-- Ruffus: http://www.ruffus.org.uk/

-- WDL: https://software.broadinstitute.org/wdl/

-- CWL: https://www.commonwl.org/

thanks a lot,

-- bogdan


Personally, I use Snakemake. There is a thread here discussing this:

Snakemake vs. Nextflow: strengths and weaknesses

-- written 12 months ago by Medhat

My vote goes for snakemake. But I haven't tried the rest, so there is that.

Nevertheless, snakemake is quite broadly adopted, and many pipelines that you can adjust to your needs already exist. It's quite readable, easy to get started with, and allows throwing in some Python code, which is convenient. Its interaction with the conda package manager is also a plus.
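To illustrate those last two points, here is a minimal, hypothetical Snakefile fragment (the rule names, file paths, and the conda environment file are all invented for illustration):

```
# First rule: runs inside a conda environment when invoked with --use-conda.
rule fastqc:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "qc/{sample}_fastqc.html"
    conda:
        "envs/qc.yaml"            # hypothetical environment definition
    shell:
        "fastqc {input} --outdir qc"

# Second rule: arbitrary Python is allowed in a run: block.
rule count_reads:
    input:
        "reads/{sample}.fastq.gz"
    output:
        "counts/{sample}.txt"
    run:
        import gzip
        with gzip.open(input[0], "rt") as fh:
            n_reads = sum(1 for _ in fh) // 4   # 4 FASTQ lines per read
        with open(output[0], "w") as out:
            out.write(f"{n_reads}\n")
```

A concrete target would then be requested as, for example, `snakemake --use-conda --cores 1 qc/sampleA_fastqc.html`.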

-- written 12 months ago by WouterDeCoster

I tested snakemake and nextflow. I prefer the latter because there is no 'final' target: the workflow is evaluated at runtime after each step.

-- written 12 months ago by Pierre Lindenbaum

Granted that I haven't tried nextflow at all...

I prefer [nextflow] because there is no 'final' target

I like the idea of a final target since it makes explicit what the pipeline is going to produce.

The workflow [of nextflow] is evaluated at runtime after each step

I would say this is a disadvantage, since it makes the pipeline less predictable. If a snakemake pipeline runs smoothly in --dryrun mode, you know that all dependencies are satisfied without actually executing anything.
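To make the 'final target' idea concrete, a minimal, hypothetical Snakefile could look like this (sample names, paths, and the variant-calling command are invented):

```
SAMPLES = ["sampleA", "sampleB"]

# The first rule declares the final targets the pipeline is expected to produce.
rule all:
    input:
        expand("results/{sample}.vcf", sample=SAMPLES)

rule call_variants:
    input:
        bam="mapped/{sample}.bam",
        ref="ref/genome.fa"
    output:
        "results/{sample}.vcf"
    shell:
        "bcftools mpileup -f {input.ref} {input.bam} | bcftools call -mv -o {output}"
```

Running `snakemake --dryrun` (or `-n`) then only resolves the DAG down from `rule all`, without executing any job.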

(@Bogdan, I'm very happy with snakemake, but I haven't tried the other options you mention.)

-- written 12 months ago by dariober

I would say this is a disadvantage since it makes the pipeline less predictable

Yes, that's what I thought at first glance. However, imagine the following simple workflow:

1) get the distinct chromosomes in a VCF file
2) for each chromosome, extract the transcripts containing one or more non-synonymous mutations from an annotated VCF file
3) for each transcript, extract the FASTA sequence from a reference genome

Steps 2 and 3 are highly parallelizable; however, you don't know the list of contigs and transcripts BEFORE you have scanned the VCF file in step 1...
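In a pull-based system like snakemake, this pattern needs the checkpoint mechanism, so that the DAG can be re-evaluated once step 1 has produced its output. A rough sketch of steps 1 and 2, with invented file names and placeholder shell commands (step 3 would follow the same pattern):

```
def step2_targets(wildcards):
    # Read the chromosome list once the checkpoint (step 1) has run;
    # snakemake defers evaluation of this function until then.
    chrom_file = checkpoints.list_chromosomes.get().output[0]
    with open(chrom_file) as fh:
        chroms = [line.strip() for line in fh if line.strip()]
    return expand("transcripts/{chrom}.txt", chrom=chroms)

rule all:
    input:
        step2_targets

# Step 1: distinct chromosomes in the VCF. Must be a checkpoint, not a rule,
# because downstream targets depend on its content.
checkpoint list_chromosomes:
    input:
        "annotated.vcf"
    output:
        "chromosomes.txt"
    shell:
        "grep -v '^#' {input} | cut -f1 | sort -u > {output}"

# Step 2: one job per chromosome (the real non-synonymous filter is left
# as a placeholder).
rule transcripts_per_chromosome:
    input:
        "annotated.vcf"
    output:
        "transcripts/{chrom}.txt"
    shell:
        "awk -v c='{wildcards.chrom}' '$1 == c' {input} > {output}"
```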

-- written 12 months ago by Pierre Lindenbaum

That is one of the disadvantages of pull-based systems, but they also have advantages over push-based systems. I have workflows where I need to download and process large amounts of data, and there snakemake's temp-file capabilities are a godsend: to safely delete files, you need to know the DAG.
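For example, marking an intermediate file with temp() makes snakemake delete it as soon as every job that consumes it has finished. A hypothetical sketch (URL and file names are invented):

```
rule download:
    output:
        temp("downloads/{sample}.bam")   # deleted automatically once no longer needed
    shell:
        "wget -O {output} https://example.org/data/{wildcards.sample}.bam"

rule summarise:
    input:
        "downloads/{sample}.bam"
    output:
        "summaries/{sample}.txt"
    shell:
        "samtools flagstat {input} > {output}"
```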

-- written 11 months ago by endrebak