Question

How To Organize A Pipeline Of Small Scripts Together?

44

15.4 years ago

Giovanni M Dall'Olio 28k

In bioinformatics it is very common to end up with a lot of small scripts, each one with a different scope - plotting a chart, converting a file into another format, executing small operations - so it is very important to have a good way to glue them together, to define which should be executed before the others, and so on.

How do you deal with the problem? Do you use makefiles, taverna workflows, batch scripts, or any other solution?

pipeline general • 35k views

ADD COMMENT • link updated 20 months ago by Ram 45k • written 15.4 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 1 · 2010-02-26

23

15.4 years ago

Giovanni M Dall'Olio 28k

My favorite way of defining pipelines is by writing Makefiles, about which you can find a very good introduction in Software Carpentry for Bioinformatics: http://software-carpentry.org/v4/make/.

Although they were originally developed for compiling programs, Makefiles let you define which operations are needed to create each file, with a declarative syntax that is a bit old-style but still does its job. Each Makefile is composed of a set of rules, which define the operations needed to produce a file and which can be combined to make a pipeline. Another advantage of Makefiles is conditional execution of tasks, so you can stop the execution of a pipeline and get back to it later without having to repeat calculations. However, one of the big disadvantages of Makefiles is their old syntax... in particular, rules are identified by the names of the files that they create, and there is no such thing as 'titles' for rules, which makes things more tricky.
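To make the idea of conditional execution concrete, here is a minimal Python sketch of the check a make rule performs: rebuild a target only when it is missing or older than one of its inputs. The helper names (`is_stale`, `run_rule`) are hypothetical, and this is only an illustration of the timestamp logic, not how make itself is implemented.

```python
import os


def is_stale(target, sources):
    """Return True if target is missing or older than any source file."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_rule(target, sources, action):
    """Run action(target, sources) only when the target is out of date.

    Returns True if the action ran, False if the target was already up to date.
    """
    if is_stale(target, sources):
        action(target, sources)
        return True
    return False
```

With something like this, a step such as `run_rule("aligned.fasta", ["raw.fasta"], align)` (where `align` is whatever function produces the file) would be skipped on a second run unless `raw.fasta` has changed in the meantime.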

I think one of the best solutions would be to use BioMake, which lets you define tasks with titles that are not the names of the output files. To understand it better, look at this example: you see that each rule has a title and a series of parameters like its output, inputs, comments, etc.

Unfortunately, I can't get biomake to run on my computer, as it requires very old dependencies and it is written in a very difficult perl. I have tried many alternatives and I think that rake is the closest to biomake, but unfortunately I don't understand ruby's syntax.

So, I am still looking for a good alternative... Maybe one day I will have the time to rewrite BioMake in python :-)

ADD COMMENT • link updated 5.8 years ago by Ram 45k • written 15.4 years ago by Giovanni M Dall'Olio 28k

0

can you do 'loops' with (bio)make ?

ADD REPLY • link 15.4 years ago by Pierre Lindenbaum 166k

0

No idea :-( Unfortunately I could never get it working. Anyway, since these tools usually have a declarative syntax, you don't write loops; you just apply a function to an array of values (e.g. like in R with the apply function).

ADD REPLY • link 15.4 years ago by Giovanni M Dall'Olio 28k

0

Could you please check the link you mention above? It is not working. I am going to start building a genome assembly and analysis pipeline, so I think the link may be helpful for me. Otherwise, could you please suggest something for beginners?

ADD REPLY • link 11.6 years ago by HG ★ 1.2k

0

I've updated the link, changing it to the new version of the tutorial in the latest Software Carpentry website. However, I liked the original version better (see http://web.archive.org/web/20120624090322/http://swc.scipy.org/lec/build.html )

ADD REPLY • link 11.6 years ago by Giovanni M Dall'Olio 28k

0

Thank you so much for your update. Could you please suggest something for beginners? I started my PhD just 6 months ago.

ADD REPLY • link 11.6 years ago by HG ★ 1.2k

score 18 · Answer 2 · 2010-02-26

My answer would be: don't bother. I've often found that many of the scripts I write are never used again after the initial use. Therefore spending time on a complex framework that tracks dependencies between scripts is a waste, because the results might be negative and you never revisit the analysis. Even if you do end up using the script multiple times, a simple hacky bash script might be more than enough to meet the requirements.

There will however be the 1-2% of initial analyses that return an interesting result and therefore need to be expanded with deeper investigation. I think this is the point to invest more time in organising the project. For me, I use Rake because it's simple and allows me to write in the language I'm used to (Ruby).

Overall I think pragmatism is the important factor in computational biology. Just do enough to get the results you need and only invest more time when it's necessary. There are so many blind alleys in computational analysis of biological data that it's not worth investing too much of your time until it's necessary.

score 17 · Answer 3 · 2010-03-01

17

15.4 years ago

Eric T. ★ 2.9k

The most important thing for me has been keeping a README file at the top of each project directory, where I write down not just how to run the scripts, but why I wrote them in the first place -- coming back to a project after a several-month lull, it's remarkably difficult to figure out what all the half-finished results mean without detailed notes.

That said:

- make is pretty handy for simple pipelines that need to be re-run a lot
- I'm also intrigued by waf and scons, since I use Python a lot
- If a pipeline only takes a couple of minutes to run, and you only re-run it every few days, coercing it into a build system doesn't really save time overall for that project
- But once you're used to working with a build system, the threshold where it pays off to use it on a new project drops dramatically

ADD COMMENT • link 15.4 years ago by Eric T. ★ 2.9k

2

Thanks. I think waf is too oriented towards compiling programs; I have tried to use it but I don't think it is very useful for what we need to do. As for your third point, I use a lot of .PHONY targets, which means that I just use a generic name (e.g. align_sequences, get_data, calculate_x) and the list of commands, with few dependencies. A bit like what I have shown in these slides: http://bioinfoblog.it/?p=29

ADD REPLY • link 15.4 years ago by Giovanni M Dall'Olio 28k

0

interesting frameworks

ADD REPLY • link 15.4 years ago by Istvan Albert 102k

score 10 · Answer 4 · 2010-03-04

This article may shed light onto how to organise bioinformatics projects. William Stafford Noble. "A quick guide to organizing computational biology experiments." PLoS Computational Biology. 5(7):e1000424, 2009

link: http://noble.gs.washington.edu/papers/noble2009quick.html

For me, using git and the directory structure that Bill Noble describes in this article has been a better approach than what I had before.

Ram · Answer 5 · 2010-02-26

9

15.4 years ago

Istvan Albert 102k

I don't have personal experience with this package but it is something that I plan to explore in the near future:

Ruffus, a lightweight Python module to run computational pipelines.

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 15.4 years ago by Istvan Albert 102k

3

I am the developer of Ruffus. This is designed to be a "make" replacement with simpler(!) syntax. I wanted all the power of make and more while writing "normal" python scripts.

Ruffus has undergone substantial development lately, especially to simplify the syntax and improve the error messages. If your experience was with the ruffus of a few months ago, it might be worth having a quick look again and seeing if it has improved enough to change your mind. (Or it just may not be your cup of tea!) As always suggestions (even critical comments) are welcome.

ADD REPLY • link 15.3 years ago by Leo Goodstadt ▴ 50

0

I have tried ruffus extensively, but in the end I decided I don't like its syntax. It complicates python's syntax, and in the end I recoded my pipeline as a simple python script, because I was getting errors that I couldn't understand well. What I am really looking for is something in the style of biomake/skam: http://skam.sourceforge.net/skam-intro.html

ADD REPLY • link updated 11.8 years ago by Istvan Albert 102k • written 15.4 years ago by Giovanni M Dall'Olio 28k

0

I think Snakemake could fit this bill.

(Personally I'll try Ruffus as well)

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 9.7 years ago by ajasja.ljubetic • 0

score 8 · Answer 6 · 2010-03-07

8

15.3 years ago

brentp 24k

I have done some work with ruffus but have reverted to using makefiles. Ruffus is a nice idea and implemented very nicely, but often, in a pipeline, I just want to overwrite files, since I'm exploring and making lots of changes and mistakes. With Ruffus, I found I was spending a lot of time tracking down which files had or had not changed. For some reason, it's easier for me to deal with a Makefile, with or without using dependencies. YMMV. I just order the Makefile with the steps that come first at the top and add stuff to the bottom as I extend the pipeline. This is very simple, but works well for now. I'm interested to see what other responses are added here.

As others have mentioned, a README.txt and documentation at the top of the script are a good idea. Also, for any script that takes more than 2 arguments, use a getopt equivalent (e.g. optparse in Python).
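A minimal sketch of such an option parser, using the standard-library argparse module rather than optparse. The script's purpose and the option names are invented purely for illustration.

```python
import argparse


def build_parser():
    """Build a parser for a hypothetical sequence-filtering script."""
    parser = argparse.ArgumentParser(
        description="Filter sequences by minimum length (illustrative only)."
    )
    # One positional argument plus two optional ones with defaults.
    parser.add_argument("input", help="input FASTA file")
    parser.add_argument("-o", "--output", default="filtered.fasta",
                        help="output file (default: %(default)s)")
    parser.add_argument("-m", "--min-length", type=int, default=100,
                        help="discard sequences shorter than this")
    return parser
```

In the script itself you would call `args = build_parser().parse_args()` and then read `args.input`, `args.output` and `args.min_length`; running the script with `-h` prints the generated help text.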

Finally, extract as much code as possible into tested/test-able libraries.

ADD COMMENT • link 15.3 years ago by brentp 24k

2

I am the developer of Ruffus (and just posted a reply to Istvan above).

Ruffus has recently acquired much more tracing (like make -n but understandable), and a "touch" mode (like make -t, so you can update selected parts of the pipeline).

I am always interested in hearing from people still using make, as it was the frustration of unmaintainable makefiles which drove us to develop Ruffus in the first place. I would be very grateful if you could email me more comments / feedback if you don't mind. Thanks

ADD REPLY • link 15.3 years ago by Leo Goodstadt ▴ 50

0

I guess it is hard to improve on a time tested approach such as make. Thanks for the review of ruffus.

ADD REPLY • link 15.3 years ago by Istvan Albert 102k

0

Try argparse (http://code.google.com/p/argparse/), which is an extended argument parsing library for python. I had a similar experience with ruffus: I tried it and then reverted to makefiles. I think the best is still biomake.

ADD REPLY • link 15.3 years ago by Giovanni M Dall'Olio 28k

0

Leo, thanks for the comment. I will check out Ruffus again and get back to you with comments. The trace and touch modes do sound good.

ADD REPLY • link 15.3 years ago by brentp 24k

score 7 · Answer 7 · 2010-03-04

7

15.3 years ago

Madelaine Gogol 5.3k

I generally use a simple shell script if I have multiple commands or scripts to run. I also try to make a notes.txt file to remind myself of what I did. Doesn't take long and comes in handy.

Things that you didn't plan to re-use get re-used all the time, in my experience...

ADD COMMENT • link 15.3 years ago by Madelaine Gogol 5.3k

2

I'll try it. I actually just googled "makefile bioinformatics" and found this, which was helpful: http://www.slideshare.net/giovanni/makefiles-bioinfo

ADD REPLY • link 15.3 years ago by Madelaine Gogol 5.3k

1

Try makefiles; they are basically the same as shell scripts, but you can define more than one task in a file.

ADD REPLY • link 15.3 years ago by Giovanni M Dall'Olio 28k

score 6 · Answer 8 · 2010-02-26

6

15.4 years ago

Chris ★ 1.6k

Since I work a lot with Python, I usually write a wrapper method that embeds the external script/program, i.e. calls it, parses its output and returns the desired information. The 'glueing' of several such methods then takes place within my Python code that calls all these wrappers. I guess that's a very common thing to do.
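A minimal sketch of what such a wrapper might look like, using the standard-library subprocess module. The function names (`run_tool`, `count_lines`) are hypothetical, and the "parsing" is reduced to counting output lines just to keep the example short.

```python
import subprocess


def run_tool(command):
    """Run an external command (a list of arguments) and return its stdout.

    Raises subprocess.CalledProcessError if the command exits non-zero.
    """
    result = subprocess.run(
        command, capture_output=True, text=True, check=True
    )
    return result.stdout


def count_lines(command):
    """Example wrapper: run a command and parse its output into a number."""
    return len(run_tool(command).splitlines())
```

The glue code then only deals with Python values, e.g. `n = count_lines(["grep", ">", "proteins.fasta"])` (file name made up), and a failing tool surfaces as an exception instead of silently producing empty output.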

Chris

ADD COMMENT • link 15.4 years ago by Chris ★ 1.6k

0

I know, but since I started using make, I can't do without certain features. For example, if you write a 'glue' script, you don't have conditional execution of tasks, so you will always have to run the whole pipeline at once, without the possibility of pausing it. Or again, with makefiles, if you change only one of the input files, make will only run the steps that are necessary to update the results, while a batch script will re-run everything. Moreover, a Makefile has a standard syntax and it is easier to understand what is happening.

ADD REPLY • link 15.4 years ago by Giovanni M Dall'Olio 28k

score 6 · Answer 9 · 2010-03-02

6

15.4 years ago

hadasa ★ 1.0k

Since I use Ruby quite often, I have found Rake very useful for creating simple pipelines. Rake has the idea of tasks, and tasks can have prerequisites, so you can create pipelines. See also biorake, an extension to rake that can be used to build database-backed workflows, on GitHub: http://github.com/jandot/biorake
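For readers who don't know Ruby, the idea of tasks with prerequisites can be sketched in Python as below. This is not Rake's API; `run_tasks` and the task names in the usage note are made up, and it relies on `graphlib` from the standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_tasks(tasks, prerequisites, goal):
    """Run goal and everything it depends on, prerequisites first.

    tasks: dict mapping task name -> zero-argument callable
    prerequisites: dict mapping task name -> list of task names it needs
    Returns the list of task names in the order they were run.
    """
    # Collect only the tasks reachable from the goal.
    needed, stack = set(), [goal]
    while stack:
        name = stack.pop()
        if name not in needed:
            needed.add(name)
            stack.extend(prerequisites.get(name, []))

    # Map each needed task to its predecessors and sort topologically.
    graph = {name: prerequisites.get(name, []) for name in needed}
    order = list(TopologicalSorter(graph).static_order())
    for name in order:
        tasks[name]()
    return order
```

Asking for a hypothetical `plot` task that needs `align`, which in turn needs `download`, runs the three in dependency order and leaves unrelated tasks untouched.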

ADD COMMENT • link 15.4 years ago by hadasa ★ 1.0k

0

Wow! I'm so glad you linked this! I've been looking for something just like BioRake. I, too, use rake tasks to automate a lot of stuff. I also really like using Rails in general, because it's incredibly easy to generate data displays and put my projects on the web.

ADD REPLY • link 14.8 years ago by Mohawkjohn ▴ 30

Ram · Answer 10 · 2010-03-06

4

15.3 years ago

Perry ▴ 290

Most of my work is in python, so I use paver, which is similar to makefiles or rake for ruby, but gives you access to all python libraries.

ADD COMMENT • link 15.3 years ago by Perry ▴ 290

0

paver does look like a good option, falling somewhere between a makefile and ruffus.

ADD REPLY • link updated 5.9 years ago by Ram 45k • written 15.3 years ago by brentp 24k

score 3 · Answer 11 · 2010-03-02

3

15.4 years ago

Darked89 4.7k

re Biomake:

It does look like a great tool but it is unsupported. What you get from Sourceforge is a snapshot with a README pointing you to one dialect of Prolog (XSB), only to learn when running the examples that the project moved to another one (SWI-Prolog). Unless you know Prolog and can fix it, Biomake was not functional when I last checked (Jan 2010).

ADD COMMENT • link 15.4 years ago by Darked89 4.7k

0

Yes, I checked it in January 2009 and it was the same as it is now. I also wrote to the author and he confirmed that he is working on a different project now, and he doesn't plan to work on biomake soon.

ADD REPLY • link 15.4 years ago by Giovanni M Dall'Olio 28k

Ram · Answer 12 · 2011-06-14

Something I've found useful in python is to set up a framework that allows me to run specific functions in a module from the command line. This allows all the similar scripts (usually data aggregation, plotting, and analysis) to be put into a single module and each step run from the command line. If I want to make a pipeline I just use a simple shell script.

The following code allows any function to be run from the command line just by passing the fxn's name as the first argument after the script. So for example, if you had a module named myModule.py with two fxns: findSpliceSites() and pretendToWork(), you can just do:

python myModule.py pretendToWork arg1 arg2

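with this at the bottom of the module:

if __name__ == "__main__":
    import sys
    # sys.argv[1] is the function name; sys.argv[2:] are its arguments
    submitArgs(globals()[sys.argv[1]], sys.argv[2:])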

submitArgs is just a fxn that takes a function and that function's arguments, and runs it.
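The body of submitArgs isn't shown above, so the following is only a guess at a minimal version, together with a small `dispatch` helper (a hypothetical name) that adds a usage message when the requested function doesn't exist. Note that every argument arrives as a string, so the called function has to convert types itself.

```python
def submitArgs(func, args):
    """Call func with a list of (string) arguments and return its result."""
    return func(*args)


def dispatch(argv, namespace):
    """Look up argv[1] in namespace and call it with the remaining arguments.

    argv is expected to look like sys.argv; namespace like globals().
    """
    prog = argv[0] if argv else "script"
    if len(argv) < 2 or not callable(namespace.get(argv[1])):
        raise SystemExit("usage: %s FUNCTION [ARGS...]" % prog)
    return submitArgs(namespace[argv[1]], argv[2:])
```

In the module, the main block would then shrink to `dispatch(sys.argv, globals())`.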

score 3 · Answer 13 · 2011-11-22

3

13.6 years ago

Manu Prestat 4.1k

This is exactly the philosophy of the biopieces project.

ADD COMMENT • link 13.6 years ago by Manu Prestat 4.1k

score 3 · Answer 14 · 2013-10-14

3

11.7 years ago

Sean Davis 27k

There are a lot of great approaches here. A relatively new player in this arena is Snakemake:

https://bitbucket.org/johanneskoester/snakemake/

It is make-like, but has a lot of niceties for working in cluster environments (not that it requires a cluster).

ADD COMMENT • link 11.7 years ago by Sean Davis 27k

Ram · Answer 15 · 2013-10-16

3

11.7 years ago

marko.k.laakso ▴ 80

Anduril (anduril.org) is a language-independent workflow engine, which provides pretty advanced mechanisms for iterative development and analysis. The system keeps track of the valid results and re-executes the parts that are affected by changes in their inputs, parameters or implementations. All intermediate results are accessible to the end user, and the system takes care of fault tolerance and parallelization of the execution (it's also possible to use remote hosts and cluster queues).

Anduril workflows are constructed with AndurilScript, which enables the use of conditional operations and responses to the outputs. The existing scripts and programs can be invoked using components such as StandardProcess and BashEvaluate but there are also hundreds of ready made components for various tasks in bioinformatics.

ADD COMMENT • link 11.7 years ago by marko.k.laakso ▴ 80

0

I second the use of Anduril because it is essentially "feature-complete". What I like most is that you do procedural programming in it (loops, functions) and parallel execution (on a cluster or locally) works like magic. The initial learning curve is a bit steep though. I recently summarized my experience with it in a few slides: http://www.slideshare.net/ChristianFrech/reproducible-bioinformatics-pipelines-with-docker-and-anduril

ADD REPLY • link 9.7 years ago by Christian ★ 3.1k

0

I looked at your slides. It is pretty clear from them that the value of anduril is not in the pipeline component but in the reporting component, no?

I find that very confusing - is it a pipeline or a result formatting and reporting tool? It seems ultimately a black box: data goes in on one end, no customization whatsoever, then plots come out on the other side. In my experience this is not a sustainable model of science. I dearly wish it worked that way but it does not really.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.7 years ago by Istvan Albert 102k

0

At its core, Anduril is just a general-purpose pipeline scripting language like many others (Snakemake, Bpipe, Ruffus, GNU make, etc.), plus some nice little extras (like reporting or ready-made components for many areas in bioinformatics). One key difference is the programming language, which is currently AndurilScript but will be Scala in version 2. The possibility to compile pretty reports from component outputs is just a bonus not even built into the core of the framework but provided by auxiliary components (which you can but don't have to use).

Components are not meant to hide anything but rather to allow modular code re-use. This is much like writing functions or classes in other programming languages. The source code of all components (typically written in Bash, R, Python, or Perl) is publicly available on Bitbucket and thus readily customizable. Whenever I find that available components do not fit my needs (which happens quite frequently actually), I just modify them, write my own, code directly in AndurilScript, or embed native Bash/R/Perl/Python code into the workflow script using the corresponding 'Evaluate' components. Thus, as a developer, you remain always in full control over the complete workflow, down to the very last parameter (I would not use it otherwise :-)

ADD REPLY • link 9.7 years ago by Christian ★ 3.1k

Ram · Answer 16 · 2014-01-22

I would also suggest bpipe, which "...provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'.". It seems intriguing; does anybody have experience with it?

Feature comparison:

Tool                                    Bpipe   Ruffus  Galaxy  Taverna Pegasus
GUI                                     No      No      Yes     Yes     Yes
Command Line (*)                        Yes     Yes     No      No      Yes
Audit Trail                             Yes     Yes     Yes     Yes     Yes
Built in Cluster Support                Yes     No      Yes     Yes     Yes
Workflow Sharing                        No      No      Yes     Yes     Yes
Online Data Source Integration          No      No      Yes     Yes     Yes
Need Programming Knowledge?             No      Yes     No      No      Yes
Easy Shell Script Portability           Yes     No      No      No      No

score 2 · Answer 17 · 2011-06-14

Another option is to use something like eHive: http://www.ncbi.nlm.nih.gov/pubmed/20459813
The simplest interface is ensembl-hive/scripts/cmd_hive.pl, which submits your scripts as an "ad-hoc analysis". After that, there are control rules to define something like: "run this clustering script that generates 10,000 clusters, then run a second script on each of them on as many CPUs as are available, then wait until all have finished, then run a third script on the results".
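The control flow described in that quote (one job fans out into many, then everything is collected) can be sketched in plain Python as below. This has nothing to do with eHive's own interface; `fan_out_fan_in` is a made-up name and the standard-library thread pool merely stands in for a compute farm.

```python
from concurrent.futures import ThreadPoolExecutor


def fan_out_fan_in(make_clusters, per_cluster, combine, max_workers=4):
    """Run make_clusters once, per_cluster on every cluster in parallel,
    wait for all of them, then pass the collected results to combine."""
    clusters = make_clusters()
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map() preserves input order; list() waits for every job to finish.
        results = list(pool.map(per_cluster, clusters))
    return combine(results)
```

Threads are enough when each `per_cluster` call mostly waits on an external script; for CPU-heavy Python functions, `ProcessPoolExecutor` from the same module would be the usual substitute.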

score 2 · Answer 18 · 2011-06-14

If you believe you are going to use your pipeline a fair amount over time, I would recommend rake regardless of what language it's written in. If you believe other people would be interested in using your pipeline, consider forking the Galaxy project and converting your scripts into modules for Galaxy. I'm finding it's not as pretty from the developer's end as from the user's end, but still usable.

score 1 · Answer 19 · 2016-04-20

1

9.2 years ago

chen ★ 2.5k

I think Common Workflow Language may be a good choice

It is developed by SBG

https://github.com/common-workflow-language/workflows

ADD COMMENT • link 9.2 years ago by chen ★ 2.5k

0

Actually CWL is a community led initiative - over 30 academic and industrial organizations have contributed to the specification. Seven Bridges (SBG) is one of those organizations. We use CWL to run workflows on our cloud platform. Also, our SDK enables users to bring their command line tools to our platform by creating CWL wrappers for them. Our experience with CWL and our SDK led to the http://rabix.io/ open source initiative, which includes two tools, the Rabix Composer, an IDE for CWL and the Rabix Executor, for running CWL locally and at scale. Both tools are in Beta, with 1.0 release planned for later in the summer.

ADD REPLY • link 8.0 years ago by adrian.sharma ▴ 20

score 0 · Answer 20 · 2016-04-20

I have been looking for a good solution to this problem since working on my PhD (late 90s) and have tried various approaches over the years. We've recently developed NextflowWorkbench as a practical way to build and reuse workflows. The platform offers both a user interface and scripting, and is designed for both bioinformaticians who write workflows and users of the workflows (biologists with limited computational experience). Workflows run on anything from a laptop to clusters provisioned in the cloud. See http://workflow.campagnelab.org and this preprint. Compared to other systems and tools, NW helps with reproducibility by providing ways to automatically install the tools you need on the machine where the workflow will run.

We aim for the interface to be like using a commercial integrated development environment (very similar to JetBrains tools: IDEA/PyCharm, etc, in fact it is built on the JetBrains IDEA platform).