Question: How To Organize A Pipeline Of Small Scripts Together?
gravatar for Giovanni M Dall'Olio
8.3 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

In bioinformatics it is very common to end up with a lot of small scripts, each with a different purpose - plotting a chart, converting a file from one format to another, executing small operations - so it is very important to have a good way to glue them together, to define which should be executed before the others, and so on.

How do you deal with the problem? Do you use Makefiles, Taverna workflows, batch scripts, or some other solution?

pipeline general • 21k views
ADD COMMENTlink modified 2.2 years ago by chen1.6k • written 8.3 years ago by Giovanni M Dall'Olio26k
gravatar for Giovanni M Dall'Olio
8.3 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

My favorite way of defining pipelines is by writing Makefiles, about which you can find a very good introduction in Software Carpentry for Bioinformatics.

Although they were originally developed for compiling programs, Makefiles let you declare which operations are needed to create each file, with a declarative syntax that is a bit old-fashioned but still does its job. Each Makefile is composed of a set of rules, each defining the operations needed to produce one file; rules can be combined together to make a pipeline. Another advantage of Makefiles is conditional execution of tasks: you can stop a pipeline and get back to it later without having to repeat calculations that are already done. However, one of the big disadvantages of Makefiles is the old syntax... in particular, rules are identified by the names of the files that they create, and there is no such thing as a 'title' for a rule, which makes them trickier to read.
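The conditional-execution idea is simple enough to sketch in a few lines of Python (the `build` helper here is hypothetical, purely for illustration): a target is rebuilt only when it is missing or older than one of its inputs, which is essentially the timestamp comparison make performs for every rule.

```python
import os

def needs_rebuild(target, sources):
    """Return True if target is missing or older than any source -
    the same timestamp test make uses to skip up-to-date rules."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)

def build(target, sources, action):
    """Run action only when the target is out of date."""
    if needs_rebuild(target, sources):
        action()
```

This is why a stopped pipeline can be resumed cheaply: re-running it only triggers the actions whose outputs are stale.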

I think one of the best solutions would be to use BioMake, which allows you to define tasks with titles that are not the names of the output files. To understand it better, look at this example: each rule has a title and a series of parameters such as its output, inputs, comments, etc.

Unfortunately, I can't get biomake to run on my computer, as it requires very old dependencies and is written in very hard-to-follow Perl. I have tried many alternatives and I think rake is the closest to biomake, but unfortunately I don't understand Ruby's syntax.

So I am still looking for a good alternative... Maybe one day I will have time to re-write BioMake in Python :-)

ADD COMMENTlink modified 4.6 years ago • written 8.3 years ago by Giovanni M Dall'Olio26k

Can you do 'loops' with (bio)make?

ADD REPLYlink written 8.3 years ago by Pierre Lindenbaum108k

No idea :-( Unfortunately I could never get it working. Anyway, since these tools usually have a declarative syntax, you don't write loops; you just apply a function to an array of values (e.g. like the apply function in R).

ADD REPLYlink written 8.3 years ago by Giovanni M Dall'Olio26k

Could you please check whether the link you mention above is still working? I am going to start building a genome assembly and analysis pipeline, so I think the link may be helpful for me. Otherwise, could you please suggest something for beginners?

ADD REPLYlink written 4.6 years ago by HG1.1k

I've updated the link, changing it to the new version of the tutorial on the latest Software Carpentry website. However, I liked the original version better (see )

ADD REPLYlink written 4.6 years ago by Giovanni M Dall'Olio26k

Thank you so much for the update... could you please suggest something for beginners? I started my PhD just 6 months ago.

ADD REPLYlink written 4.6 years ago by HG1.1k
gravatar for Michael Barton
8.3 years ago by
Michael Barton1.8k
Akron, Ohio, United States
Michael Barton1.8k wrote:

My answer would be: don't bother. I've often found that many of the scripts I write are never used again after their initial use. Spending time on a complex framework that tracks dependencies between scripts is therefore a waste, because the results might be negative and you never revisit the analysis. Even if you do end up using a script multiple times, a simple hacky bash script might be more than enough to meet the requirements.

There will, however, be the 1-2% of initial analyses that return an interesting result and therefore need to be expanded with deeper investigation. I think this is the point to invest more time in organising the project. For me, that means Rake, because it's simple and lets me write in the language I'm used to (Ruby).

Overall I think pragmatism is the important factor in computational biology. Just do enough to get the results you need, and only invest more time when it's necessary. There are so many blind alleys in computational analysis of biological data that it's not worth investing too much of your time until you have to.

ADD COMMENTlink written 8.3 years ago by Michael Barton1.8k

Well, a Makefile with phony rules is not much more difficult to write than a script. Even if you use a script only once, it is useful to write down the options and the files on which you launched it. Moreover, sometimes you use binaries like the ones from EMBOSS or BLAST, which have a lot of options, and you need to record the options you used to make your results reproducible.

ADD REPLYlink written 8.3 years ago by Giovanni M Dall'Olio26k
gravatar for Eric T.
8.3 years ago by
Eric T.2.2k
San Francisco, CA
Eric T.2.2k wrote:

The most important thing for me has been keeping a README file at the top of each project directory, where I write down not just how to run the scripts, but why I wrote them in the first place -- coming back to a project after a several-month lull, it's remarkably difficult to figure out what all the half-finished results mean without detailed notes.

That said:

  • make is pretty handy for simple pipelines that need to be re-run a lot
  • I'm also intrigued by waf and scons, since I use Python a lot
  • If a pipeline only takes a couple of minutes to run, and you only re-run it every few days, coercing it into a build system doesn't really save time overall for that project
  • But once you're used to working with a build system, the threshold where it pays off to use it on a new project drops dramatically
ADD COMMENTlink written 8.3 years ago by Eric T.2.2k

Thanks. I think waf is too oriented towards compiling programs; I have tried it, but I don't think it is very useful for what we need to do. As for your third point, I use a lot of .PHONY targets, which means I just use a generic name (e.g. align_sequences, get_data, calculate_x) and a list of commands, with few dependencies. A bit like what I have shown in these slides:

ADD REPLYlink written 8.3 years ago by Giovanni M Dall'Olio26k

interesting frameworks

ADD REPLYlink written 8.3 years ago by Istvan Albert ♦♦ 77k
gravatar for Manuel Corpas
8.3 years ago by
Manuel Corpas650
Manuel Corpas650 wrote:

This article may shed light on how to organise bioinformatics projects: William Stafford Noble, "A quick guide to organizing computational biology experiments," PLoS Computational Biology 5(7):e1000424, 2009.


For me, using git and the directory structure that Bill Noble mentions in this article has been a better approach than what I had before.

ADD COMMENTlink written 8.3 years ago by Manuel Corpas650
gravatar for Istvan Albert
8.3 years ago by
Istvan Albert ♦♦ 77k
University Park, USA
Istvan Albert ♦♦ 77k wrote:

I don't have personal experience with this package but it is something that I plan to explore in the near future:

Ruffus, a lightweight Python module to run computational pipelines.

ADD COMMENTlink written 8.3 years ago by Istvan Albert ♦♦ 77k

I am the developer of Ruffus. It is designed to be a "make" replacement with simpler(!) syntax. I wanted all the power of make and more, while writing "normal" Python scripts.

Ruffus has undergone substantial development lately, especially to simplify the syntax and improve the error messages. If your experience was with the ruffus of a few months ago, it might be worth having a quick look again and seeing if it has improved enough to change your mind. (Or it just may not be your cup of tea!) As always suggestions (even critical comments) are welcome.

ADD REPLYlink written 8.2 years ago by Leo Goodstadt50

I have tried ruffus extensively, but in the end I decided I don't like its syntax. It complicates Python's syntax, and in the end I recoded my pipeline as a simple Python script, because I was getting errors that I couldn't understand well. What I am really looking for is something in the style of biomake/skam:

ADD REPLYlink modified 4.8 years ago by Istvan Albert ♦♦ 77k • written 8.3 years ago by Giovanni M Dall'Olio26k

I think Snakemake could fit this bill.

(Personally I'll try Ruffus as well)

ADD REPLYlink written 2.7 years ago by ajasja.ljubetic0
gravatar for brentp
8.3 years ago by
Salt Lake City, UT
brentp22k wrote:

I have done some work with ruffus but have reverted to using Makefiles. Ruffus is a nice idea and implemented very nicely, but often, in a pipeline, I just want to overwrite files, since I'm exploring and making lots of changes and mistakes. With ruffus, I found I was spending a lot of time tracking down which files had or had not changed. For some reason, it's easier for me to deal with a Makefile, with or without using dependencies. YMMV. I just order the Makefile with steps that come first at the top and add stuff to the bottom as I extend the pipeline. This is very simple, but works well for now. I'm interested to see what other responses are added here.

As others have mentioned, a README.txt and documentation at the top of each script are a good idea. Also, for any script that takes more than 2 arguments, use a getopt equivalent (e.g. optparse in Python).
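As a sketch of that advice, here is what a declarative command line might look like using argparse (optparse's successor in the Python standard library); the option names and the "alignment" framing are made up for illustration:

```python
import argparse

def parse_args(argv=None):
    """Declare the script's options so usage is self-documenting
    (running with -h prints help for free)."""
    parser = argparse.ArgumentParser(description="Toy alignment step")
    parser.add_argument("input", help="input FASTA file")
    parser.add_argument("-o", "--output", default="out.txt",
                        help="where to write results (default: %(default)s)")
    parser.add_argument("--threads", type=int, default=1,
                        help="number of worker threads")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(args.input, args.output, args.threads)
```

Beyond two arguments, this beats positional parsing: options are typed, defaulted, and documented in one place.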

Finally, extract as much code as possible into tested/test-able libraries.

ADD COMMENTlink written 8.3 years ago by brentp22k

I am the developer of Ruffus (and just posted a reply to Istvan above).

Ruffus has recently acquired much more tracing (like make -n, but understandable), and a "touch" mode (like make -t, so you can update selected parts of the pipeline).

I am always interested in hearing from people still using make, as it was the frustration of unmaintainable Makefiles that drove us to develop Ruffus in the first place. I would be very grateful if you could email me more comments/feedback if you don't mind. Thanks.

ADD REPLYlink written 8.2 years ago by Leo Goodstadt50

I guess it is hard to improve on a time tested approach such as make. Thanks for the review of ruffus.

ADD REPLYlink written 8.3 years ago by Istvan Albert ♦♦ 77k

Try argparse, an extended argument-parsing library for Python. I had a similar experience with ruffus: I tried it and then reverted to Makefiles. I think the best is still biomake.

ADD REPLYlink written 8.3 years ago by Giovanni M Dall'Olio26k

Leo, thanks for the comment. I will check out Ruffus again and get back to you with comments. The trace and touch modes do sound good.

ADD REPLYlink written 8.2 years ago by brentp22k
gravatar for Madelaine Gogol
8.3 years ago by
Madelaine Gogol5.0k
Kansas City
Madelaine Gogol5.0k wrote:

I generally use a simple shell script if I have multiple commands or scripts to run. I also try to make a notes.txt file to remind myself of what I did. Doesn't take long and comes in handy.

Things that you didn't plan to re-use get re-used all the time, in my experience...

ADD COMMENTlink written 8.3 years ago by Madelaine Gogol5.0k

I'll try it. I actually just googled "makefile bioinformatics" and found this, which was helpful:

ADD REPLYlink written 8.2 years ago by Madelaine Gogol5.0k

Try Makefiles; they're basically the same as shell scripts, but you can define more than one task in a file.

ADD REPLYlink written 8.3 years ago by Giovanni M Dall'Olio26k
gravatar for Chris
8.3 years ago by
Chris1.6k wrote:

Since I work a lot with Python, I usually write a wrapper method that embeds the external script/program, i.e. calls it, parses its output and returns the desired information. The 'glueing' of several such methods then takes place within my Python code that calls all these wrappers. I guess that's a very common thing to do.


ADD COMMENTlink written 8.3 years ago by Chris1.6k

I know that approach, but since I started using make, I can't do without certain features. For example, if you write a 'glue' script, you don't have conditional execution of tasks, so you always have to run the whole pipeline at once, without the possibility of pausing it. With a Makefile, if you change only one of the input files, make will run only the steps necessary to update the results, while a batch script will re-run everything. Moreover, a Makefile has a standard syntax, and it is easier to understand what is happening.

ADD REPLYlink written 8.3 years ago by Giovanni M Dall'Olio26k
gravatar for hadasa
8.3 years ago by
hadasa1.0k wrote:

Since I use Ruby quite often, I have found Rake very useful for creating simple pipelines. Rake has the idea of tasks, and tasks can have prerequisites, which lets you build pipelines. See also, on GitHub, an extension to rake that can be used to build database-backed workflows.

ADD COMMENTlink written 8.3 years ago by hadasa1.0k

Wow! I'm so glad you linked this! I've been looking for something just like BioRake. I, too, use rake tasks to automate a lot of stuff. I also really like using Rails in general, because it's incredibly easy to generate data displays and put my projects on the web.

ADD REPLYlink written 7.7 years ago by Mohawkjohn30
gravatar for Perry
8.3 years ago by
Perry290 wrote:

Most of my work is in Python, so I use paver, which is similar to Makefiles or rake for Ruby, but gives you access to all Python libraries.

ADD COMMENTlink written 8.3 years ago by Perry290

paver does look like a good option, falling somewhere between a makefile and ruffus.

ADD REPLYlink written 8.3 years ago by brentp22k
gravatar for Darked89
8.3 years ago by
Barcelona, Spain
Darked894.2k wrote:

re Biomake:

It does look like a great tool, but it is unsupported. What you get from SourceForge is a snapshot with a README pointing you to one dialect of Prolog (XSB), only to learn when running the examples that the project has moved to another one (SWI-Prolog). Unless you know Prolog and can fix it yourself, Biomake is not functional, as of my last check (Jan 2010).

ADD COMMENTlink modified 8.3 years ago • written 8.3 years ago by Darked894.2k

Yes, I checked it in January 2009 and it was the same as it is now. I also wrote to the author, and he confirmed that he is working on a different project now and doesn't plan to work on biomake any time soon.

ADD REPLYlink written 8.3 years ago by Giovanni M Dall'Olio26k
gravatar for Sequencegeek
7.0 years ago by
Sequencegeek730 wrote:

Something I've found useful in Python is to set up a framework that allows me to run specific functions in a module from the command line. This lets all the similar scripts (usually data aggregation, plotting, and analysis) be put into a single module, with each step run from the command line. If I want to make a pipeline, I just use a simple shell script.

The following code allows any function to be run from the command line just by passing the function's name as the first argument after the script. For example, if you had a module with two functions, findSpliceSites() and pretendToWork(), you could just run:

python pretendToWork arg1 arg2

if __name__ == "__main__":
    import sys
    # look up the function named in the first argument,
    # and pass it the remaining command-line arguments
    submitArgs(globals()[sys.argv[1]], sys.argv[2:])

*submitArgs is just a function that takes a function and that function's arguments, and runs it.
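A minimal self-contained version of this dispatcher might look like the following; since the original submitArgs isn't shown, its implementation here is a guess, and pretendToWork is a stand-in task:

```python
import sys

def submitArgs(fxn, args):
    """Call fxn with the given command-line arguments (all strings)."""
    return fxn(*args)

def pretendToWork(x, y):
    """Toy pipeline step: pretend the two arguments were processed."""
    return f"processed {x} and {y}"

if __name__ == "__main__":
    # first argument picks the function, the rest are its parameters
    print(submitArgs(globals()[sys.argv[1]], sys.argv[2:]))
```

Each shell-script pipeline step then becomes one `python <module> <function> <args>` line.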

ADD COMMENTlink written 7.0 years ago by Sequencegeek730
gravatar for Manu Prestat
6.6 years ago by
Manu Prestat3.8k
Marseille, France
Manu Prestat3.8k wrote:

This is exactly the philosophy of the biopieces project.

ADD COMMENTlink written 6.6 years ago by Manu Prestat3.8k
gravatar for Sean Davis
4.7 years ago by
Sean Davis24k
National Institutes of Health, Bethesda, MD
Sean Davis24k wrote:

There are a lot of great approaches here. A relatively new player in this arena is Snakemake:

It is make-like, but has a lot of niceties for working in cluster environments (not that it requires a cluster).

ADD COMMENTlink written 4.7 years ago by Sean Davis24k
gravatar for marko.k.laakso
4.7 years ago by
European Union
marko.k.laakso80 wrote:

Anduril is a language-independent workflow engine that provides fairly advanced mechanisms for iterative development and analysis. The system keeps track of valid results and re-executes the parts that are affected by changes in their inputs, parameters, or implementations. All intermediate results are accessible to the end user, and the system takes care of fault tolerance and parallelization of the execution (it's also possible to use remote hosts and cluster queues).

Anduril workflows are constructed with AndurilScript, which enables the use of conditional operations and responses to outputs. Existing scripts and programs can be invoked using components such as StandardProcess and BashEvaluate, but there are also hundreds of ready-made components for various tasks in bioinformatics.

ADD COMMENTlink written 4.7 years ago by marko.k.laakso80
I second the use of Anduril because it is essentially "feature-complete". What I like most is that you can do procedural programming in it (loops, functions), and parallel execution (on a cluster or locally) works like magic. The initial learning curve is a bit steep, though. I recently summarized my experience with it in a few slides:
ADD REPLYlink written 2.6 years ago by Christian2.6k

I looked at your slides. It is pretty clear from them that the value of Anduril is not in the pipeline component but in the reporting component, no?

I find that very confusing - is it a pipeline or a result formatting and reporting tool? It seems ultimately a black box: data goes in on one end, with no customization whatsoever, then plots come out on the other side. In my experience this is not a sustainable model of science; I dearly wish it worked that way, but it does not really.


ADD REPLYlink written 2.6 years ago by Istvan Albert ♦♦ 77k

At its core, Anduril is just a general-purpose pipeline scripting language like many others (Snakemake, Bpipe, Ruffus, GNU make, etc.), plus some nice little extras (like reporting or ready-made components for many areas in bioinformatics). One key difference is the programming language, which is currently AndurilScript but will be Scala in version 2. The possibility to compile pretty reports from component outputs is just a bonus not even built into the core of the framework but provided by auxiliary components (which you can but don't have to use).

Components are not meant to hide anything but rather to allow modular code re-use. This is much like writing functions or classes in other programming languages. The source code of all components (typically written in Bash, R, Python, or Perl) is publicly available on Bitbucket and thus readily customizable. Whenever I find that available components do not fit my needs (which happens quite frequently actually), I just modify them, write my own, code directly in AndurilScript, or embed native Bash/R/Perl/Python code into the workflow script using the corresponding 'Evaluate' components. Thus, as a developer, you remain always in full control over the complete workflow, down to the very last parameter (I would not use it otherwise :-)

ADD REPLYlink written 2.6 years ago by Christian2.6k
gravatar for
4.4 years ago by
European Union wrote:

I would also suggest bpipe, which "...provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'." It seems intriguing; does anybody have experience with it?

Feature comparison:

Tool    | GUI | Command Line (*) | Audit Trail | Built-in Cluster Support | Workflow Sharing | Online Data Source Integration | Need Programming Knowledge? | Easy Shell Script Portability
Bpipe   | No  | Yes              | Yes         | Yes                      | No               | No                             | No                          | Yes
Ruffus  | No  | Yes              | Yes         | No                       | No               | No                             | Yes                         | No
Galaxy  | Yes | No               | Yes         | Yes                      | Yes              | Yes                            | No                          | No
Taverna | Yes | No               | Yes         | Yes                      | Yes              | Yes                            | No                          | No
Pegasus | Yes | Yes              | Yes         | Yes                      | Yes              | Yes                            | Yes                         | No
ADD COMMENTlink modified 4.4 years ago by brentp22k • written 4.4 years ago by
gravatar for 2184687-1231-83-
7.0 years ago by
2184687-1231-83-4.9k wrote:

Another option is to use something like eHive:
The simplest interface is ensembl-hive/scripts/ to submit your scripts as an "ad-hoc analysis". Beyond that, there are control rules that let you define something like: "run this clustering script that generates 10,000 clusters, then run a second script on each of them using as many CPUs as available, then wait until all have finished, then run a third script on the results."
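That fan-out-then-wait pattern (independent of eHive itself) can be sketched in plain Python with a process pool; the three step functions here are made up for illustration:

```python
from multiprocessing import Pool

def cluster(data):
    """Step 1: split the input into chunks ('clusters')."""
    return [data[i::3] for i in range(3)]

def analyse(chunk):
    """Step 2: run on each cluster, potentially in parallel."""
    return sum(chunk)

def summarise(results):
    """Step 3: runs only after every analysis has finished."""
    return sum(results)

def run_pipeline(data):
    clusters = cluster(data)
    with Pool() as pool:
        # map() blocks until all workers are done - the 'wait' step
        results = pool.map(analyse, clusters)
    return summarise(results)

if __name__ == "__main__":
    print(run_pipeline(list(range(10))))
```

A system like eHive adds what this sketch lacks: scheduling across cluster nodes, retries, and bookkeeping of which jobs already ran.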

ADD COMMENTlink written 7.0 years ago by 2184687-1231-83-4.9k
gravatar for Lythimus
7.0 years ago by
Lythimus200 wrote:

If you believe you are going to use your pipeline a fair amount over time, I would recommend rake, regardless of what language your scripts are written in. If you believe other people would be interested in using your pipeline, consider forking the Galaxy project and converting your scripts into modules for Galaxy. I'm finding it's not as pretty from the developer's end as from the user's end, but it is still usable.

ADD COMMENTlink written 7.0 years ago by Lythimus200
gravatar for chen
2.2 years ago by
chen1.6k wrote:

I think the Common Workflow Language (CWL) may be a good choice

It is developed by SBG

ADD COMMENTlink written 2.2 years ago by chen1.6k

Actually CWL is a community led initiative - over 30 academic and industrial organizations have contributed to the specification. Seven Bridges (SBG) is one of those organizations. We use CWL to run workflows on our cloud platform. Also, our SDK enables users to bring their command line tools to our platform by creating CWL wrappers for them. Our experience with CWL and our SDK led to the open source initiative, which includes two tools, the Rabix Composer, an IDE for CWL and the Rabix Executor, for running CWL locally and at scale. Both tools are in Beta, with 1.0 release planned for later in the summer.

ADD REPLYlink written 11 months ago by adrian.sharma20
gravatar for fac2003
2.2 years ago by
United States
fac2003170 wrote:

I have been looking for a good solution to this problem since working on my PhD (late 90s) and have tried various approaches over the years. We've recently developed NextflowWorkbench as a practical way to build and reuse workflows. The platform offers both a user interface and scripting, and is designed both for bioinformaticians who write workflows and for users of the workflows (biologists with limited computational experience). Workflows run on a laptop all the way up to clusters provisioned in the cloud. See and this preprint. Compared to other systems and tools, NW helps with reproducibility by providing ways to automatically install the tools you need on the machine where the workflow will run.

We aim for the interface to be like using a commercial integrated development environment (very similar to JetBrains tools: IDEA/PyCharm, etc, in fact it is built on the JetBrains IDEA platform).

ADD COMMENTlink written 2.2 years ago by fac2003170
Powered by Biostar version 2.3.0