Question: How To Organize A Pipeline Of Small Scripts Together?
(+18) Giovanni M Dall'Olio (London, UK) asked, 4.1 years ago:

In bioinformatics it is very common to end up with a lot of small scripts, each with a different scope - plotting a chart, converting a file into another format, executing a small operation - so it is very important to have a good way to glue them together, to define which should be executed before the others, and so on.

How do you deal with the problem? Do you use makefiles, Taverna workflows, batch scripts, or some other solution?

(+15) Michael Barton (Akron, Ohio, United States) wrote, 4.1 years ago:

My answer would be: don't bother. I've often found that many of the scripts I write are never used again after their initial use. Spending time on a complex framework that manages dependencies between scripts is therefore wasted, because the results might be negative and you never revisit the analysis. Even if you do end up running the script multiple times, a simple hacky bash script might be more than enough to meet the requirements.

There will, however, be the 1-2% of initial analyses that return an interesting result and therefore need to be expanded with deeper investigation. I think that is the point to invest more time in organising the project. I use Rake because it's simple and lets me write in the language I'm used to (Ruby).

Overall I think pragmatism is the important factor in computational biology. Do just enough to get the results you need, and only invest more time when it's necessary. There are so many blind alleys in computational analysis of biological data that it's not worth investing too much of your time until you have to.


Well, a Makefile with phony rules is not that difficult to write. Even if you use a script only once, it is useful to write down the options and the files you ran it on. Moreover, sometimes you use binaries like the ones from EMBOSS or BLAST, which have a lot of options, and you need to record the options you used to make your results reproducible.

-- Giovanni M Dall'Olio, 4.1 years ago
(+15) Etal (Athens, GA) wrote, 4.1 years ago:

The most important thing for me has been keeping a README file at the top of each project directory, where I write down not just how to run the scripts, but why I wrote them in the first place -- coming back to a project after a several-month lull, it's remarkably difficult to figure out what all the half-finished results mean without detailed notes.

That said:

  • make is pretty handy for simple pipelines that need to be re-run a lot
  • I'm also intrigued by waf and scons, since I use Python a lot
  • If a pipeline only takes a couple of minutes to run, and you only re-run it every few days, coercing it into a build system doesn't really save time overall for that project
  • But once you're used to working with a build system, the threshold where it pays off to use it on a new project drops dramatically

Thanks. I think waf is too oriented towards compiling programs; I have tried it, but I don't think it is very useful for what we need to do. As for your third point, I use a lot of .PHONY targets, which means I just use a generic name (e.g. align_sequences, get_data, calculate_x) and a list of commands, with few dependencies. A bit like what I have shown in these slides: http://bioinfoblog.it/?p=29

-- Giovanni M Dall'Olio, 4.1 years ago

Interesting frameworks.

-- Istvan Albert, 4.1 years ago
(+12) Giovanni M Dall'Olio (London, UK) wrote, 4.1 years ago:

My favorite way of defining pipelines is writing Makefiles, for which you can find a very good introduction in Software Carpentry: http://software-carpentry.org/v4/make/ .

Although make was originally developed for compiling programs, Makefiles let you define the operations needed to create each file, with a declarative syntax that is a bit old-style but still does its job. Each Makefile is composed of a set of rules, each defining the operations needed to compute one file, and the rules can be combined together to make a pipeline. Another advantage of makefiles is conditional execution of tasks: you can stop a pipeline and get back to it later, without having to repeat calculations. However, one of the big disadvantages of Makefiles is the old syntax... in particular, rules are identified by the names of the files they create, and there is no such thing as a 'title' for a rule, which makes things more tricky.

I think one of the best solutions would be to use BioMake, which allows you to define tasks with titles that are not the names of the output files. To understand it better, look at this example: you see that each rule has a title and a series of parameters like its output, inputs, comments, etc.

Unfortunately, I can't get biomake to run on my computer, as it requires very old dependencies and is written in very difficult Prolog. I have tried many alternatives, and I think rake is the closest to biomake, but unfortunately I don't understand Ruby's syntax.

So, I am still looking for a good alternative... Maybe one day I will have the time to re-write BioMake in Python :-)
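
In the meantime, here is a rough sketch in Python of what I mean: rules registered under human-readable titles, with make-style timestamp checks for conditional execution (the task decorator, the rule title, and the muscle command are invented for illustration):

import os
import sys

tasks = {}

def task(title, inputs=(), output=None):
    """Register a rule under a human-readable title, biomake-style.
    The rule is skipped when its output already exists and is newer
    than all of its inputs (make-style conditional execution)."""
    def decorator(fn):
        def wrapper():
            if output and os.path.exists(output) and all(
                    os.path.getmtime(output) >= os.path.getmtime(i)
                    for i in inputs):
                print(title + " is up to date")
                return
            fn()
        tasks[title] = wrapper
        return wrapper
    return decorator

# invented example rule: the title is 'align_sequences', not the output file name
@task("align_sequences", inputs=("seqs.fasta",), output="seqs.aln")
def align():
    os.system("muscle -in seqs.fasta -out seqs.aln")

if __name__ == "__main__":
    tasks[sys.argv[1]]()   # e.g. python pipeline.py align_sequences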


Can you do 'loops' with (bio)make?

-- Pierre Lindenbaum, 4.1 years ago

No idea :-( Unfortunately I could never get it working. Anyway, since these tools usually have a declarative syntax, you don't write loops: you just apply a rule to an array of values (e.g. like R's apply function).

-- Giovanni M Dall'Olio, 4.1 years ago

Could you please check the link you mention above? It is not working. I am going to start building a genome assembly and analysis pipeline, so I think the link may be helpful for me. Otherwise, could you please suggest something for beginners?

-- HG, 4 months ago

I've updated the link, changing it to the new version of the tutorial on the latest Software Carpentry website. However, I liked the original version better (see http://web.archive.org/web/20120624090322/http://swc.scipy.org/lec/build.html ).

-- Giovanni M Dall'Olio, 4 months ago

Thank you so much for your update. Could you please suggest something for beginners? I started my PhD just 6 months ago.

-- HG, 4 months ago
(+8) Istvan Albert (University Park, USA) wrote, 4.1 years ago:

I don't have personal experience with this package but it is something that I plan to explore in the near future:

Ruffus, a lightweight Python module for running computational pipelines.


I am the developer of Ruffus. It is designed to be a "make" replacement with simpler(!) syntax. I wanted all the power of make and more while writing "normal" Python scripts.

Ruffus has undergone substantial development lately, especially to simplify the syntax and improve the error messages. If your experience was with the Ruffus of a few months ago, it might be worth having a quick look again to see if it has improved enough to change your mind. (Or it may just not be your cup of tea!) As always, suggestions (even critical comments) are welcome.
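
To give a flavour of the style, a two-step pipeline is just a couple of decorated functions. A toy sketch (the file names and shell commands are invented; @transform, suffix() and pipeline_run() are the documented building blocks):

import os
from ruffus import transform, suffix, pipeline_run

# invented input files, for illustration only
@transform(["sample1.fasta", "sample2.fasta"], suffix(".fasta"), ".aln")
def align(input_file, output_file):
    # re-run automatically only when the input is newer than the output
    os.system("muscle -in %s -out %s" % (input_file, output_file))

@transform(align, suffix(".aln"), ".tree")
def build_tree(input_file, output_file):
    # runs once for each alignment produced by the previous task
    os.system("FastTree %s > %s" % (input_file, output_file))

pipeline_run([build_tree])   # runs align first, then build_tree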

-- Leo Goodstadt, 4.1 years ago

I have tried Ruffus extensively, but I decided I don't like its syntax: it complicates Python's, and in the end I recoded my pipeline as a simple Python script, because I was getting errors I couldn't understand well. What I am really looking for is something in the style of biomake/skam: http://skam.sourceforge.net/skam-intro.html

-- Giovanni M Dall'Olio, 4.1 years ago
(+8) Manuel Corpas (Cambridge) wrote, 4.1 years ago:

This article may shed light on how to organise bioinformatics projects: William Stafford Noble. "A quick guide to organizing computational biology experiments." PLoS Computational Biology 5(7):e1000424, 2009.

link: http://noble.gs.washington.edu/papers/noble2009quick.html

For me, using git and the directory structure that Bill Noble describes in this article has been a better approach than what I had before.

(+7) brentp (Denver, Colorado) wrote, 4.1 years ago:

I have done some work with Ruffus but have reverted to using Makefiles. Ruffus is a nice idea and implemented very nicely, but often, in a pipeline, I just want to overwrite files, since I'm exploring and making lots of changes and mistakes. With Ruffus, I found I was spending a lot of time tracking down which files had or had not changed. For some reason it's easier for me to deal with a Makefile, with or without dependencies. YMMV. I just order the Makefile with the steps that come first at the top, and add stuff to the bottom as I extend the pipeline. This is very simple, but works well for now. I'm interested to see what other responses are added here.

As others have mentioned, a README.txt and documentation at the top of each script are a good idea. Also, for any script that takes more than two arguments, use a getopt equivalent (e.g. optparse in Python), as in the sketch below.
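
For example, a minimal optparse skeleton (the option names and defaults are arbitrary):

from optparse import OptionParser

# self-documenting options: running the script with -h prints usage for free
parser = OptionParser(usage="usage: %prog [options] input.fasta")
parser.add_option("-o", "--out", dest="out", default="out.txt",
                  help="output file [default: %default]")
parser.add_option("-e", "--evalue", dest="evalue", type="float",
                  default=1e-5, help="e-value cutoff [default: %default]")
options, args = parser.parse_args()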

Finally, extract as much code as possible into tested/test-able libraries.


I am the developer of Ruffus (and just posted a reply to Istvan above).

Ruffus has recently acquired much more tracing (like make -n, but understandable) and a "touch" mode (like make -t, so you can update selected parts of the pipeline).

I am always interested in hearing from people still using make, as it was the frustration of unmaintainable makefiles that drove us to develop Ruffus in the first place. I would be very grateful if you could email me more comments / feedback if you don't mind. Thanks.

-- Leo Goodstadt, 4.1 years ago

I guess it is hard to improve on a time-tested approach such as make. Thanks for the review of Ruffus.

-- Istvan Albert, 4.1 years ago

Try argparse (http://code.google.com/p/argparse/), which is an extended argument-parsing library for Python. I had a similar experience with Ruffus: I tried it and then reverted to makefiles. I think the best is still biomake.
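
The equivalent skeleton with argparse (again, the argument names are arbitrary):

import argparse

# positional and optional arguments, with type conversion built in
parser = argparse.ArgumentParser(description="align a FASTA file")
parser.add_argument("fasta", help="input FASTA file")
parser.add_argument("--evalue", type=float, default=1e-5,
                    help="e-value cutoff (default: %(default)s)")
args = parser.parse_args()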

-- Giovanni M Dall'Olio, 4.1 years ago

Leo, thanks for the comment. I will check out Ruffus again and get back to you with comments. The trace and touch modes do sound good.

-- brentp, 4.1 years ago
(+6) Chris (Munich) wrote, 4.1 years ago:

Since I work a lot with Python, I usually write a wrapper method that embeds the external script/program, i.e. calls it, parses its output, and returns the desired information. The 'glueing' of several such methods then takes place within the Python code that calls all these wrappers. I guess that's a very common thing to do.
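
For example, a wrapper around a BLAST search might look like this rough sketch (the flags assume BLAST+, and only the first three columns of the tabular output are kept):

import subprocess

def run_blastn(query, db):
    """Wrap blastn: run it, parse its tabular output, and return only
    (query id, subject id, percent identity) for each hit."""
    out = subprocess.check_output(
        ["blastn", "-query", query, "-db", db, "-outfmt", "6"])
    hits = []
    for line in out.decode().splitlines():
        fields = line.split("\t")
        hits.append((fields[0], fields[1], float(fields[2])))
    return hits

# glueing: feed one wrapper's result straight into the next step, e.g.
# best_hit = max(run_blastn("query.fa", "nt"), key=lambda h: h[2])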

Chris


I know that approach, but since I started using make I can't do without certain features. For example, if you write a 'glue' script you don't have conditional execution of tasks, so you always have to run the whole pipeline at once, without the possibility of pausing it. Also, with makefiles, if you change only one of the input files, only the steps necessary to update the results are run, while a batch script re-runs everything. Moreover, a Makefile has a standard syntax, and it is easier to understand what is happening.

-- Giovanni M Dall'Olio, 4.1 years ago
(+6) hadasa wrote, 4.1 years ago:

Since I use Ruby quite often, I have found Rake very useful for creating simple pipelines. Rake has the notion of tasks, and tasks can have prerequisites, which lets you build pipelines. See also biorake, an extension to Rake that can be used to build database-backed workflows, on GitHub: http://github.com/jandot/biorake


Wow! I'm so glad you linked this! I've been looking for something just like BioRake. I, too, use rake tasks to automate a lot of stuff. I also really like using Rails in general, because it's incredibly easy to generate data displays and put my projects on the web.

-- Mohawkjohn, 3.5 years ago
(+4) Madelaine Gogol (Kansas City) wrote, 4.1 years ago:

I generally use a simple shell script if I have multiple commands or scripts to run. I also try to make a notes.txt file to remind myself of what I did. Doesn't take long and comes in handy.

Things that you didn't plan to re-use get re-used all the time, in my experience...


I'll try it. I actually just googled "makefile bioinformatics" and found this, which was helpful: http://www.slideshare.net/giovanni/makefiles-bioinfo

-- Madelaine Gogol, 4.1 years ago

Try makefiles: they're basically the same as shell scripts, but you can define more than one task per file.

-- Giovanni M Dall'Olio, 4.1 years ago
(+4) Perry (Philadelphia) wrote, 4.1 years ago:

Most of my work is in Python, so I use paver, which is similar to makefiles or Ruby's rake, but gives you access to all Python libraries.
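
A toy pavement.py might look like the following sketch (built on paver's @task/@needs decorators and its sh() helper; the commands and URL are placeholders):

from paver.easy import task, needs, sh

@task
def get_data():
    sh("wget -q http://example.org/seqs.fasta")   # placeholder URL

@task
@needs("get_data")
def align():
    # paver runs get_data first, then this step; invoke with: paver align
    sh("muscle -in seqs.fasta -out seqs.aln")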


paver does look like a good option, falling somewhere between a makefile and Ruffus: http://blog.doughellmann.com/2009/01/converting-from-make-to-paver.html

-- brentp, 4.1 years ago
(+3) Darked89 (Barcelona, Spain) wrote, 4.1 years ago:

re Biomake:

It does look like a great tool, but it is unsupported. What you get from Sourceforge is a snapshot with a README pointing you to one dialect of Prolog (XSB), only to learn when running the examples that the project moved to another one (SWI-Prolog). Unless you know Prolog and can fix it yourself, Biomake is not functional, as of my last check (Jan 2010).


Yes, I checked it in January 2009 and it was the same as it is now. I also wrote to the author, and he confirmed that he is working on a different project now and doesn't plan to work on biomake any time soon.

-- Giovanni M Dall'Olio, 4.1 years ago
(+3) Sequencegeek (UCLA) wrote, 2.9 years ago:

Something I've found useful in Python is to set up a framework that lets me run specific functions in a module from the command line. This allows all the similar scripts (usually data aggregation, plotting, and analysis) to be put into a single module, and each step to be run from the command line. If I want to make a pipeline, I just use a simple shell script.

The following code allows any function to be run from the command line just by passing the function's name as the first argument after the script. So, for example, if you had a module named myModule.py with two functions, findSpliceSites() and pretendToWork(), you could just do:

python myModule.py pretendToWork arg1 arg2

if __name__ == "__main__":
    import sys
    # look up the function named in argv[1] and hand it the remaining arguments
    submitArgs(globals()[sys.argv[1]], sys.argv[1:])

*submitArgs is just a function that takes a function and that function's arguments, and runs it.
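
A minimal submitArgs could be as small as this sketch (no type conversion: every argument arrives as a string):

def submitArgs(fxn, args):
    """Run fxn with the command-line arguments that followed its name.
    args[0] is the function name itself, so it is skipped."""
    return fxn(*args[1:])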

(+3) Manu Prestat (Berkeley) wrote, 2.4 years ago:

This is exactly the philosophy of the biopieces project.

(+2) Lythimus wrote, 2.9 years ago:

If you believe you are going to use your pipeline a fair amount over time, I would recommend rake, regardless of what language the pipeline is written in. If you believe other people would be interested in using your pipeline, consider forking the Galaxy project and converting your scripts into Galaxy modules. I'm finding it's not as pretty from the developer's end as from the user's end, but it is still usable.

(+1) 2184687-1231-83- wrote, 2.9 years ago:

Another option is to use something like eHive: http://www.ncbi.nlm.nih.gov/pubmed/20459813 . The simplest interface is ensembl-hive/scripts/cmd_hive.pl, which submits your scripts as an "ad-hoc analysis". Beyond that, there are control rules to define something like: "run this clustering script that generates 10,000 clusters, then run a second script on each of them on as many CPUs as are available, then wait until all have finished, then run a third script on the results."

(+1) Sean Davis (Bethesda, MD) wrote, 6 months ago:

There are a lot of great approaches here. A relatively new player in this arena is Snakemake:

https://bitbucket.org/johanneskoester/snakemake/

It is make-like, but has a lot of niceties for working in cluster environments (not that it requires a cluster).

(+1) marko.k.laakso wrote, 6 months ago:

Anduril (anduril.org) is a language-independent workflow engine that provides fairly advanced mechanisms for iterative development and analysis. The system keeps track of valid results and re-executes the parts affected by changes in their inputs, parameters, or implementations. All intermediate results are accessible to the end user, and the system takes care of fault tolerance and parallelization of the execution (it is also possible to use remote hosts and cluster queues).

Anduril workflows are constructed with AndurilScript, which enables conditional operations and responses to outputs. Existing scripts and programs can be invoked using components such as StandardProcess and BashEvaluate, but there are also hundreds of ready-made components for various bioinformatics tasks.

(+1) ff.cc.cc wrote, 12 weeks ago:

I would also suggest bpipe, which "...provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'." It seems intriguing; does anybody have experience with it?

Feature comparison:

Tool    | GUI | Command Line (*) | Audit Trail | Built-in Cluster Support | Workflow Sharing | Online Data Source Integration | Need Programming Knowledge? | Easy Shell Script Portability
--------|-----|------------------|-------------|--------------------------|------------------|--------------------------------|-----------------------------|------------------------------
Bpipe   | No  | Yes              | Yes         | Yes                      | No               | No                             | No                          | Yes
Ruffus  | No  | Yes              | Yes         | No                       | No               | No                             | Yes                         | No
Galaxy  | Yes | No               | Yes         | Yes                      | Yes              | Yes                            | No                          | No
Taverna | Yes | No               | Yes         | Yes                      | Yes              | Yes                            | No                          | No
Pegasus | Yes | Yes              | Yes         | Yes                      | Yes              | Yes                            | Yes                         | No