In bioinformatics it is very common to end up with a lot of small scripts, each with a different scope (plotting a chart, converting a file into another format, executing small operations), so it is very important to have a good way to glue them together, to define which should be executed before the others, and so on.
How do you deal with the problem? Do you use Makefiles, Taverna workflows, batch scripts, or any other solution?
The most important thing for me has been keeping a README file at the top of each project directory, where I write down not just how to run the scripts, but why I wrote them in the first place. Coming back to a project after a several-month lull, it's remarkably difficult to figure out what all the half-finished results mean without detailed notes.
make is pretty handy for simple pipelines that need to be re-run a lot.
My answer would be: don't bother. I've often found that many of the scripts I write are never used again after the initial use. Spending time on a complex framework that tracks dependencies between scripts is therefore a waste, because the results might be negative and you may never revisit the analysis. Even if you do end up using a script multiple times, a simple hacky bash script might be more than enough to meet the requirements.
There will, however, be the 1-2% of initial analyses that return an interesting result and therefore need to be expanded with deeper investigation. I think this is the point to invest more time in organising the project. I use Rake because it's simple and allows me to write in the language I'm used to (Ruby).
Overall I think pragmatism is the important factor in computational biology. Just do enough to get the results you need and only invest more time when it's necessary. There are so many blind alleys in computational analysis of biological data that it's not worth investing too much of your time until it's necessary.
Although they were originally developed for compiling programs, Makefiles let you define which operations are needed to create each file, with a declarative syntax that is a bit old-style but still does its job. Each Makefile is composed of a set of rules, each of which defines the operations needed to produce a file, and the rules can be combined to make a pipeline. Another advantage of Makefiles is conditional execution of tasks: you can stop a pipeline and come back to it later without having to repeat calculations. However, one of the big disadvantages of Makefiles is the old syntax. In particular, rules are identified by the names of the files they create, and there is no such thing as 'titles' for rules, which makes things trickier.
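For readers who have not used make, the conditional execution described above boils down to a timestamp comparison. Here is a minimal Python sketch of that idea, not of make itself; `needs_rebuild`, `run_rule` and `to_upper` are names made up for illustration.

```python
import os


def needs_rebuild(target, sources):
    """Return True if target is missing or older than any of its sources."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(src) > target_mtime for src in sources)


def run_rule(target, sources, action):
    """Run action(target, sources) only when the target is out of date.

    Returns True if the action ran and False if it was skipped.
    """
    if needs_rebuild(target, sources):
        action(target, sources)
        return True
    return False


def to_upper(target, sources):
    """Hypothetical pipeline step: upper-case the first source into target."""
    with open(sources[0]) as src, open(target, "w") as out:
        out.write(src.read().upper())
```

Calling `run_rule("out.txt", ["in.txt"], to_upper)` twice in a row would do the work only the first time, which is what lets you stop a pipeline and pick it up again later.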
I think one of the best solutions would be to use BioMake, which lets you define tasks with titles that are not the names of the output files. To understand it better, look at this example: you see that each rule has a title and a series of parameters such as its output, inputs, comments, etc.
Unfortunately, I can't get BioMake to run on my computer, as it requires very old dependencies and is written in very difficult Perl. I have tried many alternatives and I think that Rake is the one closest to BioMake, but unfortunately I don't understand Ruby's syntax.
So, I am still looking for a good alternative... Maybe one day I will have the time to re-write BioMake in Python :-)
This article may shed light onto how to organise bioinformatics projects. William Stafford Noble. "A quick guide to organizing computational biology experiments." PLoS Computational Biology. 5(7):e1000424, 2009
For me, using git and the directory structure that Bill Noble mentions in this article has been a better approach than what I had before.
I have done some work with Ruffus but have reverted to using Makefiles. Ruffus is a nice idea and implemented very nicely, but often, in a pipeline, I just want to overwrite files, since I'm exploring and making lots of changes and mistakes. With Ruffus, I found I was spending a lot of time tracking down which files had or had not changed. For some reason, it's easier for me to deal with a Makefile, with or without using dependencies. YMMV. I just order the Makefile with the steps that come first at the top and add stuff to the bottom as I extend the pipeline. This is very simple, but works well for now. I'm interested to see what other responses are added here.
As others have mentioned, a README.txt and documentation at the top of the script are a good idea. Also, for any script that takes more than 2 arguments, use a getopt equivalent (e.g. optparse in Python).
Finally, extract as much code as possible into tested/test-able libraries.
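As a rough sketch of the getopt-style advice above, here is what it could look like with argparse, the standard-library successor to optparse. The script's purpose and its option names (`--output`, `--format`, `--verbose`) are invented for the example.

```python
import argparse


def build_parser():
    """Build the command-line parser for a hypothetical conversion script."""
    parser = argparse.ArgumentParser(
        description="Convert an input file to another format (illustrative only)."
    )
    # One required positional argument
    parser.add_argument("input", help="path to the input file")
    # Optional arguments with defaults, so short invocations still work
    parser.add_argument("-o", "--output", default="out.txt",
                        help="path to the output file (default: out.txt)")
    parser.add_argument("-f", "--format", choices=["fasta", "tab"],
                        default="fasta", help="output format")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="print progress messages")
    return parser
```

In a script you would call `build_parser().parse_args()`, and you get `--help` output for free, which doubles as documentation when you come back to the script months later.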
Since I work a lot with Python, I usually write a wrapper method that embeds the external script/program, i.e. calls it, parses its output and returns the desired information. The 'glueing' of several such methods then takes place within my Python code that calls all these wrappers. I guess that's a very common thing to do.
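A minimal sketch of such a wrapper, assuming the external program prints tab-separated name/count pairs. The function names are hypothetical, and `FAKE_TOOL` is only a stand-in that uses the Python interpreter itself so the example runs anywhere; in practice you would pass the real command line.

```python
import subprocess
import sys


def run_tool(command):
    """Run an external command and return its standard output as text.

    check=True raises CalledProcessError if the program exits with an error.
    """
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


def parse_counts(output):
    """Parse 'name<TAB>count' lines into a dict mapping name to int."""
    counts = {}
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, value = line.split("\t")
        counts[name] = int(value)
    return counts


def mapping_stats(command):
    """Wrapper: call the program, parse its output, return the information."""
    return parse_counts(run_tool(command))


# Stand-in for a real external program (assumption for this example only)
FAKE_TOOL = [sys.executable, "-c", "print('reads\\t100'); print('mapped\\t87')"]
```

The rest of the Python code then only ever sees `mapping_stats(...)` and a plain dict, so swapping the underlying program means changing one function.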
Since I use Ruby quite often, I have found Rake very useful for creating simple pipelines. Rake has the idea of tasks, and tasks can have prerequisites, so you can build pipelines out of them. See also biorake, an extension to Rake that can be used to build database-backed workflows, on GitHub: http://github.com/jandot/biorake
I generally use a simple shell script if I have multiple commands or scripts to run. I also try to make a notes.txt file to remind myself of what I did. Doesn't take long and comes in handy.
Things that you didn't plan to re-use get re-used all the time, in my experience...
It does look like a great tool, but it is unsupported. What you get from SourceForge is a snapshot with a README pointing you to one dialect of Prolog (XSB), only to learn when running the examples that the project has moved to another one (SWI-Prolog). Unless you know Prolog and can fix it, BioMake is not functional as of when I last checked (Jan 2010).
Something I've found useful in Python is to set up a framework that allows me to run specific functions in a module from the command line. This allows all the similar scripts (usually data aggregation, plotting, and analysis) to be put into a single module, with each step run from the command line. If I want to make a pipeline I just use a simple shell script.
The following code allows any function to be run from the command line just by passing the function's name as the first argument after the script. So for example, if you had a module named myModule.py with two functions, findSpliceSites() and pretendToWork(), you can just do:
python myModule.py pretendToWork arg1 arg2
if __name__ == "__main__":
    import sys
    # sys.argv[1] is the function name; the remaining items are its arguments
    submitArgs(globals()[sys.argv[1]], sys.argv[2:])
*submitArgs is just a function that takes a function and that function's arguments, and runs it.
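Since submitArgs is not shown, here is one self-contained way the whole thing could look. This is a guess at the idea rather than the original code: it uses an explicit COMMANDS table instead of globals() so that only the listed functions can be called, and the two function bodies are placeholders.

```python
import sys


def submitArgs(func, args):
    """Call func with a list of string arguments and return its result."""
    return func(*args)


def findSpliceSites(path):
    """Placeholder for a real analysis step."""
    return "would scan %s for splice sites" % path


def pretendToWork(*args):
    """Placeholder step that just echoes its arguments."""
    return "working on: " + ", ".join(args)


# Only functions listed here can be run from the command line
COMMANDS = {func.__name__: func for func in (findSpliceSites, pretendToWork)}


def main(argv):
    """Treat argv[0] as a function name and the rest as its arguments."""
    if not argv:
        print("usage: python myModule.py FUNCTION [ARGS...]")
        print("available: " + ", ".join(sorted(COMMANDS)))
        return None
    func = COMMANDS.get(argv[0])
    if func is None:
        print("unknown function: %s" % argv[0], file=sys.stderr)
        return None
    result = submitArgs(func, argv[1:])
    print(result)
    return result


if __name__ == "__main__":
    main(sys.argv[1:])
```

Note that everything arrives as a string, so functions that need numbers have to convert their arguments themselves.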
If you believe you are going to use your pipeline a fair amount over time, I would recommend Rake regardless of what language the pipeline is written in. If you believe other people would be interested in using your pipeline, consider forking the Galaxy project and converting your scripts into modules for Galaxy. I'm finding it's not as pretty from the developer's end as from the user's end, but it is still usable.
Another option is to use something like eHive: http://www.ncbi.nlm.nih.gov/pubmed/20459813
The simplest interface is ensembl-hive/scripts/cmd_hive.pl, which submits your scripts as "ad-hoc analysis". After that, there are control rules to define something like: "run this clustering script that generates 10,000 clusters, then run a second script on each of them using as many CPUs as are available, then wait until all have finished, then run a third script on the results."
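To make the control flow concrete, here is the same fan-out/fan-in pattern sketched with Python's standard library. This is not eHive and does not use its API; all three stage functions are trivial stand-ins for the real scripts.

```python
from concurrent.futures import ThreadPoolExecutor


def make_clusters(items, n_clusters):
    """Stage 1 (stand-in for a clustering script): split items into n groups."""
    return [items[i::n_clusters] for i in range(n_clusters)]


def process_cluster(cluster):
    """Stage 2 (stand-in): compute something per cluster, here its sum."""
    return sum(cluster)


def summarize(results):
    """Stage 3 (stand-in): combine the per-cluster results."""
    return {"clusters": len(results), "total": sum(results)}


def run_pipeline(items, n_clusters, max_workers=4):
    """Run stage 1, fan out stage 2 over the clusters, then run stage 3."""
    clusters = make_clusters(items, n_clusters)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes every result, so execution only continues
        # once all clusters have been processed
        results = list(pool.map(process_cluster, clusters))
    return summarize(results)
```

Threads are used here only to keep the sketch simple; for real CPU-bound scripts you would more likely launch separate processes or hand the jobs to a cluster scheduler, which is the part a system like eHive takes care of.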
There are a lot of great approaches here. A relatively new player in this arena is Snakemake.
It is make-like, but has a lot of niceties for working in cluster environments (not that it requires a cluster).
Anduril (anduril.org) is a language-independent workflow engine that provides fairly advanced mechanisms for iterative development and analysis. The system keeps track of the valid results and re-executes the parts that are affected by changes in their inputs, parameters or implementations. All intermediate results are accessible to the end user, and the system takes care of fault tolerance and parallelization of the execution (it's also possible to use remote hosts and cluster queues).
Anduril workflows are constructed with AndurilScript, which enables the use of conditional operations and responses to the outputs. Existing scripts and programs can be invoked using components such as StandardProcess and BashEvaluate, but there are also hundreds of ready-made components for various tasks in bioinformatics.
I would also suggest Bpipe, which "...provides a platform for running big bioinformatics jobs that consist of a series of processing stages - known as 'pipelines'." It seems intriguing; does anybody have experience with it?
Tool | GUI | Command Line (*) | Audit Trail | Built-in Cluster Support | Workflow Sharing | Online Data Source Integration | Need Programming Knowledge? | Easy Shell Script Portability
Bpipe | No | Yes | Yes | Yes | No | No | No | Yes
Ruffus | No | Yes | Yes | No | No | No | Yes | No
Galaxy | Yes | No | Yes | Yes | Yes | Yes | No | No
Taverna | Yes | No | Yes | Yes | Yes | Yes | No | No
Pegasus | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No