I have been working in bioinformatics for 10 months now. Although I have learned a lot, I still consider myself a newbie in many respects. One thing that regularly gives me nightmares is the development and maintenance of workflows for our projects. Most of them are currently analyses of Illumina 450k methylation data (and some ChIP-seq analyses are also on the way).
Truth is, I feel comfortable working with make. I have been a UNIX user for many years, and I do like the simplicity and power of Makefiles. Moreover, make is available on every system, so it has been tested a lot. I like to learn about and test different approaches and frameworks like Taverna and Pegasus, but I always suspect that any such tool will disappear in the long term while make remains (as it has for many years).
I have faced the typical problems of makefile-driven workflows: rules with multiple targets, defining ways of sharing information between workflows, avoiding code duplication by using system-wide scripts, etc. I somehow managed to keep going, but now I am running into some problems and I would like to ask for some advice or hints.
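For anyone hitting the same multiple-target issue: a common workaround in plain GNU Make is a sentinel file, so that a recipe producing several outputs runs only once (the file and script names below are hypothetical):

```make
# One Rscript invocation produces both red.txt and green.txt.
# Listing both as targets of one rule would run the recipe twice
# in parallel builds, so we route both through a sentinel.
export.done: samplesheet.csv
	Rscript export_channels.R samplesheet.csv red.txt green.txt
	touch export.done

# The real outputs depend on the sentinel and have no recipe of
# their own; make treats them as up to date once the sentinel is.
red.txt green.txt: export.done
```

(GNU Make 4.3+ also has grouped targets with `&:`, which solves this more directly if a recent make can be assumed.)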
The problem is that I usually develop scripts in R/Bioconductor, and I wanted to move my workflows towards something more zen (text files for communication, so I could even use UNIX command-line tools such as awk). Most of the time I am using objects like MethylSet, RGChannelSet, GRanges, ... For example, to import raw data from a set of IDAT files and a sample sheet, I can use the minfi package and read the whole experiment into an RGChannelSet. This object contains the red- and green-channel intensities, as well as the phenotype data from the sample sheet. Then, if I want to produce some output in text format, I run into problems.
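For context, the import step sketched above looks roughly like this in minfi (the directory name is a placeholder; recent minfi versions use `read.metharray.*`, older releases called the same readers `read.450k.*`):

```r
## Sketch: read a 450k experiment into an RGChannelSet with minfi.
library(minfi)

## Parse the sample sheet found under the (hypothetical) IDAT directory.
targets <- read.metharray.sheet("idat_dir")

## Read the IDAT files listed in the sheet; the result is an
## RGChannelSet holding red/green intensities plus the phenotype data.
rgSet <- read.metharray.exp(targets = targets)
```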
For example, I could choose to produce three text files: phenodata, raw red-channel data and raw green-channel data. That would be fine, and I could work on them with awk or grep to do other things. But I don't really like it when I consider that:
- I could miss information from the original Bioconductor object.
- Red and green files are not semantically related any more. It's up to me to remember to process them as a single item in downstream analyses.
- In later steps, I might need a script to read those separate files back into an RGChannelSet. Flexibility is maximal, but I could lose information along the way, or even mess things up by mixing channels from different sources.
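A minimal sketch of that round trip, assuming an existing RGChannelSet `rgSet` (all file names are placeholders), shows exactly where the information leak happens:

```r
## Export: three loosely coupled text files.
write.table(pData(rgSet),    "phenodata.txt", sep = "\t", quote = FALSE)
write.table(getRed(rgSet),   "red.txt",       sep = "\t", quote = FALSE)
write.table(getGreen(rgSet), "green.txt",     sep = "\t", quote = FALSE)

## Re-import: the channels must come from the same export run, and the
## array annotation has to be carried over by hand -- nothing in the
## text files enforces either constraint.
red   <- as.matrix(read.table("red.txt",   check.names = FALSE))
green <- as.matrix(read.table("green.txt", check.names = FALSE))
rg2   <- RGChannelSet(Green = green, Red = red,
                      annotation = annotation(rgSet))
pData(rg2) <- DataFrame(read.table("phenodata.txt"))
```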
I was previously using .RData files to pass information from task to task. I even thought about JSON for this, but I have to admit that plain, columnar text files still look stylish (and useful) to me.
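For the serialized-object route, `saveRDS`/`readRDS` are a slightly tidier alternative to `save`/`load` for one object per file, and they keep channels, phenodata and annotation together as a single workflow artifact:

```r
## End of task N: write the whole object as one file.
saveRDS(rgSet, "rgset.rds")

## Start of task N+1: read it back; no name clashes, no manual
## reassembly of channels and phenodata.
rgSet <- readRDS("rgset.rds")
```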
Has anybody experienced this kind of situation? Has anybody played with JSON for this kind of thing? Are Taverna or Pegasus worth the effort and time, or should I stick with the old, proven tools?
Any advice or hint will be much appreciated.
Another useful link. Thank you! Didn't know about Drake.
I tried Drake and it looks very promising, addressing the need for agile, data-driven bioinformatics pipeline development. What I didn't like compared to GNU Make is that it is fairly heavyweight (I had to download >150 MB on my Ubuntu machine for installation, which makes it rather elaborate to set up and run on other systems) and that it takes a couple of seconds every time it starts (because a huge .jar file gets loaded into memory). It's also fairly new and under active development, so bugs are to be expected.