Question: Makefile-Driven Workflows And Bioconductor Objects
gbayon160 wrote, 7.2 years ago:

Hi everybody.

I have been working in bioinformatics for 10 months now. Although I have learned a lot, I still consider myself a newbie in many respects. One of the things that regularly gives me nightmares is the development and maintenance of workflows for our projects. Most of them are currently analyses of Illumina 450k methylation data (and some ChIP-seq analyses are also on the way).

Truth is, I feel comfortable working with make. I have been a UNIX user for many years, and I like the simplicity and power of Makefiles. Moreover, make is available on every system, so it has been tested a lot. I like to learn about and try different approaches and frameworks like Taverna and Pegasus, but I always suspect that every such tool will disappear in the long term while make remains (as it has for many years).

I have faced the typical problems of makefile-driven workflows, such as rules with multiple targets, defining ways of sharing information between workflows, avoiding code duplication by using system-wide scripts, etc. I somehow managed to keep going, but now I am running into some problems and would like to ask for advice or hints.

The problem is that I usually develop scripts in R/Bioconductor, and I wanted to move my workflows toward something more zen (text files for communication, so I could even use UNIX command-line tools such as awk). Most of the time I am using objects like MethylSet, RGChannelSet, GRanges, and so on. For example, if I wanted to write a script for importing raw data from a set of IDAT files and a sample sheet, I could use the minfi package and read the whole experiment into an RGChannelSet. This object contains information for the red and green channels, as well as phenotype data from the sample sheet. But as soon as I want to produce output in a text file format, I run into problems.

For example, I could choose to produce three text files: phenotype data, red raw data, and green raw data. That would be fine, and I could work on them with awk or grep in order to do other things. But I don't really like it when I consider that:

  • I could miss information from the original Bioconductor object.
  • Red and green files are not semantically related any more. It's up to me to remember to process them as a single item in downstream analyses.
  • In later steps, I might need to write a script for reading those separate files back into an RGChannelSet. Flexibility is maximal, but I could lose information along the way, or even mess things up by mixing channels from different sources.

I was previously using .RData files to pass information from task to task. I have even thought about JSON for this, but I have to admit that the plain-text columnar format still looks stylish (and useful) to me.
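For instance, a single JSON document could keep the two channels and the phenotype data together, so their semantic link is never lost between pipeline steps. This is a Python sketch with made-up values, not minfi's actual export format:

```python
import json

# Hypothetical two-channel experiment: instead of writing phenotype, red,
# and green data to three unrelated text files, bundle them in one JSON
# document so the relationship between them is preserved. All values here
# are invented for illustration.
experiment = {
    "samples": ["S1", "S2"],
    "phenotype": {"S1": {"sex": "F", "age": 34}, "S2": {"sex": "M", "age": 41}},
    "red": {"cg00000029": [1523, 1489]},
    "green": {"cg00000029": [987, 1012]},
}

with open("experiment.json", "w") as fh:
    json.dump(experiment, fh, indent=2)

# Reading it back recovers all three pieces together, so a downstream step
# cannot accidentally mix channels from different sources.
with open("experiment.json") as fh:
    restored = json.load(fh)

print(sorted(restored))  # → ['green', 'phenotype', 'red', 'samples']
```

The trade-off, of course, is that a nested JSON file is harder to query with awk or grep than a flat column-oriented table.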

Has anybody experienced this kind of situation? Has anybody played with JSON for this kind of thing? Are Taverna or Pegasus worth the effort and time, or should I stick with the old, proven tools?

Any advice or hint will be much appreciated.

Regards, Gus

modified 6.2 years ago by tony.fischetti20 • written 7.2 years ago by gbayon160
Chris Miller21k (Washington University in St. Louis, MO) wrote, 7.2 years ago:

Someone pointed me to Drake the other day. I haven't used it, but it seems appropriate for a lot of bioinformatics tasks:

written 7.2 years ago by Chris Miller21k

Another useful link. Thank you! Didn't know about Drake.

written 7.2 years ago by gbayon160

I tried Drake and it looks very promising, addressing the need for agile, data-driven bioinformatics pipeline development. What I didn't like, compared to GNU Make, is that it is fairly heavyweight (I had to download >150 MB on my Ubuntu system for installation, which makes it rather laborious to set up and run on other systems) and that it took a couple of seconds every time I started it (because of a huge .jar file that gets loaded into memory). It's also fairly new and under active development, so bugs are to be expected.

written 7.1 years ago by Christian2.9k
Daniel Swan13k (Aberdeen, UK) wrote, 7.2 years ago:

I wonder whether that question really comes down to a requirement for building lightweight bioinformatics workflows. You might be interested in something like bpipe, perhaps? Not an endorsement, as I've not used it, but it might fit your requirements.

written 7.2 years ago by Daniel Swan13k

Looks good. Thanks for posting this tool.

written 7.2 years ago by Jeremy Leipzig19k

That is an amazing finding. Thank you very much for the link. bpipe looks like a great tool. And yes, you are quite right. I am also wondering about workflow-definition tools. But my current worries are more focused on the data interchange format between steps, as in the RGChannelSet example above.

written 7.2 years ago by gbayon160

I am currently using it, and will be publishing a paper with my bpipe pipeline definitions in the supplement. It's a really great tool, and I can't recommend it highly enough.

written 7.2 years ago by Matt Shirley9.3k
Jeremy Leipzig19k (Philadelphia, PA) wrote, 7.2 years ago:

I will go on record as saying Pegasus is not appropriate for agile development, and of course it won't really address your goal of keeping metadata human-readable. Have you looked into Ruffus, Rake, or Snakemake? Then you would be able to leverage a scripting language without losing the dependency-tree capabilities you get from Make, and without committing to something heavy.

With regard to file formats, I think JSON and YAML are viable until you get past, say, 100k rows.

Should we be using much more JSON in our delimited data formats?

modified 3 months ago by RamRS26k • written 7.2 years ago by Jeremy Leipzig19k

When I started reading the Pegasus documentation, I was a little overwhelmed by it. You know, we are a small lab, and I am the only bioinformatician. It seems to me that we are not playing in the same league. I have looked into Ruffus more than Rake, and it seems interesting. I love Python, but I now have a lot of working code in R/Bioconductor, and would prefer something more language-neutral, like the bpipe or snakemake mentioned below.

The link about JSON is an incredibly worthwhile read. Thanks a lot.

I think I have not explained myself very well, because my worries are more centered on the text file format than on the workflow-definition tool. I would like to avoid binary formats like RData, but retain the structure of the richer Bioconductor objects. But I think the answer to that is going to involve silver and bullets. ;)

written 7.2 years ago by gbayon160

It might be interesting to develop a package that writes data.frames as write.table would, but with their attributes printed as JSON in a commented-out header.
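That idea can be sketched in a few lines of Python: a tab-delimited table whose first line is a commented-out JSON blob carrying the metadata. The `#ATTRS` prefix and the function names are invented for illustration; this is not an existing package.

```python
import io
import json

def write_table_with_attrs(rows, header, attrs, fh):
    """Write a tab-delimited table; the object's attributes are serialized
    as JSON on a commented-out first line (invented '#ATTRS' convention)."""
    fh.write("#ATTRS " + json.dumps(attrs) + "\n")
    fh.write("\t".join(header) + "\n")
    for row in rows:
        fh.write("\t".join(str(v) for v in row) + "\n")

def read_table_with_attrs(fh):
    """Recover attributes, header, and rows from the format above."""
    first = fh.readline()
    attrs = json.loads(first[len("#ATTRS "):]) if first.startswith("#ATTRS ") else {}
    header = fh.readline().rstrip("\n").split("\t")
    rows = [line.rstrip("\n").split("\t") for line in fh]
    return attrs, header, rows

# Round-trip demo with made-up methylation values:
buf = io.StringIO()
write_table_with_attrs([["cg00000029", 0.42], ["cg00000108", 0.87]],
                       ["probe", "beta"],
                       {"array": "450k", "normalization": "none"}, buf)
buf.seek(0)
attrs, header, rows = read_table_with_attrs(buf)
```

Because the metadata line starts with `#`, tools like awk, grep, or R's own `read.table(comment.char = "#")` still see an ordinary delimited table.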

modified 7.2 years ago • written 7.2 years ago by Jeremy Leipzig19k
ed.liaw90 wrote, 7.2 years ago:

I ran into Snakemake while looking for a Make replacement. It doesn't have a large user base and it requires Python 3.2+ -- two reasons why I haven't started implementing anything in it yet. It does, however, have a good list of example bioinformatics pipelines, and it's Python, so you could handle Bioconductor objects as-is.

modified 7.2 years ago • written 7.2 years ago by ed.liaw90

you mean via rpy?

written 7.2 years ago by Jeremy Leipzig19k

Yep, rpy2 has Python 3.3 support too, AFAIK.

written 7.2 years ago by ed.liaw90

I took a glance at snakemake this morning. My impression was that it is fairly language-neutral, isn't it? Made in Python, yes, but able to express processes as scripts, bash tools, etc.

I didn't know I could handle Bioconductor objects from Python, either -- good to know. In fact, before actually posting the question, snakemake was my candidate. But I am still wondering about a text file format that could represent a Bioconductor object while allowing external bash-like tools to query and modify it.

written 7.2 years ago by gbayon160

To your first question, yeah: the shell commands are done through the subprocess module, so you'll write them just like you would in bash or Make, only encapsulated in strings.
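As a toy illustration of that point (assuming a POSIX shell with printf and awk available): a Snakemake-style shell directive is ultimately just a command string handed to the shell via subprocess, written exactly as it would appear in bash or in a Makefile recipe.

```python
import subprocess

# The command string below is plain shell, pipes and all, encapsulated
# in a Python string -- the same way Snakemake's shell() directive
# hands commands to the subprocess module.
cmd = r"printf 'chr1\t100\nchr2\t200\n' | awk '{sum += $2} END {print sum}'"
result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
print(result.stdout.strip())  # → 300
```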

To your second question, I don't really know. You could do JSON dumps -- and Python has good tools for that -- though I don't consider them super-friendly for tools like sed. High-dimensional data is pretty hard to deal with using only simple tools. I prefer to keep the data in Python and dump out what I need to troubleshoot on the spot.

modified 7.2 years ago • written 7.2 years ago by ed.liaw90

I agree with you. I guess I was pursuing something that cannot exist. I am now thinking of going back to working with .RData files, which let me keep the semantic relationships (something very useful, especially if we are talking about SummarizedExperiment or GenomicMethylSet objects), and I could also write a couple of small scripts for extracting data from and inserting data into .RData files.

After thinking about it all night (analysis paralysis, you know), I have to admit that although .RData files are binary, they are open and portable, so I guess the drawbacks are not that bad.

written 7.2 years ago by gbayon160
tony.fischetti20 wrote, 6.2 years ago:

I wrote a tool called "sake" that you might find helpful. It's specifically for computational science pipelines:

written 6.2 years ago by tony.fischetti20

The visualization feature is nice. Could you elaborate on the advantages and disadvantages over GNU make?

written 6.2 years ago by Christian2.9k

Absolutely! I wrote all about it in the documentation:

Make:

  • As a consequence of being so powerful for source-code compilation, Makefiles can sometimes be very hard to read and write, particularly for the unfamiliar.
  • Most make software assumes that if the timestamp of a file changes, then all subsequent steps that depend on that file need to be re-run in order to remain up-to-date. In many cases, though, the timestamp of a file can change while its contents remain the same, and thus a rebuild isn't necessary.
  • The syntax of Makefiles makes it difficult to intuit the flow of a pipeline. Additionally, outside tools have to be used in order to visualize the flow.
  • Steps like displaying help and cleaning intermediate files are not handled automatically by make and are prone to errors.

Sake:

  • Sakefiles are written in a very easy-to-read-and-write markup language.
  • Sake was originally born out of the frustration of rebuilding targets whenever a file's timestamp changes (which makes make difficult to use for data analysis with very long, time-consuming analytics); sake actually reads the file to determine whether a rebuild is really necessary.
  • The clean nature of the Sakefile makes it much easier to intuit the flow of a pipeline. Additionally, a visualization mechanism is built right in, producing an image of the dependency graph that is easy to study, easy to share, and aesthetically pleasing.
  • Sake handles some 'administrative' tasks for the user, which cuts down on hard-to-track-down errors.
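The content-versus-timestamp distinction can be sketched in a few lines of Python. This is a toy illustration of the general technique, not sake's actual implementation; the state-file layout is invented:

```python
import hashlib
import json
import os
import tempfile

def content_hash(path):
    """Hash a file's contents; a 'touch' changes the timestamp
    but leaves this value unchanged."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()

def needs_rebuild(src, state_file):
    """Content-based staleness check: rebuild only when the source's
    hash differs from the one recorded after the last build."""
    current = content_hash(src)
    if os.path.exists(state_file):
        with open(state_file) as fh:
            if json.load(fh).get(src) == current:
                return False
    with open(state_file, "w") as fh:
        json.dump({src: current}, fh)
    return True

# Demo: a timestamp change alone does not trigger a rebuild.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "input.txt")
state = os.path.join(tmpdir, ".build-state.json")
with open(src, "w") as fh:
    fh.write("some data")

first = needs_rebuild(src, state)   # True: nothing recorded yet
os.utime(src)                       # like 'touch': timestamp changes, content doesn't
second = needs_rebuild(src, state)  # False: hash unchanged, skip the rebuild
print(first, second)  # → True False
```

Plain make would re-run the step after the `os.utime` call; a content-hashing tool would not.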

And thanks for the kind words about the visualization :)

written 6.2 years ago by tony.fischetti20
Powered by Biostar version 2.3.0