Question

Best Practices For Pipeline Versioning

6

11.4 years ago

toni ★ 2.2k

Hi,

I am developing some processing pipelines (Quality Control, Mapping, Variant Calling ...) for NGS data.

It is probably bad practice, but at the beginning you make some choices about your pipeline (tools used, order of tools, home-made code for a particular step, ...) and the pipeline serves you well for quite a while. Then one day a brand-new mapper arrives, and you need to remove the old core mapping algorithm and plug in the new one (and you may need to adapt some of the processing around it as well). The new pipeline will generate new results with their own specificities, and one has to be able to recover the exact sequence of processing tools that produced them. The older pipeline is not going to be deleted and might be used again for a specific project.

There are several levels of changes that may need to be tracked. For instance, if we look at a pipeline as a series of processing boxes, changes could be:

1. A tool inside a box changes (BWA for Stampy for instance)
2. The version of a used tool in a box changes
3. Parameters supplied to a tool change (bwa sampe for bwa sampe -s)
4. Order of processing changes (MarkDup before IndelRealigner or the opposite)
5. Other minor changes in the code that could slightly affect the behavior of pipeline

Which changes, in your opinion, need to be tracked, and what is your experience or practice on this matter? Is this essentially a manual task, or do tools exist that provide automatic numbering?

How do you organize your code to enable this tracking?

Thanks for your input,

T.

pipeline • 4.2k views

updated 11.4 years ago by Ryan Dale 5.0k • written 11.4 years ago by toni ★ 2.2k

3

I would suggest that you always track all changes in your code. You never know when a bug will be introduced in your program or pipeline (if you did, you wouldn't have introduced it!). As I see it, there are two kinds of changes in a pipeline: (i) changes in your own code and (ii) changes in the code you only call or execute (for example, changing the version of a tool). Changes in your own code can be easily tracked, but changes in the external tools should be tracked explicitly (for example, by including the version of the tool in the path of the tool, such as /path/to/tool-v1.1/tool, or something similar).
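As a minimal sketch of the versioned-path idea, the snippet below recovers a tool's name and version from the directory that holds its executable. The naming convention (`name-vX.Y`) is only assumed from the example path above, and the function name `tool_version_from_path` is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Assumed convention: the tool lives in a directory named "<name>-v<version>".
VERSION_IN_DIR = re.compile(r"^(?P<name>.+)-v(?P<version>\d+(?:\.\d+)*)$")


def tool_version_from_path(tool_path):
    """Return (name, version) parsed from the directory containing the tool."""
    parent = PurePosixPath(tool_path).parent.name
    match = VERSION_IN_DIR.match(parent)
    if match is None:
        # Refuse to guess: an unversioned path cannot be traced later.
        raise ValueError(f"no version found in path: {tool_path}")
    return match.group("name"), match.group("version")


print(tool_version_from_path("/path/to/tool-v1.1/tool"))
```

Failing loudly on an unversioned path is a deliberate choice here: it keeps a run from silently recording results against an unknown tool version.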

11.4 years ago by Miguel Pignatelli ▴ 140

0

A question to understand how to answer you better. Have you tried any version control system, like hg, git or cvs?

11.4 years ago by Giovanni M Dall'Olio 28k

0

Yes, I am using SVN. The point here is to store, in the database that holds the pipelines' results, the versions of the pipelines used to generate them. Each version should point to the exact tools, and versions of those tools, that were used. I thought about using the SVN revision number, but since all the pipelines live in the same SVN repository, this number grows even when a single comma is modified, so at first glance it did not seem appropriate to me (but I may be wrong!). So I am a bit lost on that. And since several mapping pipelines should co-exist, for instance, I do not even know whether it is better to duplicate the code with another tool in a box and set a version number accordingly, or to manage this inside the module itself with "if ... else" statements. Another possibility would be to say that I have a BWA-Mapping pipeline and a Stampy-Mapping pipeline and to increment only the version numbers of these two pipelines, rather than having a generic "Mapping" pipeline with customizable tools in some boxes. As you can see, I am just starting to think about it, and what I am looking for here is advice on good practice for this kind of thing. I would prefer not to head down the wrong path.

11.4 years ago by toni ★ 2.2k

score 4 · Answer 1 · 2013-02-18

I try to write the "pipeline" code in a way that is traceable.

I track all the steps with XML files.

Most of my wrappers or pipelines keep track of the input arguments (one of the XML values would be the actual command passed), the version of the tool, the start and end date and time, and other information of possible interest. Then I produce the XML file (in Perl, using XML::Simple, which is probably not the most versatile module but, as the name suggests, relatively simple). The advantages are that the file is easy to re-read and parse, that it adapts easily to data structures (arrays, hashes, ...), and that you can very easily add new values later on (your XML files will differ, but will still be largely compatible).

Clearly you need to find a balance between what you want to track and at which steps, but if you keep all your parameters in a hash, for instance, dumping it to XML is very easy.
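To illustrate the "dump the hash to XML" idea, here is a rough Python equivalent using only the standard library (the answer itself uses Perl and XML::Simple, so this is a swapped-in sketch, not the author's code). The record keys, the `0.0.0` version and the file names in the command are placeholders, and `run_record_to_xml` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone


def run_record_to_xml(record):
    """Serialize one pipeline step (a flat dict, optionally with nested dicts) to XML."""
    root = ET.Element("step")
    for key, value in record.items():
        child = ET.SubElement(root, key)
        if isinstance(value, dict):
            # Nested parameters become <param name="...">value</param> entries.
            for name, setting in value.items():
                ET.SubElement(child, "param", name=str(name)).text = str(setting)
        else:
            child.text = str(value)
    return ET.tostring(root, encoding="unicode")


# Placeholder values: only "bwa sampe -s" comes from the thread itself.
record = {
    "tool": "bwa",
    "version": "0.0.0",
    "command": "bwa sampe -s ref.fa r1.sai r2.sai r1.fq r2.fq",
    "start": datetime.now(timezone.utc).isoformat(),
    "params": {"s": "true"},
}
print(run_record_to_xml(record))
```

Adding a new key to the dict later simply adds a new element, which is the forward-compatibility property described above.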

Alternatively, you could come up with a method like the one used in SAM files, where such information is stored in the data itself. It is easier to track, but it requires that the output format allows you to do so (FASTQ files, for instance, don't).
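For the SAM case, provenance goes into `@PG` header lines, whose standard tags include `ID`, `PN` (program name), `VN` (version) and `CL` (command line). The sketch below only formats such a line; the helper name `pg_header_line` and the version and file names are placeholders, and in practice the aligner or a SAM library would normally write this header for you.

```python
def pg_header_line(program_id, name, version, command_line):
    """Format a SAM @PG header line from the given provenance fields."""
    fields = [("ID", program_id), ("PN", name), ("VN", version), ("CL", command_line)]
    for tag, value in fields:
        # Header fields are tab-separated, so values must not contain tabs or newlines.
        if "\t" in value or "\n" in value:
            raise ValueError(f"{tag} value contains a tab or newline")
    return "@PG\t" + "\t".join(f"{tag}:{value}" for tag, value in fields)


# Placeholder version and file names, for illustration only.
print(pg_header_line("bwa", "bwa", "0.0.0", "bwa sampe -s ref.fa r1.sai r2.sai r1.fq r2.fq"))
```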

score 3 · Answer 2 · 2013-02-19

This is quite off-the-beaten-path and is probably overkill, but you might look into storing rich metadata with the results using a file system like iRODS. You could, for example, write the results into the iRODS system tagged with the current pipeline SVN revision and any other metadata you like.

score 3 · Answer 3 · 2013-02-19

I use ruffus with a homemade plugin system and a config file that is written to the output dir. It handles each of the points you bring up:

"1. A tool inside a box changes (BWA for Stampy for instance)"

The config specifies things like which plugin I want to use (e.g., "bwa" or "stampy" for the alignment). Plugins are designed to accept universal input (say, fastq for an aligner plugin) and generate universal output (e.g., bam), so they are interchangeable in the context of the pipeline. It admittedly takes some effort to write these plugins such that they are truly interchangeable.
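The plugin idea can be sketched roughly as a registry keyed by the name used in the config. This is not the author's plugin system: the registry, decorator and function names are hypothetical, and the plugin bodies are stubs that return a description of the step instead of running an aligner.

```python
ALIGNER_PLUGINS = {}


def aligner_plugin(name):
    """Decorator that registers an aligner under the name used in the config."""
    def register(func):
        ALIGNER_PLUGINS[name] = func
        return func
    return register


@aligner_plugin("bwa")
def align_with_bwa(fastq_path, bam_path, params):
    # Stub: a real plugin would run BWA and convert its output to BAM.
    return {"aligner": "bwa", "input": fastq_path, "output": bam_path, "params": params}


@aligner_plugin("stampy")
def align_with_stampy(fastq_path, bam_path, params):
    # Stub: a real plugin would run Stampy and convert its output to BAM.
    return {"aligner": "stampy", "input": fastq_path, "output": bam_path, "params": params}


def run_alignment(config, fastq_path, bam_path):
    """Dispatch to whichever plugin the config names; same inputs and outputs for all."""
    name = config["aligner"]
    if name not in ALIGNER_PLUGINS:
        raise KeyError(f"unknown aligner plugin: {name}")
    return ALIGNER_PLUGINS[name](fastq_path, bam_path, config.get("params", {}))


print(run_alignment({"aligner": "stampy"}, "reads.fastq", "aligned.bam"))
```

Because every plugin shares the same signature (FASTQ in, BAM out), swapping the aligner is a one-word change in the config rather than a code change.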

"2. The version of a used tool in a box changes"

The config also contains paths to tools; importantly, version info is contained within each pathname.

"3. Parameters supplied to a tool change (bwa sampe for bwa sampe -s)"

These are in the config as well, and are passed to the relevant plugin.
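A config covering points 1 to 3, and the step of writing it to the output directory, might look something like the sketch below. The JSON format, the key names, the file name `pipeline_config.json` and the tool path are all assumptions for illustration; the answer does not say which format the config uses.

```python
import json
import tempfile
from pathlib import Path


def write_run_config(config, output_dir):
    """Save the exact config used for a run next to that run's results."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / "pipeline_config.json"
    # sort_keys gives a stable file, so configs from two runs can be diffed.
    target.write_text(json.dumps(config, indent=2, sort_keys=True))
    return target


# Hypothetical config: plugin choice, versioned tool path, per-tool parameters.
config = {
    "aligner": "bwa",
    "tools": {"bwa": "/path/to/bwa-v1.1/bwa"},
    "params": {"bwa_sampe": ["-s"]},
}

with tempfile.TemporaryDirectory() as output_dir:
    saved = write_run_config(config, output_dir)
    print(saved.read_text())
```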

"4. Order of processing changes (MarkDup before IndelRealigner or the opposite)"

ruffus generates nice dependency graphs of the tasks in the pipeline, so saving this as an SVG in the output dir records the workflow. Changing the order does require edits to the pipeline code, so that needs to be tracked somehow (see point 5).

"5. Other minor changes in the code that could slightly affect the behavior of pipeline"

Tracking version info with git handles this, though I should probably add commit info as you suggest for SVN. Now that I think about it, maybe cloning the current git HEAD to the output dir would be the most complete method for tracking.
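Recording the commit is a lighter alternative to cloning HEAD, and could be sketched as below. It assumes `git` is on the PATH and that the pipeline code lives in a git working tree; the function names and the output file name `pipeline_commit.txt` are hypothetical.

```python
import subprocess
from pathlib import Path


def current_git_commit(repo_dir="."):
    """Return the commit hash of HEAD in repo_dir, or None if it cannot be read."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, repo_dir is not a repository, or it has no commits yet.
        return None
    return result.stdout.strip()


def record_commit(output_dir, repo_dir="."):
    """Write the pipeline's commit hash (or 'unknown') into the output directory."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    target = out / "pipeline_commit.txt"
    target.write_text((current_git_commit(repo_dir) or "unknown") + "\n")
    return target
```

Note that a commit hash only describes committed code; uncommitted local edits would not be captured, which is one argument for the more complete clone-to-output-dir approach.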