Tracking The Version Of Third Party Tools Used
5
14
10.2 years ago

What are some good strategies for keeping track of which versions you're using of bowtie/tophat/cufflinks/samtools, etc.? I know you can just write down what you used, but...

We have an internal wiki where we track what's currently installed with dates, so that offers one way of tracking it.

One colleague suggested dotkit.

Does anyone have any other good suggestions on managing these multiple tools with different versions? What do you do if you get additional samples for a project that you did 6 months ago? Re-run everything with the latest version?

Crossposted at seqanswers (is that frowned upon?).

software • 2.3k views
1

Hi, I have changed the title, because it could have been confused with the problem of keeping version control of your scripts (e.g. see http://software-carpentry.org/4_0/vc/intro/)

0

Recent discussions on the bioinfo-core mailing list also turned up the encap package manager, though a lot of people seem to be using modules or symlinks.

http://www.encap.org/

My colleague Jim has also forked and improved a package manager called bio.brew that we've been using to streamline installations: https://github.com/vlandham/bio.brew

11
10.2 years ago
stuka ▴ 110

We use the modules system

http://modules.sourceforge.net/

Then the first lines of each job script load the needed software at the right version.
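
A job script header in this style might look like the sketch below. The module names and versions (`samtools/0.1.11`, `bowtie/0.12.7`) are assumptions for illustration only; what is actually available depends on how your site's modulefiles are set up. The guard around `module load` is just there so the script still runs on a machine without the modules system.

```shell
#!/bin/sh
# Tool versions pinned for this job.
# Hypothetical module names; check `module avail` on your own system.
PINNED_MODULES="samtools/0.1.11 bowtie/0.12.7"

# Load each pinned module if the `module` command is available,
# and echo what was requested so it ends up in the job log.
load_pinned_modules() {
    for m in $PINNED_MODULES; do
        if command -v module >/dev/null 2>&1; then
            module load "$m"
        fi
        echo "requested module: $m"
    done
}

load_pinned_modules
```

Because the pinned list lives in the (versioned) job script, the script itself doubles as the record of what was used.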

0

This looks like a viable option, thanks.

0

I've tried this now, and it's excellent. We're intending to use it to control our dev/test/production environments over time.

6
10.2 years ago

We keep enough metadata to reproduce an analysis in version control, along with the analysis protocol. I usually start a new git repository for anything but the smallest analysis.

In practice, this means:

• A description of the purpose of the analysis.

• Any ad hoc data sent to us by the customer e.g. spreadsheets that they attached to emails.

• The md5sums of any data sent to us.

• Versions of all software packages used, both third-party and our own (also in a version control repository).

• Versions of all external data used, possibly including md5sums.

• Descriptions of all steps executed, including command-line arguments.

• The md5sums of any intermediate data that are too big to keep.

All the above reside in an Emacs org mode file. We use org-mode's various features such as embedding code snippets in various languages, TODO lists, scheduling, and spreadsheet formulae to organise the analysis, pushing commits to the repository as work is done. Finally, we use its LaTeX export to create a nice PDF for the customer.

Any ad hoc scripts that are too big to embed in the document are saved in the repository too, even though they'll never be used for any other purpose. A lot of the real work is done in pipelines from which we save the pipeline version and all the batch queue logs.

We encourage the customer to take a clone of the git repository as we work too. Then if they come back for more work later, we have a common basis for picking up where we left off.
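
The md5sum bookkeeping from the list above could be sketched roughly as follows. This is not our actual tooling, just a minimal illustration; the function names, the scratch directory and the manifest file are all made up for the example, and it assumes a `md5sum` that supports `-c` (as GNU coreutils does).

```shell
#!/bin/sh
# Record a checksum for every file under a data directory.
# $1 = data directory, $2 = manifest file to create (absolute path,
# kept outside the data directory so it does not checksum itself).
write_manifest() {
    ( cd "$1" && find . -type f -exec md5sum {} + | sort -k 2 ) > "$2"
}

# Re-check the files against the manifest.
# Succeeds only if every recorded checksum still matches.
verify_manifest() {
    ( cd "$1" && md5sum -c "$2" >/dev/null )
}

# Demo on a throwaway directory standing in for customer data.
DATA_DIR=$(mktemp -d)
MANIFEST=$(mktemp)
printf 'ACGT\n' > "$DATA_DIR/sample1.txt"
write_manifest "$DATA_DIR" "$MANIFEST"
verify_manifest "$DATA_DIR" "$MANIFEST" && echo "checksums OK"
```

The manifest is a small text file, so it can be committed to the analysis repository even when the data themselves are too big to keep.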

0

This is impressive, but tiring. I sort of doubt everyone's going to be willing to do something like this.

0

Speaking from experience, it's not as tiring as trying to untangle old results when a reviewer asks clever questions at publication time. To play devil's advocate: people in the lab record this level of detail, so should they not expect us to do the same? In practice, much is automated, so the work is done by our silicon minions.

4
10.2 years ago

This is a good question, thanks for asking it.

First, check whether the tool you are using creates a log file, and whether this contains a reference to the version of the software used.
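
When a tool does not write such a log itself, one option is to record its self-reported version at the start of each run. The following is a minimal sketch, assuming the tool prints a version line for some flag; the flag differs between tools, and some releases only show the version in their usage text. `log_tool_version` and the log file are names invented for this example, and `bash` is used as a stand-in for a real analysis tool.

```shell
#!/bin/sh
# Append a timestamp, the tool name and the first line of the tool's
# version output to a log file.
# $1 = tool, $2 = flag that makes it print its version, $3 = log file.
log_tool_version() {
    tool=$1; flag=$2; logfile=$3
    {
        printf '%s\t%s\t' "$(date '+%Y-%m-%d %H:%M:%S')" "$tool"
        "$tool" "$flag" 2>&1 | head -n 1
    } >> "$logfile"
}

# Demo with bash standing in for the tool being tracked.
VERSION_LOG=$(mktemp)
log_tool_version bash --version "$VERSION_LOG"
cat "$VERSION_LOG"
```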

Second, if your tool does not occupy much disk space, you can include it in the same folder as your project, and then use version control software to keep track of the versions and the results. If you need to know more about version control software, see these tutorials.

Third, if your results are not extremely large, you can also use version control software to save the information about the versions used. When you make a commit, just describe the tools and versions used in the commit message. If your results are too big, you can track only one of the files, or only a log file. I recommend using hg, as it can handle big files and can be used with bitbucket, which offers unlimited hosting space.
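
As a rough sketch of the commit-message idea, a small helper can assemble the message from a list of tool/version pairs. The helper name and the versions shown are made up for illustration; the actual commit is left as a comment so that the snippet runs without a repository.

```shell
#!/bin/sh
# Build a commit message that lists the tool versions behind a set of results.
# $1 = summary line, remaining args = "tool=version" pairs from the caller.
build_commit_message() {
    summary=$1
    shift
    printf '%s\n\nTool versions:\n' "$summary"
    for pair in "$@"; do
        printf '  %s\n' "$pair"
    done
}

# Illustrative values only.
MESSAGE=$(build_commit_message "Re-run alignment with new samples" \
    "bowtie=0.12.7" "samtools=0.1.11")
echo "$MESSAGE"

# Then commit the tracked log or results with it, e.g.:
#   hg commit -m "$MESSAGE"
```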

To better illustrate the latter example, I have created a repository on bitbucket for you. A screenshot of it is below.

0

I really should get into more version control in general.

4
10.2 years ago

I guess I am seriously low tech. For rapidly changing tools, I merely organize my bin (and src) as follows:

~/bin/[software]-[version]/[binary]


Take samtools as an example:

~/bin/samtools-0.1.9/samtools
~/bin/samtools-0.1.10/samtools
~/bin/samtools-0.1.11/samtools
~/bin/samtools-0.1.12/samtools
~/bin/samtools-latest --> ~/bin/samtools-0.1.12/


I then make a symlink to the "latest" version in my bin and have my $PATH use ~/bin:

~/bin/samtools --> ~/bin/samtools-latest/samtools


This allows me to run ad hoc analyses using the latest on the command line. Then in my (versioned) Makefile or shell scripts for documented research pipelines, I use local environment variables to explicitly control and document what version is used.

export SAMTOOLS=~/bin/samtools-0.1.11/samtools
# step 1: grab proper pairs with MAPQ >= 20
$SAMTOOLS view -f 2 -q 20 BAM > out


Dreadfully low tech, but it works very well for me. I find that combining this sort of approach with versioned scripts on sites like GitHub Gist makes things very reproducible. For each project, I basically have a README that just points to my GitHub Gist.
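
For anyone who wants to try the layout without touching their real ~/bin, here is a small sketch that builds it in a scratch directory. The "samtools" files are tiny stand-in scripts that only print a version string, not real binaries, and the scratch location is an assumption of the example.

```shell
#!/bin/sh
# Build the [software]-[version]/[binary] layout in a scratch directory,
# using stand-in scripts instead of real samtools binaries.
BIN=$(mktemp -d)
for v in 0.1.11 0.1.12; do
    mkdir -p "$BIN/samtools-$v"
    printf '#!/bin/sh\necho "samtools %s"\n' "$v" > "$BIN/samtools-$v/samtools"
    chmod +x "$BIN/samtools-$v/samtools"
done

# Point "latest" at the newest version; -n replaces the link itself
# rather than creating a new link inside the directory it points to.
ln -sfn "$BIN/samtools-0.1.12" "$BIN/samtools-latest"
ln -sf "$BIN/samtools-latest/samtools" "$BIN/samtools"

# Ad hoc use follows the symlink; a pipeline pins an explicit version.
"$BIN/samtools"                            # runs the 0.1.12 stand-in
SAMTOOLS="$BIN/samtools-0.1.11/samtools"
"$SAMTOOLS"                                # runs the 0.1.11 stand-in
```

Upgrading then means unpacking the new version next to the old ones and re-pointing the one "latest" symlink; pinned pipelines are unaffected.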

0

By the way, thanks for asking this question. It's nice to know how others handle this problem.

0

We do something similar with "current" instead of "latest". I like your Makefile/shell scripts approach. I am really woefully behind on version control in general, I keep meaning to start using that.

Where I'm having problems right now is when one program calls another internally somewhere. One thing I did was to add a symlink to the version I want in the current directory and add the current directory to the front of my PATH temporarily, but this feels kind of insane and desperate, and I'm sure there are better ways I just don't know of.

0

Assuming the programs that make internal calls are scripts you have written, scripting languages can pull ENV vars at runtime. For example, in Perl:

#!/usr/bin/perl
my $samtools = $ENV{'SAMTOOLS'};


However, I gather you may be referring to less controllable situations such as Tophat calling Bowtie. This is doable as well, the authors would just need to allow the call to be read as an ENV var from the command line...
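
Until a tool offers that, a scoped variant of the PATH trick is one workaround: put the versioned directory at the front of PATH for a single command only. The sketch below uses made-up stand-ins (`caller` and two versions of `helper`) rather than real Tophat or Bowtie, and it only helps when the calling program finds its helper via PATH.

```shell
#!/bin/sh
# Stand-ins: two versions of a helper tool, and a caller that invokes
# whatever "helper" it finds on PATH.
WORK=$(mktemp -d)
for v in 1.0 2.0; do
    mkdir -p "$WORK/helper-$v"
    printf '#!/bin/sh\necho "helper %s"\n' "$v" > "$WORK/helper-$v/helper"
    chmod +x "$WORK/helper-$v/helper"
done
printf '#!/bin/sh\nhelper\n' > "$WORK/caller"
chmod +x "$WORK/caller"

# Prefixing the assignment to the command changes PATH for that one
# invocation only; the shell's own PATH is left untouched.
PATH="$WORK/helper-1.0:$PATH" "$WORK/caller"   # caller picks up helper 1.0
PATH="$WORK/helper-2.0:$PATH" "$WORK/caller"   # caller picks up helper 2.0
```

Written into a versioned script, the line with the PATH prefix also documents which version of the inner tool was used.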

3
10.2 years ago
Neilfws 49k

Before getting to good strategies, I thought I'd mention what I've seen people do in practice. There seem to be 2 common behaviours:

1. "Freeze" to a specific version and always work with that: then you can at least discount changes in the software as an explanation for anything unexpected
2. Always upgrade to the latest version as soon as it comes out, on the assumption that the latest = the best

As for a good strategy: I like the previous answers which suggest recording the software version in a version-controlled log. Another suggestion: where possible, have a small, easy to run "test case" which generates a small amount of output - you can then quickly inspect to see if a version change generated anything unexpected.
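
One way to sketch such a test case: run the tool on a small fixed input and compare a checksum of the output with one recorded from a trusted earlier run. Everything named here (`check_against_baseline`, the file names, the use of `sort` as a stand-in for the real tool) is an assumption for the example, and a checksum comparison only suits tools whose output is deterministic.

```shell
#!/bin/sh
# Run a command and compare the md5 of its output with a recorded baseline.
# $1 = file holding the expected md5, remaining args = command to run.
check_against_baseline() {
    expected_file=$1
    shift
    actual=$("$@" | md5sum | cut -d ' ' -f 1)
    if [ "$actual" = "$(cat "$expected_file")" ]; then
        echo "output unchanged"
    else
        echo "output differs from baseline" >&2
        return 1
    fi
}

# Demo with `sort` standing in for the real tool.
TEST_DIR=$(mktemp -d)
printf 'chr2\nchr1\nchr10\n' > "$TEST_DIR/input.txt"
# Record the baseline once, from a run you trust.
sort "$TEST_DIR/input.txt" | md5sum | cut -d ' ' -f 1 > "$TEST_DIR/expected.md5"
# Re-run after an upgrade to see whether anything changed.
check_against_baseline "$TEST_DIR/expected.md5" sort "$TEST_DIR/input.txt"
```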