Question

Forum: How to enable scientific software reproducibility?

2

9.6 years ago

martenson ▴ 380

To make a long story short, I would like to ask for your input on using Homebrew as a means of achieving better reproducibility in scientific software pipelines.

For more context, please see the issue in the official homebrew-science repo: https://github.com/Homebrew/homebrew-science/issues/1191

versioning software galaxy reproducibility • 1.8k views

updated 2.3 years ago by Ram 43k • written 9.6 years ago by martenson ▴ 380

0

moved to "forum"

9.6 years ago by Pierre Lindenbaum 161k

0

Brad Chapman is right when he brings up Docker. Docker is a much more complete solution for dependencies than Homebrew.

9.6 years ago by Jeremy Leipzig 22k

0

As I understand it, Docker serves as a container to run things. To build the container, you still need to use something like Homebrew anyway.

9.6 years ago by martenson ▴ 380

1

That's true; dependency fetchers can certainly be used by developers to build a container. Honestly, where the brew concept might be most attractive is for data. Right now I have scripts that are full of URLs like:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA12891/alignment/NA12891.chrom22.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam

Honestly, this is better maintained as a brew recipe by someone who works with the data:

brew install 1kg_ceu_trio_bams
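As a rough illustration of the idea (not an existing Homebrew feature), a named data recipe could be sketched in Python as a small registry that maps a dataset name to its URLs and an optional checksum. The registry `DATASETS`, the functions `install` and `sha256_of`, and the `fetch` parameter are all hypothetical names made up for this sketch; only the dataset name and the URL come from the comment above, and the checksum is deliberately left unset because it is not known here.

```python
import hashlib
import urllib.request
from pathlib import Path

# Hypothetical registry: one named "recipe" per dataset.
# The sha256 value is unknown here, so verification is skipped until it is filled in.
DATASETS = {
    "1kg_ceu_trio_bams": [
        {
            "url": (
                "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/"
                "pilot3_exon_targetted_GRCh37_bams/data/NA12891/alignment/"
                "NA12891.chrom22.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam"
            ),
            "sha256": None,
        },
    ],
}


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def install(name, dest_dir="data", fetch=urllib.request.urlretrieve):
    """Download every file of a named dataset and verify checksums when known.

    `fetch` is injectable so the download step can be replaced (for example in tests).
    """
    dest = Path(dest_dir) / name
    dest.mkdir(parents=True, exist_ok=True)
    paths = []
    for entry in DATASETS[name]:
        target = dest / entry["url"].rsplit("/", 1)[-1]
        if not target.exists():  # skip files that were already downloaded
            fetch(entry["url"], target)
        expected = entry["sha256"]
        if expected is not None and sha256_of(target) != expected:
            raise ValueError(f"checksum mismatch for {target}")
        paths.append(target)
    return paths
```

With something like this, a script would call `install("1kg_ceu_trio_bams")` instead of hard-coding the URL, and whoever maintains the registry would be the one to update paths when the server layout changes.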

9.6 years ago by Jeremy Leipzig 22k

Ram · Answer 1 · 2014-09-24

Reproducibility is also dear to my heart, though lately I have been asking myself what "reproducibility" even means.

Does it mean that we should be able to run a pipeline with the exact same versions of the programs and the exact same parameters and get the exact same answer no matter what? I have come to think that this type of reproducibility is not all that useful.

The reproducibility that I hope for from science is that of scientific observations and results.

I recall a paper (published in Nature) where, as it later turned out, choosing the size of the upstream region to be exactly 1000 bp was the critical parameter for all subsequent results. The study would not produce the same results with 900 bp or 1100 bp "upstream" regions. Basically, the genes seemed to be regulated by upstream binding only when "upstream" was defined as exactly 1000 bp ... that's some insight all right...

So perhaps the exact opposite is required: if a study cannot be reproduced by a similar but different approach, it is likely to be a case of overfitting.
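One way to make this kind of check concrete is a small sensitivity sweep: rerun the same count with slightly different definitions of "upstream" and see whether the answer survives. The sketch below is only an illustration under stated assumptions. The function names, the 20% tolerance, and the toy coordinates are all made up for this example and have nothing to do with the paper mentioned above; it also assumes forward-strand genes, so "upstream" means smaller coordinates.

```python
def upstream_hits(tss_positions, site_positions, window):
    """Count genes with at least one binding site within `window` bp upstream of the TSS.

    Assumes forward-strand genes, so upstream means smaller coordinates.
    """
    return sum(
        any(tss - window <= site < tss for site in site_positions)
        for tss in tss_positions
    )


def sensitivity(tss_positions, site_positions, windows=(900, 1000, 1100)):
    """Repeat the count for several window sizes."""
    return {w: upstream_hits(tss_positions, site_positions, w) for w in windows}


def is_robust(counts, tolerance=0.2):
    """Call a result robust if counts vary by at most `tolerance` (relative to the max)."""
    lowest, highest = min(counts.values()), max(counts.values())
    return highest == 0 or (highest - lowest) / highest <= tolerance


# Toy, made-up coordinates chosen so that the result depends on the window size.
tss = [5000, 12000, 20000]
sites = [4050, 11020, 19010]

counts = sensitivity(tss, sites)
print(counts, "robust:", is_robust(counts))
```

If the count swings widely across neighboring window sizes, as it does for these toy coordinates, the conclusion depends on the parameter choice rather than on the biology.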

I don't want to discourage the Homebrew integration, though. I think it would be very valuable, even essential. I wish we could easily install bioinformatics software with a single command whenever we want it. Being able to run a recently published tool and alternative approaches with ease would make the process of reproducing results so much simpler.

But what I absolutely don't think is necessary is to install an old version X of software Y just because someone used that version in their analysis some time ago. That basically says: let's ignore everything we have learned since then and rewind to a time when the software was worse than it is today, and when we knew less and expected less.

If a study only works with an old version of one particular piece of software, it is very likely not worth reproducing.