Question: Docker - one image to rule them all?
10
4.6 years ago by
Manuel370
Germany
Manuel370 wrote:

I am in the typical situation of needing a resequencing pipeline (i.e., FastQC, read preprocessing, FastQC again, alignment with BWA, variant calling). I need a stable pipeline with stable tools for the standard stuff (e.g., "single-donor WES variant calling", "trio WES variant calling", and "tumor/normal WES variant calling with somatic filtration"), but I sometimes also need more specialized functionality or more extensive downstream analysis.

I want to use Docker for isolating my tools against the uncertain, changing, and sadly oftentimes unversioned world of Bioinformatics software (I'm looking at you, vt and vcflib, but I'm still very grateful that you are around). What would be your recommendation for a best practice here:

  • one Docker image for everything, adding tools as I go
  • one Docker image for each pipeline step (e.g. combining BWA-MEM, samtools, samblaster for the alignment so I can use piping in a front-end script)
  • one Docker image for the standard stuff, then maybe some images for each additional step.

Does anyone know of a person/organization that has published their Dockerized pipeline setup in a blog post or elsewhere, in a way that goes beyond toy examples or "here is a Dockerfile for the tool that I wrote/published"?

Cheers!

docker • 3.2k views
modified 4.6 years ago by Jeremy Leipzig18k • written 4.6 years ago by Manuel370
7
4.6 years ago by
Tom210
Germany
Tom210 wrote:

Illumina BaseSpace (http://basespace.illumina.com) actually uses Docker to run its cloud-based pipelines, and they keep it down to the level of individual tools.

Docker was initially built and optimized to run "one" process efficiently, so it is advisable to keep images at the tool level. Also, if you think about pipeline building, you probably want a Lego-like toolbox from which you can assemble different workflows out of different (and often the same) tools.

The way Illumina solved this is that each Docker container has an input folder and an output folder. In BaseSpace you can actually publish your own tools as Docker containers.
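
The input/output folder convention can be sketched in a few lines of Python. This is only an illustration and not how BaseSpace is actually implemented: the image name `example/fastqc:0.11.9`, the mount points `/input` and `/output`, the host paths, and the helper `docker_run_command` are all assumptions made for this example. The helper only builds the `docker run` argument list, so it can be inspected without Docker installed.

```python
import os
from typing import List, Sequence


def docker_run_command(image: str, tool_args: Sequence[str],
                       input_dir: str, output_dir: str) -> List[str]:
    """Build a `docker run` argument list for one tool container.

    The host input folder is mounted read-only at /input and the host
    output folder is mounted writable at /output. Both mount points are
    assumptions of this sketch, not a Docker or BaseSpace requirement.
    """
    host_in = os.path.abspath(input_dir)    # bind mounts need absolute host paths
    host_out = os.path.abspath(output_dir)
    return [
        "docker", "run", "--rm",            # --rm: remove the container on exit
        "-v", f"{host_in}:/input:ro",
        "-v", f"{host_out}:/output",
        image,
        *tool_args,
    ]


# Hypothetical image name and paths; FastQC writes its reports to the -o directory.
fastqc_cmd = docker_run_command(
    "example/fastqc:0.11.9",
    ["fastqc", "-o", "/output", "/input/sample_R1.fastq.gz"],
    input_dir="/data/run1/fastq",
    output_dir="/data/run1/qc",
)
# To execute it: subprocess.run(fastqc_cmd, check=True)
```

Because every tool sees the same two folders, a front-end script can swap one tool image for another without changing how data is handed over.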

4
4.6 years ago by
London, UK
Giovanni M Dall'Olio26k wrote:

You can have a look at the NGSeasy pipeline by the KHP Informatics group in London. They have a GitHub repository and a Makefile that installs all the components of the pipeline. Most of the components are in separate containers, which makes installation and updates easier.

2
4.6 years ago by
Amos40
European Union
Amos40 wrote:

Hi Manuel,

Putting everything in one container is one option, but I think there are distinct limitations here, i.e., limited reuse and the size of the image. The Docker maxim is "one concern per container", and I think this works well in this context. And since shared layers are reused across images, you don't have too much overhead if you design your images in a hierarchical manner. Of course, separating tools into separate containers means passing input and output between them, either through STDIN/STDOUT or via shared volumes, and this can be a bit fiddly.
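
To make the STDIN/STDOUT option concrete, here is a minimal Python sketch that chains one container per tool with pipes. It is not how NGSeasy does it: the image names (`example/bwa:0.7.17` and so on), the host paths, the `/ref`, `/input` and `/output` mount points, and the helpers `containerized`, `run_pipeline` and `alignment_stages` are all hypothetical.

```python
import subprocess
from typing import List, Sequence


def containerized(image: str, tool_cmd: Sequence[str],
                  volumes: Sequence[str] = ()) -> List[str]:
    """Wrap a tool invocation in `docker run`; -i keeps STDIN open for piping."""
    cmd = ["docker", "run", "--rm", "-i"]
    for volume in volumes:                 # e.g. "/host/path:/container/path:ro"
        cmd += ["-v", volume]
    return cmd + [image, *tool_cmd]


def run_pipeline(stages: Sequence[Sequence[str]]) -> bytes:
    """Run the stages as a pipe and return the STDOUT of the last stage."""
    procs = []
    upstream = None
    for stage in stages:
        proc = subprocess.Popen(stage, stdin=upstream, stdout=subprocess.PIPE)
        if upstream is not None:
            upstream.close()               # the child now owns the read end
        upstream = proc.stdout
        procs.append(proc)
    output, _ = procs[-1].communicate()
    for proc in procs[:-1]:
        proc.wait()
    for stage, proc in zip(stages, procs):
        if proc.returncode != 0:
            raise subprocess.CalledProcessError(proc.returncode, list(stage))
    return output


def alignment_stages() -> List[List[str]]:
    """bwa mem | samblaster | samtools view, one (hypothetical) image per tool."""
    return [
        containerized("example/bwa:0.7.17",
                      ["bwa", "mem", "/ref/genome.fa",
                       "/input/sample_R1.fastq.gz", "/input/sample_R2.fastq.gz"],
                      volumes=["/data/ref:/ref:ro", "/data/run1/fastq:/input:ro"]),
        containerized("example/samblaster:0.1.26", ["samblaster"]),
        containerized("example/samtools:1.9",
                      ["samtools", "view", "-b", "-o", "/output/sample.bam", "-"],
                      volumes=["/data/run1/bam:/output"]),
    ]
```

Calling `run_pipeline(alignment_stages())` would start the three containers and stream SAM records between them, so only the first and last stage need a volume. With shared volumes instead, each stage would write a file that the next stage reads, which costs disk space but leaves intermediate results available for inspection.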

As Giovanni mentioned, take a look at NGSeasy (*disclaimer* I'm one of the authors).

But this is by no means the only game in town: see Nextflow, Rabix, and the work of Michael Barton at nucleotid.es. Bioboxes is also trying to build a specification for what a bioinformatics container should look like.

A plug for the docker symposium we are running towards the end of the year, bringing together various groups working in this space (keep an eye on the page, we'll be opening registration in May).

http://core.brc.iop.kcl.ac.uk/events/compbio-docker-symposium-2015/

 

Regards,

Amos


From looking at the NGSeasy README and source files, it's not quite clear to me yet how this pipeline can be run in a distributed computing environment (e.g. a Slurm cluster). Any comments on this?

written 4.6 years ago by Christian2.8k
1
gravatar for matted
4.6 years ago by
matted7.2k
Boston, United States
matted7.2k wrote:

The bcbio-nextgen pipeline does a good job encapsulating standard alignment tasks and tracking tool versions.  They have a fully Dockerized version that is designed to run on AWS.  I'll copy a snippet of their README to go over the benefits:

  • Improved installation: Pre-installing all required biological code, tools and system libraries inside a container removes the difficulties associated with supporting multiple platforms. Installation only requires setting up docker and download of the latest container.
  • Pipeline isolation: Third party software used in processing is fully isolated and will not impact existing tools or software. This eliminates the need for modules or PATH manipulation to provide partial isolation.
  • Full reproducibility: You can maintain snapshots of the code and processing environment indefinitely, providing the ability to re-run an older analysis by reverting to an archived snapshot.
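
If you assemble your own per-tool images, one way to approximate the "archived snapshot" idea is to record exactly which image each step used and to refuse floating tags. The sketch below is an illustration under stated assumptions, not bcbio-nextgen's mechanism: the image names, the `is_pinned` and `write_manifest` helpers, and the manifest layout are all made up for this example.

```python
import json
from typing import Dict


def is_pinned(image_ref: str) -> bool:
    """Return True if the reference has a digest or an explicit non-`latest` tag."""
    if "@sha256:" in image_ref:
        return True                         # a digest identifies exact image content
    name = image_ref.rsplit("/", 1)[-1]     # skip the registry host, which may hold ":port"
    if ":" not in name:
        return False                        # no tag: Docker falls back to `latest`
    tag = name.rsplit(":", 1)[1]
    return tag not in ("", "latest")


def write_manifest(images: Dict[str, str], path: str) -> None:
    """Write a step -> image mapping as JSON, rejecting unpinned references."""
    unpinned = sorted(step for step, ref in images.items() if not is_pinned(ref))
    if unpinned:
        raise ValueError(f"unpinned image for step(s): {', '.join(unpinned)}")
    with open(path, "w") as handle:
        json.dump(images, handle, indent=2, sort_keys=True)


# Hypothetical pipeline definition; archive the manifest next to the results.
PIPELINE_IMAGES = {
    "qc": "example/fastqc:0.11.9",
    "align": "example/bwa:0.7.17",
    "sort": "example/samtools:1.9",
}
```

A version tag can still be re-pushed with different content, so referencing images by digest is the stricter choice when an analysis has to be re-run unchanged.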
1
4.6 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

This is a great question.

Typically it's one process per container, which is why there is Docker Compose (previously known as Fig). However, Compose is geared toward spawning long-running instances (databases, web servers), not pipelining, so a new framework might be necessary.
