Question

Forum: Checking quality and reliability in bioinformatics

0

6.8 years ago

zizigolu ★ 4.3k

Hi everyone,

How can a bioinformatician ensure the reliability, quality and reproducibility of his/her work/analysis?

Thank you

reproducibility • 2.9k views

ADD COMMENT • link updated 14 months ago by Ram 44k • written 6.8 years ago by zizigolu ★ 4.3k

score 5 · Accepted Answer · 2017-10-10

5

6.8 years ago

Alex Reynolds 35k

Write a GNU makefile and publish it with a manifest of all the software used and their versions, along with all the inputs used to do the analysis. Use open-source software where the source code is freely available and there is a history kept of all changes to source in an open repository.
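As a rough illustration of the manifest idea (not the answerer's own method), here is a minimal Python sketch that records tool versions next to SHA-256 checksums of the input files. The function names `sha256_of` and `build_manifest`, the tool names and versions, and the sample file are all assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tool_versions, input_paths):
    """Combine tool versions and input checksums into one dictionary."""
    return {
        "tools": dict(sorted(tool_versions.items())),
        "inputs": {
            path: {"sha256": sha256_of(path), "bytes": Path(path).stat().st_size}
            for path in sorted(map(str, input_paths))
        },
    }


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        sample = Path(workdir) / "sample.fastq"  # placeholder input file
        sample.write_text("@read1\nACGT\n+\nIIII\n")
        # Tool names and versions below are made-up examples.
        manifest = build_manifest({"bedtools": "2.26.0", "samtools": "1.5"}, [sample])
        print(json.dumps(manifest, indent=2))
```

Publishing a file like this alongside the makefile lets a reader confirm they are starting from byte-identical inputs before comparing results.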

ADD COMMENT • link 6.8 years ago by Alex Reynolds 35k

1

Write a GNU makefile

Solid advice, but I guess any other workflow (such as snakemake) would do? Or are there important differences?

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

1

In fact, since Snakemake has built in conda integration I wonder if it's generally preferable over standard makefiles when it comes to reproducibility.

ADD REPLY • link 6.8 years ago by Devon Ryan 104k

1

I'm sure snakemake is a great tool, but GNU make has been around for literally decades, and Python has, in my experience, fragility issues that GNU tools very rarely introduce. Python is good for smaller, one-off analyses, like Perl. From a reproducibility standpoint, though, suppose I write a generic script and a major version release breaks backwards compatibility that I then have to troubleshoot, or I have to wait days or weeks for a sysadmin to devote time to figure out how to reconstruct the exact combination of scipy, numpy, Python, OS kernel, etc. on our cluster, so that they work together without API issues and other errors. Then I'd ask whether I would be able to easily reproduce the exact environment of an analysis down the road, without significant debugging and testing effort on my part or on the part of others. I'm sure people are making inroads on this so that these won't be issues in 10-20 years, but, respectfully, I'm honestly not sure we're quite there yet.
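To make "the exact combination of scipy, numpy, Python, OS kernel" concrete, one option is to write that combination down at analysis time. The sketch below uses only the standard library (`platform` and `importlib.metadata`, Python 3.8+). The function name `environment_snapshot` is an assumption for illustration, and recording versions does not by itself make the environment reconstructible.

```python
import platform
from importlib import metadata


def environment_snapshot(packages):
    """Record interpreter, OS and the installed version of each named package."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "os": platform.platform(),  # includes the kernel release on Linux
        "packages": versions,
    }


if __name__ == "__main__":
    for key, value in environment_snapshot(["numpy", "scipy"]).items():
        print(key, value)
```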

ADD REPLY • link 6.8 years ago by Alex Reynolds 35k

0

Sounds reasonable indeed - thanks for your insights!

how to reconstruct the exact combination of scipy, numpy, Python, OS kernel,

I guess most, but not all of these issues can be solved by using virtual (conda) environments, no?

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

2

For single-workstation environments controlled by an end user who is technically proficient, I'm sure that is easier to manage. But it is still effort to reproduce that environment and make sure it works. If you have a clustered environment, you need a sysadmin to manage the specific versions of dependencies required to make that all work, and you need to be able to have a way of deploying analyses to these virtual environments that is easy for others to reproduce.

If the question is about reproducibility, then I think simplicity is attractive and fragility and complexity are things to avoid, as a general philosophy.

ADD REPLY • link 6.8 years ago by Alex Reynolds 35k

1

When you start using clusters, it becomes easier but more specific. For example, our cluster uses modulecmd (module load, module list, etc.). For each module there are multiple versions, one of which is the default that gets loaded when no version is specified with the module load command. When I document a script, I explicitly use a version number even when I am using the default version. I also document my script and specify the dependencies in the text, so anyone who has to drill down to that level gets that information without having to work out which log files to read and how to deduce it from them.
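The habit of always naming a version in `module load` can be checked mechanically. Here is a small Python sketch that scans a shell script for `module load` lines and reports modules given without a `/version` suffix. It assumes the common `name/version` naming used by Environment Modules. The function name `unpinned_modules` and the module names in the example are hypothetical.

```python
import re

# Matches "module load <names...>" at the start of a line, ignoring a trailing comment.
MODULE_LOAD = re.compile(r"^\s*module\s+load\s+(.+?)\s*(?:#.*)?$")


def unpinned_modules(script_text):
    """Return module names that are loaded without an explicit /version suffix."""
    unpinned = []
    for line in script_text.splitlines():
        match = MODULE_LOAD.match(line)
        if not match:
            continue  # not a "module load" line (or it is commented out)
        for module in match.group(1).split():
            if "/" not in module:
                unpinned.append(module)
    return unpinned


if __name__ == "__main__":
    example = "module purge\nmodule load samtools/1.5 bwa\nmodule load python\n"
    print(unpinned_modules(example))
```

Run against a pipeline script before committing, a check like this flags loads that would silently pick up whatever the cluster's default version happens to be.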

ADD REPLY • link 6.8 years ago by Ram 44k

1

Modules are great and we use them. Adding a specific version number to the module load command is a great tip.

Using module purge is also a good way to "clean the slate" before running a pipeline. Our developers have been bitten by committing code that works locally, because they loaded a version of a module into their development environment, and their code fails in production when the rest of us use it because we're using other versions of that module. Purging modules can help keep this from happening.

Another complication is that modules can have dependencies, such that loading a module fails, if another module has not first been added. This can be specific to the lab's setup of these software packages.

But to some extent modules are self-documenting and can be a great way to run and compare multiple versions of tools.

ADD REPLY • link 6.8 years ago by Alex Reynolds 35k

0

I agree, module purge is a (scary-sounding) great way to start with a clean slate. True, modules have dependencies, but the choice is either to trust the HPC folks not to remove modules, or to record all the ENV variables and the PATH changes manually.

In fact, one of the ways I did this was by creating my own modulefiles (which loaded from the master modulefile) so I did not have to do a bunch of module loads each time.
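For the "record all the ENV variables and the PATH changes" route, a minimal Python sketch follows. It takes a snapshot of the environment and compares two snapshots, for example before and after loading modules. The function names `snapshot_environment` and `changed_variables` are made up for this example, and a real snapshot may contain secrets that should be filtered before it is published.

```python
import os


def snapshot_environment(environ=None):
    """Return a sorted copy of the environment with PATH split into entries."""
    environ = dict(os.environ if environ is None else environ)
    return {
        "path_entries": [p for p in environ.get("PATH", "").split(os.pathsep) if p],
        "variables": dict(sorted(environ.items())),
    }


def changed_variables(before, after):
    """List variable names whose values differ between two snapshots."""
    names = set(before["variables"]) | set(after["variables"])
    return sorted(
        name for name in names
        if before["variables"].get(name) != after["variables"].get(name)
    )


if __name__ == "__main__":
    current = snapshot_environment()
    print(len(current["variables"]), "variables,", len(current["path_entries"]), "PATH entries")
```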

ADD REPLY • link 6.8 years ago by Ram 44k

0

Sorry,

how could a bioinformatician demonstrate that he/she is flexible, with a high teamwork contribution and documentation ability?

ADD REPLY • link 6.8 years ago by zizigolu ★ 4.3k

0

Also, this is not your principal question - edit your top level post if you have multiple questions. You're starting off with reproducibility (a technical skill) and then moving on to flexibility and team-work (behavioral skills). What exactly do you want to know about?

ADD REPLY • link 6.8 years ago by Ram 44k

0

For example, while this recent question is not about snakemake, I think it suggests why keeping pipelines simple and as free of version dependencies as possible helps deal with the issue of fragility: Working on an old pipeline, need to gain access to a specific version of plinkseq

ADD REPLY • link 6.8 years ago by Alex Reynolds 35k

score 3 · Accepted Answer · 2017-10-10

While fancier options have been suggested below they may only be usable by fellow bioinformaticians. If you want to make your work generally accessible (to most anyone) then meticulous documentation may be sufficient. No detail should be considered too small/insignificant to include. Have a couple of people go through the worksheet to see if they can reasonably understand/follow your documentation/reasoning.

score 3 · Accepted Answer · 2017-10-10

3

6.8 years ago

Devon Ryan 104k

This seems timely: https://www.biorxiv.org/content/early/2017/10/10/200683

BTW, you'll need to open the PDF in a proper PDF viewer; the one in Firefox won't render things nicely.

ADD COMMENT • link 6.8 years ago by Devon Ryan 104k

score 3 · Accepted Answer · 2017-10-10

3

6.8 years ago

WouterDeCoster 47k

And certainly, don't forget to

[image: xkcd comic, not reproduced here]

ADD COMMENT • link 6.8 years ago by WouterDeCoster 47k

2

I've been working as a lonely bioinformatician for the past 6 or so years... great to see that other bioinformaticians have come to have the same sense of humour as I!

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k

0

All of us grew up with xkcd!

ADD REPLY • link 6.8 years ago by Ram 44k

0

Please help me figure out what "a flexible bioinformatician with a high teamwork contribution" means.

I interpreted that as being ready to adapt to any changes within the group, but even to me this explanation does not make sense.

ADD REPLY • link 6.8 years ago by zizigolu ★ 4.3k

0

Where are you getting that phrase from?

ADD REPLY • link 6.8 years ago by Ram 44k

2

Probably a job advertisement.

ADD REPLY • link 6.8 years ago by GenoMax 144k

0

Actually, it's not from a job advertisement; rather, I have been asked these questions many times, and each time I am at a loss to say what I think.

ADD REPLY • link 6.8 years ago by zizigolu ★ 4.3k

3

Something that comes to my mind (in a bioinformatics context) is that you could be flexible in your programming language. If the rest of the team uses Java - you learn to use Java as well.

Except if it's Perl, then you make everyone change to Python.

Teamwork could also be that you understand how collaborative programming in git works - using branches, pull requests, code review, ...

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

2

If you asked me that, I'd tell people I know how to pick the right tool for the right job and combine them to create a solution. If Python fits somewhere, I'd use it. If R does something really well, I'd use that. If Excel does the task better than other tools, I will not hesitate to use that either.

At any point in time, I am working with multiple senior scientists in my lab on various projects, as well as heading my own projects. <add specific examples here>. I balance priorities and get everyone's projects moving forward at a good pace.

When people ask you that, you always give specific examples. Start off with generic statements, but drill down to specifics and give details based on how people respond.

ADD REPLY • link 6.8 years ago by Ram 44k

1

I echo the giving-examples part. Make it easy on yourself and think upfront of an example which you can explain without hesitating.

If Excel does the task better than other tools, I will not hesitate to use that either.

Disgusting.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

When I want to use a

SELECT X,COUNT(*) AS 'mycount' FROM table GROUP BY X ORDER BY mycount DESC

on structured but not-in-a-DBMS data, I'd much rather use Pivot Table in Excel 2013 or above. Excel is my state-preserving calculator app.
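For comparison, the same GROUP BY / COUNT / ORDER BY can be scripted so that the step is recorded and repeatable. This is a minimal Python sketch using `collections.Counter`. The function name `count_by_column` and the toy CSV are assumptions for illustration. Ties come back in first-seen order here, whereas SQL leaves tie order unspecified.

```python
import csv
import io
from collections import Counter


def count_by_column(rows, column):
    """Count rows per value of `column`, most frequent first."""
    counts = Counter(row[column] for row in rows)
    return counts.most_common()


if __name__ == "__main__":
    # Toy data standing in for a structured, not-in-a-DBMS table.
    table = "X,Y\na,1\nb,2\na,3\nc,4\na,5\nb,6\n"
    rows = list(csv.DictReader(io.StringIO(table)))
    for value, mycount in count_by_column(rows, "X"):
        print(value, mycount)
```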

ADD REPLY • link 6.8 years ago by Ram 44k

1

Sure, every tool has its place in bioinformatics. And for Excel that place is the corner of shame :-D

In all seriousness, Excel is great, but not being fully reproducible is a serious issue.

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

2

In fairness, at least Excel corrupts bioinformatics datasets in reproducible ways.

ADD REPLY • link 6.8 years ago by Alex Reynolds 35k

0

Thank you guys. Honestly, without your answers it's very unlikely I would have found the right answers by googling any time soon. I googled a lot and watched some interview videos on YouTube, but in the end I found my answers on Biostars, as always.

ADD REPLY • link 6.8 years ago by zizigolu ★ 4.3k

0

The explanation is my own, but the question is one that I have been asked frequently, and I could not find a clear explanation by googling.

ADD REPLY • link 6.8 years ago by zizigolu ★ 4.3k

0

I would like to mention git-lfs (https://git-lfs.github.com/) for large files (BAMs, VCFs, ...), which are very typical of bioinformatics pipelines and cannot be handled by ordinary git. It applies more to analysis than to software development.

ADD REPLY • link 6.8 years ago by microfuge ★ 1.9k

1

Yes - I use git for scripts, configuration files, commands, parameters and summary results. Indeed, you don't want to put large bams on standard git.
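One way to keep large files out of ordinary git is to look for them before committing. The Python sketch below walks a project directory and lists files at or above a size threshold, skipping the `.git` directory. Such files would be candidates for git-lfs or for storage outside the repository. The function name `large_files` and any threshold you pass are assumptions, not a recommendation from this thread.

```python
import os
import tempfile
from pathlib import Path


def large_files(root, min_bytes):
    """Return sorted (relative path, size) pairs for files of at least min_bytes."""
    root = Path(root)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune git's own object store so it is not reported.
        dirnames[:] = [d for d in dirnames if d != ".git"]
        for name in filenames:
            path = Path(dirpath) / name
            size = path.stat().st_size
            if size >= min_bytes:
                found.append((path.relative_to(root).as_posix(), size))
    return sorted(found)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as workdir:
        (Path(workdir) / "aln.bam").write_bytes(b"\0" * 4096)  # placeholder file
        (Path(workdir) / "run.sh").write_text("echo hi\n")
        print(large_files(workdir, min_bytes=1024))
```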

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

2

Put a directory of bam files under git
Index them
Rename indexes to the actual bam names
Run git diff
Go on (permanent) vacation

ADD REPLY • link 6.8 years ago by Ram 44k

0

You forgot to set a (permanent) out of office message!

ADD REPLY • link 6.8 years ago by WouterDeCoster 47k

0

Publishing inputs is probably important if you're publishing performance test results. When reading claims about tools performing faster or more accurately than existing tools, that could just as well be a narrow consequence of the inputs used for testing that are optimal to that toolkit, as much as how the tests were performed. A makefile documents the latter, but cannot address the former concern.

ADD REPLY • link 6.8 years ago by Alex Reynolds 35k

score 3 · Accepted Answer · 2017-10-10

3

6.8 years ago

Kevin Blighe 88k

My Scottish friend / colleague Mick Watson gives good advice here: http://www.nature.com/nbt/journal/v31/n11/full/nbt.2740.html?foxtrotcallback=true

(I'm Irish ... the 'same' as being Scottish).

ADD COMMENT • link 6.8 years ago by Kevin Blighe 88k

1

By the way, in one interview I was asked whether I wanted to be a biologist or a computational biologist. Keeping in mind that I don't know programming, I replied: a biologist who knows many things in bioinformatics. However, I was rejected. In this article I read that a computational biologist does not necessarily need to be a programmer. A really encouraging article, thank you.

ADD REPLY • link 6.8 years ago by zizigolu ★ 4.3k

2

I think that your response was good, but maybe the employer wanted a more definitive answer.

There is still a lot of misunderstanding about what a bioinformatician (or computational biologist) does. Whilst this misunderstanding exists, you will always find employers with varying opinions on what you should be doing. As you get more senior, they eventually entrust you with all sorts of things covering statistics and simple data analyses - you'll be expected to understand the biology too (or pick it up quickly).

I gave a few presentations in the past on bioinformatics and I have always made the argument that everyone is a bioinformatician on some level: If they analyse any type of biological data, then that's bioinformatics on a fundamental level.

ADD REPLY • link 6.8 years ago by Kevin Blighe 88k