How Do We Discourage Ad-Hoc Bioinformatic Analyses?
9
29
13.9 years ago

One habit I have noticed in a lot of labs and cores is the virus-like proliferation of one-off ad-hoc analyses.

This is probably a bigger problem in bioinformatics than in most software shops, given that the development is experimental and non-commercial in nature and the languages are typically high-level. Still, this habit is detrimental to labs in terms of time wasted, lack of continuity when people leave, poor versioning, and difficulty in scaling up when transitioning to a high-throughput production environment.

What are some ways we can discourage the ad-hoc stuff in 2011 and encourage ourselves and others in our organization to develop analyses that are:

  1. Reproducible
  2. Reusable
  3. Modular
  4. Documented
software • 8.1k views
17

What about encouraging grant agencies to support long-term software development first?

6

1

That's a good article.

If something works great and you don't care about how it works, that's one thing. But I've gained a much deeper understanding about things that I have reimplemented, whether or not my implementation ended up being useful. If you use a tool a lot, it's helpful to know what situations it works well for and where or why it can fail. Much of this you won't learn by observation or reading papers, but quickly becomes obvious when building your own implementation.

3

Without long-term software support, why should graduate students and PIs dedicate the time necessary to build modular, documented and reusable software? The PIs need to publish and the graduate students need to graduate. I doubt that most graduate student committees would consider software improvements a useful contribution to a PhD thesis.

24
13.9 years ago

At the risk of provoking some people: why should we discourage ad-hoc bioinformatic analyses?

In my opinion ad-hoc analyses - be they in bioinformatics or not - are a central part of doing science. You have a question, you think up a way to answer that question, you answer it, you are done.

Speaking from the viewpoint of a PI, you are thus really wasting your time and my money if you spend extra effort on making your software reusable. Except, of course, if your software is able to do a sort of analysis that we often need in my lab.

The other big exception is if the goal is to develop a community resource. Here, the starting point is something completely different: we are not trying to test a specific scientific hypothesis, we are trying to develop a tool that will hopefully be widely used. However, in this case we are fundamentally entering the realm of engineering rather than science. Not that there is anything wrong with that.

To put it in the words of Russ Altman: "[...] it is useful to know when you are doing biology and when you are doing something else." (see his blog post on a closely related topic.)

8

Jeremy, I completely agree that such scripts cannot be reused and refined. They were never meant to be, and they should not be. Where I disagree completely with you is that they become a burden: I simply ignore their existence, so they are no burden. The way I see it, we are looking at a trade-off here: surely you can waste time on constantly reinventing the wheel because people do not write reusable software, but you can also waste incredible amounts of time on writing reusable software that is nonetheless never reused because the job only needed to be done once.

3

Excellent points. This is the fun challenge of software engineering in bioinformatics. Starting with ad-hoc, one-off analyses, how can you extract reusable code and functionality that make it faster for you, and others, to do new analyses in the future? This is analogous to problems in web development, where everyone is building their own custom website but makes use of productivity-enhancing frameworks like Rails.

1

It is true that many scripts are one-off by nature, but it is also true that a good programmer tends to write reusable modules without investing additional time, although one may argue that training to be a good programmer itself takes time and may not be feasible for everyone. In my view, ad hoc analyses cannot be eliminated, but they can be reduced. We should not take them for granted.

1

Joachim, I agree that typically a script will after a short while have problems redoing an analysis due to, in particular, changes in file formats. Fortunately, in terms of reproducibility, what matters is if it can redo the analysis on the same input data. This is why I always try to make sure that both the scripts AND the input data for them are archived. That way you can ensure reproducibility without having to maintain everything you ever made. It is also why I try to avoid unnecessary external dependencies - they tend to break over time.
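
As a rough illustration of archiving scripts and input data together, here is a minimal Python sketch that records a checksum for every file in an archive directory and later reports anything that is missing or has changed. The directory layout, the `MANIFEST.json` name and the function names are assumptions made for this example, not something described in the comment above.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(archive_dir: Path, manifest_name: str = "MANIFEST.json") -> dict:
    """Record a checksum for every file under archive_dir (hypothetical layout)."""
    manifest = {
        p.relative_to(archive_dir).as_posix(): sha256_of(p)
        for p in sorted(archive_dir.rglob("*"))
        if p.is_file() and p.name != manifest_name
    }
    (archive_dir / manifest_name).write_text(json.dumps(manifest, indent=2))
    return manifest


def verify_manifest(archive_dir: Path, manifest_name: str = "MANIFEST.json") -> list:
    """Return the relative paths that are missing or whose contents changed."""
    recorded = json.loads((archive_dir / manifest_name).read_text())
    return [
        name
        for name, checksum in recorded.items()
        if not (archive_dir / name).is_file()
        or sha256_of(archive_dir / name) != checksum
    ]
```

Calling `write_manifest` once when the analysis is finished and `verify_manifest` before rerunning it is one way to confirm that the scripts are still being run against the same input data.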

0

Ad hoc bioinformatics sounds all well and good until you enter a lab or core that has hundreds of little disconnected and undocumented scripts. Each of these scripts served some purpose once but if they cannot be reused or refined they become a burden rather than a resource.

0

Ad-hoc analysis scripts are very useful, but essentially any software that is not maintained or at least documented will drift off into oblivion. Let's face it: most ad-hoc scripts do not even run on the original problem a couple of months later, because the input data format changed, libraries changed, paths on the server changed, or something else trivial happened.

If you write throwaway scripts, then that is fine. If we do research like this, then we should stop talking about the reproducibility of results in science -- because that will be nearly impossible.

0

Okay, Lars, you can certainly do it like this and I would consider it an excellent approach to maintain reproducibility (keeping scripts & original data together). I think in that particular case it is possible to keep documentation/comments regarding the script at a minimum.

I think documentation mainly becomes an issue if you cannot keep the original input-data (e.g. in cases where it requires too much storage space over time), or you simply lack the discipline to archive scripts/data consistently.

13
13.9 years ago
Joachim ★ 2.9k

This is a very delicate topic and I fully agree that many labs seem to be run in a very laissez-faire way. I would say that the most interesting research has to be done in a playful manner, where the researcher is granted enough free room to test and verify his/her hypothesis. Unfortunately, from there on, once an idea culminates in publishable results, a patchwork approach kicks in to wrap up the results in a publication without any thought for reproducibility or extensibility.

As Pierre pointed out, there are tools and websites out there that help you manage a project. Most of these tools/websites are free, so money cannot be the issue here. The crux of the matter is, in my opinion, much simpler: bad management.

From having seen several working environments, I would say that you only get reproducible, reusable, modular and documented analysis by running a team of people with an iron fist. Whilst that sounds horrible, it can be ironically a rather pleasant environment to work in, because:

  • everyone knows his/her own responsibilities
  • people in the team know what their colleagues are doing
  • you can actually look things up in the wiki/content-management software/project-management software system, because everyone uses it
  • there is a goal to achieve and you and your colleagues are on to it

Besides that, bad research/development practice is also encouraged by the grant process. Most grants do not cover running costs, such as the continued maintenance of web services or online databases. Essentially, there is no incentive to keep working on a project that has no (monetary) future.

In a couple of years, chaotic approaches to bioinformatics might become unviable as soon as funding is cut and/or more stringent submission guidelines are put in place. Only recently, I read this post, which might (or might not) change the outstanding problems in bioinformatics.

Until then, there is no other way than to harass those who are in charge to inform themselves about:

But again, it is more important that the people involved in a project understand that good management can help, rather than seeing it as an annoyance. After all, everyone could start right now with SVN/git, Bugzilla, and a wiki... and yet, only a few people use these simple tools.

EDIT: Dummy edit for freesci. Thanks for letting me know, but I would have not worried about it. :)

0

Joachim, I accidentally down-voted your answer, and before I realized it, it was too late ("vote too old to be changed"). Can you edit your answer? I could change my vote then.

8
13.9 years ago
lh3 33k

To me, program reusability and result reproducibility are two different issues.

The necessity of program reusability is group-dependent, project-dependent and person-dependent. To make a program reusable, we need to put quite a lot of effort into both improving our programming skills and designing/documenting the program. Not everyone can become, or has the time to become, a programmer good enough to achieve this. Even good programmers have to evaluate whether making a program reusable is worthwhile: if the program is one-off by nature, investing more time only leads to more waste.

While reusability definitely helps reproducibility, result reproducibility can be achieved without program reusability. In my opinion, we can greatly improve reproducibility if we a) keep all the intermediate scripts and b) keep enough intermediate results. Moving from the raw data directly to the final results is difficult, but moving from one step to the next is much simpler. If every published paper were accompanied by enough intermediate information, our lives would be much easier.

EDIT: putting programs under SVN/etc also helps.
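
One possible way to picture "keep enough intermediate results" is the small Python sketch below, which runs a list of steps in order and writes the output of each one to its own numbered file. The `run_steps` helper, the file naming scheme and the two toy steps are hypothetical stand-ins for real analysis stages, not anything taken from this answer.

```python
from pathlib import Path
from typing import Callable, List, Tuple

# A step is a (name, function) pair; the function maps text to text.
Step = Tuple[str, Callable[[str], str]]


def run_steps(raw_text: str, steps: List[Step], out_dir: Path) -> str:
    """Apply each step in order, saving every intermediate result to out_dir."""
    out_dir.mkdir(parents=True, exist_ok=True)
    current = raw_text
    for index, (name, func) in enumerate(steps, start=1):
        current = func(current)
        # One numbered file per step, so each stage can be inspected on its own.
        (out_dir / f"{index:02d}_{name}.txt").write_text(current)
    return current


# Hypothetical toy steps standing in for real analysis stages.
def drop_comments(text: str) -> str:
    """Remove lines that start with '#'."""
    return "\n".join(line for line in text.splitlines() if not line.startswith("#"))


def uppercase_sequences(text: str) -> str:
    """Convert all remaining text to upper case."""
    return text.upper()
```

With the intermediate files kept next to the scripts, a reader who wants to check the second step only needs the output of the first one, rather than the whole path from raw data to final result.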

3

@apfejes: a local source control system (such as SVN) can be used by more than one developer. A more public system is only of interest if there is a wish to get external contributions.

1

I agree with you both. Firstly, even if you do not intend to get external contributions, releasing the source code under public version control may attract other developers who will offer unexpected but useful help that you cannot get otherwise. I have benefited from this myself. On the other hand, if you feel uneasy about disclosing the source code, even putting your own scripts in a private repository (e.g. on your own laptop) will help.

0

I agree with everything in the above comment, but I think it's not only important to get it into an SVN, but to get it into a publicly accessible SVN. An SVN behind closed doors does help with the development process, but doesn't benefit anyone but the programmer.

6
13.9 years ago
apfejes ▴ 160

This is something I've spent a lot of time thinking about, since I've written a fair amount of bioinformatics software. Unfortunately, there are few if any rewards for doing this in an academic lab - thesis committees, fellow academics and researchers aren't interested in re-usable software, they're only interested in results.

However, there is one angle that I often use: I believe reusable software leads to less error-prone analyses. Even better, software with more users and more developers can lead to better quality software, which makes analysis more reliable.

This mentality is visible in the open source community, in sayings such as Linus's Law (formulated by Eric S. Raymond), which states that "given enough eyeballs, all bugs are shallow".

Ideally, moving towards open source bioinformatics would make a significant difference in reducing redundant code, creating modular libraries that everyone can inspect and reuse, and of course, would create a fast way to obtain reproducible results. As far as documentation goes, opening the code up would give developers access to the underlying information about the mechanics. It's a win-win situation, really.

However, the fundamental problem is getting buy-in from the community. I also have a long list of reasons why academics won't join open source code projects, but that's probably a topic for another day.

0

apfejes, I'd love to hear your long list of reasons about why you think academics won't join these open source code projects.

5
13.9 years ago

Isn't that the aim of http://www.myexperiment.org and/or http://www.taverna.org.uk/ ? But I don't think it has been successful...

4
13.9 years ago

Two points might be mixed here:

  • reproducibility / documentation (in the sense of what was done in a given analysis)
  • reusable, modular, documentation (in the sense of code documentation for software development)

The first point is part of good practice in all science fields and is something that can keep us debating for a long time; the documentation part is the overlapping part, and there are plenty of nice ideas coming from literate programming (R's Sweave, for example).

A significant fraction of bioinformaticians are not software developers at heart. This is not judgmental as this has both good and bad aspects to it, but a bad aspect is that tools from software development that would help with reproducibility are largely unknown or looked down upon.

Regarding the second point, bioinformatics work can often be placed on a plane with two axes:

  • Help answer questions of biological interest in a given dataset
  • Develop tools that help answer questions of biological interest (this can be theoretical work, or implementation work).

[Edit: Lars posted while I was writing - Russ Altman's blog refers to something similar. However, I disagree with the wording because of the material published in the Journal of Computational Biology: the journal focuses on methods, to quote them "The Journal publishes peer-reviewed papers focusing on novel, cutting-edge methods in computational biology and bioinformatics."]

Developing tools can occur during some of the larger projects, but as was nicely said in other answers, developing good modularity and extensive documentation does not come without a cost. If biology is the main interest, modularity, reusability and maintainability go down the list of priorities.

Functionality is something that commercial software development wants to freeze as early as possible because late changes cost a lot, whereas when doing research one often does not know exactly what will be found and what will be most useful. Also, software development shops that aim at support and extended development cycles have developed rules such as strong guidelines for writing code (naming rules for variables, documentation rules) and a preferred development methodology. In fact, distinct people can be working on software design while others implement the design.

As long as there is no penalty for writing unmaintainable code (we all understand what selection pressure means, right? ;-) ), such practices will not be widely adopted. Other answers mention that getting funding for support would correct this; I do not think it would, at least not completely, without a competitive advantage for doing so. The metric of success for grants in the short term is typically publication(s) (the more prestigious the journal(s) the better), and possibly number of citations in the long term. Well-supported software would help the number of citations... if only software were cited whenever it is used (personally I can't complain too much, but take the biopython project for example: 10 years of existence and continued support for twice ~50 citations. I'd be happy to hear from the biopython folks how many times it was downloaded in the meantime).

Finally, I do not think that the hypothesis that the relatively poor state of software in some places is to be attributed to the non-commercial nature of the work or the fact that "high-level" languages are used holds. There are many open source projects that are handled impressively well, and there is plenty of unmaintainable code in the business implemented in "low-level" languages.

To get more modular and reusable code:

  • train people in software design and software engineering (or apply selective pressure and only hire people with those skills).
  • cite more software (number of citations can eventually mean something to funding agencies)

0

Regarding the Biopython citations, if you are looking at the Application Note, it was published when Biopython was already about 10 years old. During the first 10 years there were at least 150 citations (manually compiled list using Google etc). BioPerl did well to get citable manuscripts out early. Even now that there is a 'proper' Biopython paper to cite, people using Biopython don't always cite it. This is a general issue though.

I'd love to know download counts for Biopython, BioPerl, etc, but that is difficult to count due to things like packages in Linux distributions.

0

I did a crude Google Scholar search to get that number, which does not appear to be so far off from the manually compiled one (I got 2x50=100 vs "at least 150 over 10 years"). Exact download counts are certainly very hard to get, but taking the ones on the biopython site would already give a lower estimate. I suspect that there are several thousand per year there, and put in perspective with the "at least 15 (150 / 10) citations per year" it would indicate that the project does not get the credit it deserves.

2
13.8 years ago
brentp 24k

Great question, and answers so far -- especially the one by @Lars that raises the point that maybe we shouldn't be discouraging ad-hoc analyses.

Ad-hoc analyses are how we discover new things or create new pipelines. The question is then how to turn the perfunctory scripts into something that can be used by someone outside of the lab (or the author's head) to reproduce results or run their own pipeline.

I have found that using revision control is very helpful for this. That, combined with the "social coding" aspects of sites like GitHub or Bitbucket, makes it more likely that I will take a simple script, give it a reasonably intuitive command-line interface, some documentation (and maybe even tests) and put it up on a site where it can be reused. With those sites, I can also "follow" users and see what they are working on, so I'm more likely to know if they've already created something to do the otherwise one-off analysis I'm about to do.

This isn't a complete solution, but I think that having a simple social incentive like that to make stuff more usable can go a long way. Larger labs could offer some sort of internal incentive to promote better documentation and interfaces (better code is another story).

1
9.7 years ago

I think Shaun Jackman's 101 Q's answer pretty much sums up my feelings about this (with the possible replacement of Make with Snakemake)

Use Make to automate every analysis pipeline. No pipeline is too small or too large. A one-off analysis never is.
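
For readers who have not used Make, the core rule it applies (rebuild a target only if it is missing or older than one of its dependencies) can be sketched in a few lines of Python. This is only an illustration of that rule, not a replacement for Make or Snakemake; the `is_stale` and `build` names are made up for the example.

```python
from pathlib import Path
from typing import Callable, Iterable


def is_stale(target: Path, dependencies: Iterable[Path]) -> bool:
    """True if target is missing or older than any of its dependencies."""
    if not target.exists():
        return True
    target_mtime = target.stat().st_mtime
    return any(dep.stat().st_mtime > target_mtime for dep in dependencies)


def build(target: Path, dependencies: Iterable[Path], recipe: Callable[[], None]) -> bool:
    """Run recipe only when target is stale; return whether it ran."""
    if is_stale(target, list(dependencies)):
        recipe()
        return True
    return False
```

Even a one-off analysis written this way records which inputs produce which outputs, which is much of what a Makefile or Snakefile gives you for free.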

0

Snakemake all the things

0
9.7 years ago

Great question. It all comes down to the objective of the project and the long-term goals of the person handling it. I agree that documentation is necessary. But there are many labs that focus only on the products they need; maybe they will never repeat the experiment or the analyses again. PIs lack time, and everyone is busy pursuing their own goals. Still, your four points are necessary in any walk of life, from cooking to repairing a notebook. They should be encouraged from childhood so that new researchers follow them. Systematic working and organization of the lab from day one are must-have skills for newer labs.
