Question

Forum: How Has Bioinformatics Improved Over Time?

12

11.2 years ago

Kate ▴ 370

The recent popularity of the article discussed here has led me to debate what many of my pure computer science friends see as big problems in bioinformatics—poor code quality, lack of testing and documentation, and unreliability. I said that things generally seem to be getting better over time, but they said they keep hearing about vague "improvement" in bioinformatics with no concrete examples.

As just an undergrad (in CS and Molecular Biology) with a short history in the field, I couldn't come up with much to say. So I turn to you.

Can you give any specific examples of how bioinformatics has improved over time? From specific tools and file formats to general practices?

Thanks for your time!

bioinformatics • 8.9k views

ADD COMMENT • link updated 14 months ago by Ram 43k • written 11.2 years ago by Kate ▴ 370

7

I am wondering if it really makes sense for bioinformatics to universally include the overhead associated with professional quality code. In some cases where the software is being used by hundreds, if not thousands of people, I guess the answer is obviously yes. However, I think a lot of ideas in science never go anywhere and it seems to me that it makes sense to write most code in whatever way makes sense for your small research group (as small proofs of concept) and then refactor the code later if the idea proves viable. There is an opportunity cost between exploring new ideas and exploiting those ideas to their fullest through professional quality implementations.

For myself, I would find it hard to do documentation, commenting, version control, testing etc for more than maybe one or two projects.

Anyway, I think measuring bioinformatics in terms of the quality of code is maybe not a great measure. It misses the science which is the accumulated knowledge about what algorithms and approaches apply to various biological problems. It's a little bit like measuring mathematical physics in terms of the quality of mathematics, while leaving out the advances in physics.

I think in addition to talking about the quality of the informatics, we should ask, what biology have we learned from bioinformatics that we would not have learned otherwise? How much (if at all) has bioinformatics increased the rate at which our knowledge about biology grows?

ADD REPLY • link 11.2 years ago by KCC ★ 4.1k

4

Note that Kate raised the question in the context of Fred Ross' rants on bioinformatics and the related Hacker News discussions. I guess one purpose of the question is to persuade more CS students and dedicated programmers to work on bioinformatics. For many of them, their interest is not in biology, at least not initially; their interest in bioinfo is to apply their skill sets to practical problems. If the bioinfo community thinks algorithms, code quality etc. are not important, what is the point of a CS student or a professional programmer joining the field? Wouldn't a biologist with basic training in programming be able to solve most of the problems? From Fred's posts and CV, his primary interest seems to be programming. My feeling is that his frustration mainly comes from the failure to pursue that interest. He is not alone: although most of the answers/comments here rightfully indicate his post is severely biased, the post still gets 11 upvotes (personally, I only upvote a post to show support for its content); his opinion on the programming aspect is also echoed by several people on Hacker News.

Maybe we can say "haters gonna hate", which is probably true, but Fred's post and, more importantly, the responses still make me think: how much room is there for CS students and dedicated programmers? Do we want to get more of them involved? On the interaction with computer scientists and professional programmers, is our bioinfo community moving in the right direction? If it is not perfect, can we do something on our side (without asking to reform the funding system, which some have argued is the root of all evil) to improve the current situation? I haven't found answers myself.

PS: Having finished the comment, I realize it is too long. But it does not answer Kate's question, so I will leave it here. I am sorry for the long mess without a clear conclusion.

ADD REPLY • link 11.2 years ago by lh3 33k

0

But isn't maintaining proper code analogous to sending reagents/organisms for a wet lab biologist? Is there really that much more overhead in support/refactoring code compared to keeping hundreds of stocks of flies/zebrafish/hybridomas?

ADD REPLY • link 11.2 years ago by Damian Kao 16k

3

I realized when I finished writing my response that I don't have an answer to the question of: who is going to do the refactoring? I worry that fixing or improving old code has no obvious reward in the ecosystem of "publish or perish".

ADD REPLY • link 11.2 years ago by KCC ★ 4.1k

score 14 · Answer 1 · 2013-01-28

Well I think the examples are so numerous that it is hard to know where to start.

For example, people used to process read alignments in the so-called ELAND format, which just happened to be whatever output the CASAVA pipeline produced. Adding insult to injury, there was a normal ELAND, a sorted ELAND and an extended ELAND, each with its own quirks and inconsistencies. Tools only worked on certain outputs, so you had all kinds of strange converters from one mapping format to another. Today we have the SAM standard.
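To illustrate what a single standard buys you, here is a minimal sketch of reading one SAM alignment line. It relies only on the 11 mandatory tab-separated fields of the SAM format; the names `SamRecord` and `parse_sam_line` are made up for this example, and real work would normally go through an existing library rather than hand-rolled parsing.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    """The 11 mandatory SAM fields (optional tags are ignored here)."""
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str

    @property
    def is_unmapped(self) -> bool:
        # Bit 0x4 of FLAG marks a read that has no alignment.
        return bool(self.flag & 0x4)


def parse_sam_line(line: str) -> SamRecord:
    """Parse one alignment line (not a header line starting with '@')."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 fields")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)
```

Because every aligner writes the same columns, the same few lines work on output from any of them, which was not true in the days of per-pipeline formats.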

FASTQ quality encodings used to be all over the map; one never knew for sure which encoding the data were in. Today it is all Sanger encoding.
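To make the encoding problem concrete: a FASTQ quality character stores a Phred score as an ASCII code plus an offset, 33 in the Sanger convention and 64 in some older Illumina pipelines. The sketch below is only an illustration; `guess_offset` and `decode_quality` are hypothetical names, and the guess is a heuristic that can be wrong for reads whose qualities are all high.

```python
def guess_offset(quality: str) -> int:
    """Heuristically guess the ASCII offset of a FASTQ quality string.

    Characters below ';' (ASCII 59) cannot occur in the offset-64
    variants, so seeing one implies the Sanger offset of 33.
    Otherwise assume 64, which is only a guess.
    """
    return 33 if min(map(ord, quality)) < 59 else 64


def decode_quality(quality: str, offset: int = 33) -> list[int]:
    """Convert a quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


# Example: the same score of 40 is 'I' with offset 33 but 'h' with offset 64.
sanger_scores = decode_quality("I5!", offset=33)
old_illumina_scores = decode_quality("hhhh", offset=64)
```

That every tool once had to carry a guessing step like `guess_offset` is exactly the kind of friction a single agreed encoding removes.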

When bowtie introduced the Burrows–Wheeler transform for aligning reads (not sure if they were the first) they radically transformed what is possible to do with sequencing data.
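For readers who have not met the transform itself, here is a toy sketch of the Burrows–Wheeler transform and its inverse. This is not how bowtie or any real aligner implements it (they build compressed indexes over whole genomes); it is the naive textbook construction from sorted rotations, and the function names are chosen for this example only.

```python
def bwt(text: str, sentinel: str = "$") -> str:
    """Naive Burrows-Wheeler transform via sorted rotations.

    Quadratic in memory, so only suitable for short demonstration strings.
    """
    if sentinel in text:
        raise ValueError("sentinel must not occur in the text")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(transformed: str, sentinel: str = "$") -> str:
    """Invert the transform by repeatedly prepending the column and sorting."""
    table = [""] * len(transformed)
    for _ in range(len(transformed)):
        table = sorted(ch + row for ch, row in zip(transformed, table))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(sentinel))
    return original[:-1]
```

The transform is reversible and tends to group identical characters together, which is what makes compact, searchable indexes of a reference genome possible.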

Tools like bedtools and bedops made vast amounts of previously written sloppy and ineffective code unnecessary.

From my point of view, the problem with most bioinformatics results is not the software or coding standards but the lack of proper experimental design behind the experiments that are then processed with bioinformatics methods.

score 11 · Answer 2 · 2013-01-28

As Istvan said: too many real examples to mention. Each time a developer checks an update into a repository, it might be seen as an improvement (or at least an incremental advance). How many times a day around the world does that happen? Thousands?

I suspect that what you are really referring to is an increased awareness among bioinformaticians of the importance of the things that you mention: documentation, code quality, testing, best practices. This too is improving, as more and more students pass through formal courses in bioinformatics and computational biology. There was a time when most bioinformaticians were self-taught, ex-wetlab biologists, which probably contributed to the perception of bioinformatics programming as "amateurish".

I would encourage your pure computer science friends to examine the evidence: there is more to this topic than a single blog post that gets some attention on Hacker News :)

score 10 · Answer 3 · 2013-01-28

I am a big fan of APIs, libs, and PMs. I think they are a great example of "improvements" in bioinformatics. They let biologists forgo the computer science and focus on their goals.

Here are some of the APIs that make bioinformatics more attainable:

http://www.bioperl.org/wiki/Main_Page

http://biopython.org/wiki/Main_Page

https://github.com/pezmaster31/bamtools

EDIT:

Almost forgot the guys and gals over at Bioconductor:

http://www.bioconductor.org/

Each one of these tools / communities has improved bioinformatics - dates and history can be found at their sites.

score 7 · Answer 4 · 2013-01-28

There are no doubt still many problems with poorly developed code (from us biologists turned hackers). And we can all complain about the annoyances of file formats. But to suggest that bioinformatics is not improving and is generally worthless (as the linked rant does) is to be sublimely ignorant of both the current state of bioinformatics and its history.

There are certainly many tools that still do not work out of the box. But in my experience the success rate has massively improved. I have helped to teach several workshops on sequence analysis over the last 10 years. Currently, these begin with installing tophat, cufflinks, bedtools, samtools, etc. on a bunch of laptops or on Amazon cloud. This generally takes just a few minutes of downloading pre-compiled binaries. Compare that to the early days, when you had to be much more of a system admin and were constantly compiling from source, hunting around for missing libraries, etc.

Another good comparison is the quality and polish of tools like the consed viewer versus IGV and Savant. The former was useful in its day, but the latter are of truly amazing quality for open-source freeware. Not to mention the many excellent commercial software packages like Ingenuity Pathway Analysis (and their new variant analysis), DNAnexus, Oncomine, etc., which were all either developed first in, or seriously leverage experience from, the bioinformatics community. Ensembl and UCSC browsers continue to improve in quality every year. R and Bioconductor, while having a steep learning curve (and sometimes inadequate documentation), have also improved massively.

One last point to consider is how rapidly bioinformatics has had to adapt to changing demands. Many of the problems we see today are the result of half-baked solutions to the problem of exponential growth in data production.

score 7 · Answer 5 · 2013-01-29

I have heard varieties of this conversation now for almost 2 decades. What irks me is that actual knowledge is rarely sought; rather, it is an exercise in confirmation bias for a person's career path and focus of study.

Many have posted examples of improvements in code bases and in training new coders. These are important and interesting, but for me bioinformatics is not just about code but also about the interaction of code, data, accuracy and the need for timely results. I have worked in places where 'prototype' Perl code, hacked together in a couple of days by the 'code cowboys' (a derogatory term used for us bioinformaticians by the 'real' programmers), was used for monthly data roll-outs for the duration of long projects, while mission-critical programs from the 'real' coders never materialised because the development cycle outlasted the need for the data. I know, "horses for courses" and all that, but a priori knowledge of the lifetime of code is sometimes a luxury, as is the time for data delivery (see the insane 3-month data roll-out schedule that Ensembl used to follow (or perhaps still does?)).

The biggest improvement (and the area that still needs improving) is standards. Take gene names or function names, for example. Before the Gene Ontology this was a nightmare: so much synonym-comparison code, which was never perfect because the data itself lent itself to mistakes. Now I stress much less about finding the correct gene for a given gene name or function, and I have not had to reinvent (the code for) that particular wheel again. ... Now if only chromosome names could also be standardised (chr4=4=chrIV=IV...).
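As a rough illustration of the chromosome naming problem (chr4=4=chrIV=IV), this is the sort of glue code everyone ends up writing. It is only a sketch under stated assumptions: the function names are hypothetical, and the `roman` flag exists because a name like "X" is ambiguous (the sex chromosome in human, chromosome 10 in an organism numbered with Roman numerals), so the caller has to say which convention applies.

```python
ROMAN = {"I": 1, "V": 5, "X": 10}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral built from I, V and X to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(numeral) and ROMAN[numeral[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def normalize_chrom(name: str, roman: bool = False) -> str:
    """Strip a leading 'chr' and, if requested, convert Roman numerals."""
    core = name[3:] if name.lower().startswith("chr") else name
    if roman and core and set(core) <= set(ROMAN):
        return str(roman_to_int(core))
    return core


# All four spellings collapse to the same key when Roman numerals are enabled.
keys = {normalize_chrom(n, roman=True) for n in ["chr4", "4", "chrIV", "IV"]}
```

The fact that the function needs an organism-dependent switch at all is a fair summary of why a shared naming standard would be better than any amount of conversion code.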

score 6 · Answer 6 · 2013-01-29

6

11.2 years ago

Biojl ★ 1.7k

Bioconductor is a very good example of good documentation and reproducibility, as mentioned. I think it depends on the researchers and their background, but at least the big projects are going in the right direction. Just take a look at the recent ENCODE project papers. They have even created virtual machine images with every single package and program installed, so you can reproduce exactly the same results and navigate, analyze and do whatever you want with them. The evolution from crappy Perl scripts a few years ago to this is just mind-blowing.

ADD COMMENT • link 11.2 years ago by Biojl ★ 1.7k

2

R/Bioconductor are also great examples of active improvement. Only 5 years ago or so, documentation was almost non-existent and what did exist was awful. New users had to more or less guess how to use it, by trial and error. The situation is much better now.

ADD REPLY • link 11.2 years ago by Neilfws 49k

Ram · Answer 7 · 2013-01-31

As noted by others, there are many specific examples of where practice in bioinformatics has improved over time. I'd like to touch on a couple of more general areas where things have improved since the wild west days of the late 1990s:

The use of community-generated standards for reporting 'omics data, such as MIAME, MIRIAM, etc. (see a full listing at http://mibbi.org/)
The use of standardized ontologies to describe bioinformatics data, e.g. the Gene Ontology
The development and widespread use of systems like Taverna and Galaxy for the production of reproducible bioinformatics workflows.
The development of bioinformatics-specific operating systems like BioLinux and production of easy-to-install software package libraries like Debian Med.
The development of online communities for the exchange of bioinformatics knowledge like BioStars and SeqAnswers
The rising use of blogs and Twitter to push best practice through the global network of bioinformaticians.

The thing to keep in mind is that bioinformatics on the whole is both a young and fast-moving field, so it is inevitable that a lot of research-grade software systems will get discarded along the path of progress, making the field look underdeveloped. This will be true as long as a big part of the bioinformatics community seeks to support "hot" biological technologies that develop at a fast rate. But this does not mean that the field as a whole has made no progress; far from it, bioinformatics is a much more mature discipline now than it ever has been.

score 1 · Answer 8 · 2013-01-31

1

11.2 years ago

Thaman ★ 3.3k

Distributed computing using Hadoop in various fields:

Hadoop-BAM
Hydra
Biodoop and many more
Quora

ADD COMMENT • link 11.2 years ago by Thaman ★ 3.3k

Ram · Answer 9 · 2014-05-12

1

10.0 years ago

mikhail.shugay 3.5k

I believe that there are several simple steps to overcome bad coding problems:

Minimize the number of auxiliary scripts; try using community-tested pipelines
If you're required to write a huge amount of code de novo, try using something like Java, which is easily deployable (handle your dependencies with e.g. Maven), testable, and implicitly restricts you to good coding practices
Use a version control system
Cover all your code with unit tests
Use random tests on synthetic data
Ask others to run your code on their own, don't be afraid to show your results to biologists so they can manually check them for inconsistencies
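The "random tests on synthetic data" step can be sketched as follows. This is only a minimal illustration in Python (rather than the Java suggested above), using reverse complement as a stand-in for whatever function you actually need to test; all names here are made up for the example. The idea is to generate many random inputs from a fixed seed and assert properties that must hold for every one of them.

```python
import random

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Reverse complement of an uppercase DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def random_dna(rng: random.Random, max_len: int = 50) -> str:
    """Synthetic DNA sequence of random length, including the empty string."""
    return "".join(rng.choice("ACGT") for _ in range(rng.randint(0, max_len)))


def check_reverse_complement(trials: int = 1000, seed: int = 0) -> int:
    """Run property checks on random sequences; return the number of trials."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    for _ in range(trials):
        seq = random_dna(rng)
        rc = reverse_complement(seq)
        assert len(rc) == len(seq)            # length is preserved
        assert reverse_complement(rc) == seq  # applying it twice is the identity
        assert rc.count("A") == seq.count("T")  # base counts are swapped
    return trials
```

Tests like this do not replace hand-written cases with known answers, but they tend to hit edge cases (empty input, very short sequences) that nobody thinks to write down.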

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by mikhail.shugay 3.5k

1

There is a very recent rant I've read: Why bad scientific code beats following best practices.

http://www.yosefk.com/blog/why-bad-scientific-code-beats-code-following-best-practices.html

There is actually a lot of truth to this. Especially with R/Bioconductor, I used to get mad at people who wrote ugly spaghetti code. That was kind of annoying, but then there are people who use "advanced" concepts like object orientation, class factories, function delegation etc., where it is simply painful to even track down what code is being called. Few things are more depressing than code like that. I'll take the spaghetti.

ADD REPLY • link 10.0 years ago by Istvan Albert 100k

Ram · Answer 10 · 2014-05-12

0

10.0 years ago

BioApps ▴ 790

Some tools (and when I say "some tools", I am referring to MY tool) have started to get a graphical interface. To those with 10 years of Bash experience this may not seem a major improvement, but for the regular biologist it means a lot.

I am joking... It is not only mine. Some tools have really improved. Probably more and more will have a graphical interface.

ADD COMMENT • link updated 4.3 years ago by Ram 43k • written 10.0 years ago by BioApps ▴ 790

0

A graphical interface doesn't always equate to better. Yes, for large and complex software that does many, many things, a graphical environment is great. For something you intend wet lab biologists to use routinely, you probably want a graphical, or at least web, interface. But there are significant downsides when published tools that do straightforward tasks well only have web or graphical interfaces, like not being able to easily plug them into processing pipelines.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 10.0 years ago by DG 7.3k