The recent popularity of the article A Farewell To Bioinformatics has led me to debate what many of my pure computer science friends see as big problems in bioinformatics—poor code quality, lack of testing and documentation, and unreliability. I said that things generally seem to be getting better over time, but they said they keep hearing about vague "improvement" in bioinformatics with no concrete examples.
As just an undergrad (in CS and Molecular Biology) with a short history in the field, I couldn't come up with much to say. So I turn to you.
Can you give any specific examples of how bioinformatics has improved over time? From specific tools and file formats to general practices?
Well I think the examples are so numerous that it is hard to know where to start.
For example, it used to be that people processed read alignments in the so-called ELAND format, which just happened to be whatever output the CASAVA pipeline produced. Adding insult to injury, there used to be a normal ELAND, a sorted ELAND, and an extended ELAND, each with its own quirks and inconsistencies. Tools only worked on certain outputs, so you had all kinds of strange converters from one mapping format to another. Today we have the SAM standard.
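To give a concrete sense of what the standard buys you, here is a minimal sketch (the helper name is mine, not from any real library) of parsing the 11 mandatory tab-separated fields that every SAM alignment line shares, regardless of which aligner produced it:

```python
# Sketch: every SAM alignment line has the same 11 mandatory
# tab-separated fields, so one tiny parser works for any aligner's output.
SAM_FIELDS = ["qname", "flag", "rname", "pos", "mapq",
              "cigar", "rnext", "pnext", "tlen", "seq", "qual"]

def parse_sam_line(line):
    """Parse one alignment line into a dict of the mandatory fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, fields[:11]))
    record["flag"] = int(record["flag"])   # bitwise flags
    record["pos"] = int(record["pos"])     # 1-based leftmost position
    return record
```

Contrast that with the old world, where each of the three ELAND variants needed its own parser.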
FASTQ quality encodings used to be all over the map; one never knew for sure which encoding the data were in. Today it is all Sanger encoding.
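The old guessing game could even be automated from the ASCII range of the quality characters; a minimal sketch (the function name and heuristic thresholds are mine, based on the published Phred+33 and Phred+64 offsets) that distinguishes Sanger from old Illumina quality strings:

```python
def guess_fastq_encoding(quality_strings):
    """Guess 'sanger' (Phred+33) vs 'illumina-1.3' (Phred+64) quality encoding.

    Heuristic sketch: characters below ASCII 59 (';') only occur in
    Phred+33 data; characters above ASCII 74 ('J') only in Phred+64.
    """
    low = min(min(q) for q in quality_strings)
    high = max(max(q) for q in quality_strings)
    if ord(low) < 59:
        return "sanger"
    if ord(high) > 74:
        return "illumina-1.3"
    return "ambiguous"   # overlapping range: cannot decide from these reads
```

The fact that nobody needs to write this any more is exactly the kind of quiet improvement the question is asking about.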
When bowtie introduced the Burrows–Wheeler transform for aligning reads (I'm not sure if they were the first), it radically transformed what is possible to do with sequencing data.
Tools like bedtools and bedops made vast amounts of previously written sloppy and inefficient code unnecessary.
From my point of view, the problem with most bioinformatics results is not the software or coding standards but the lack of proper experimental design behind the experiments that are then processed with bioinformatics methods.
As Istvan said: too many real examples to mention. Each time a developer checks an update into a repository might be seen as an improvement (or at least, an incremental advance). How many times a day around the world does that happen? Thousands?
I suspect that what you are really referring to is an increased awareness among bioinformaticians of the importance of the things that you mention: documentation, code quality, testing, best practices. This too is improving, as more and more students pass through formal courses in bioinformatics and computational biology. There was a time when most bioinformaticians were self-taught ex-wetlab biologists, which probably contributed to the perception of bioinformatics programming as "amateurish".
I would encourage your pure computer science friends to examine the evidence: there is more to this topic than a single blog post that gets some attention on Hacker News :)
There are no doubt still many problems with poorly developed code (from us biologists turned hackers), and we can all complain about the annoyances of file formats. But to suggest that bioinformatics is not improving and is generally worthless (as the linked rant does) is to be sublimely ignorant of both the current state of bioinformatics and its history.

There are certainly many tools that still do not work out of the box, but in my experience the success rate has massively improved. I have helped to teach several workshops on sequence analysis over the last 10 years. Currently, these begin with installing tophat, cufflinks, bedtools, samtools, etc. on a bunch of laptops or on the Amazon cloud. This generally takes just a few minutes of downloading pre-compiled binaries. Compare that to the early days, when you had to be much more of a system administrator and were constantly compiling from source, hunting around for missing libraries, and so on.

Another good comparison is the quality and polish of a tool like the consed viewer versus IGV and Savant. The former was useful in its day, but the latter are truly amazing quality for open-source freeware. Not to mention many excellent commercial software packages like Ingenuity Pathway Analysis (and their new variant analysis), DNAnexus, Oncomine, etc., which were all first developed in, or seriously leverage experience from, the bioinformatics community. The Ensembl and UCSC browsers continue to improve in quality every year. R and Bioconductor, while having a steep learning curve (and sometimes inadequate documentation), have also improved massively.

One last point to consider is how rapidly bioinformatics has had to adapt to changing demands. Many of the problems we see today are the result of half-baked solutions to the problem of exponential growth in data production.
I have heard varieties of this conversation now for almost 2 decades. What irks me is that rarely is actual knowledge sought; rather, it is an exercise in confirmation bias for a person's career path and focus of study.
Many have posted examples of improvements in the code base and in training new coders. These are important and interesting, but for me bioinformatics is not just about code but also about the interaction of code, data, accuracy, and the need for timely results. I have worked in places where 'prototype' Perl code, hacked together in a couple of days by the 'code cowboys' (a derogatory term used for us bioinformaticians by the 'real' programmers), was used for monthly data roll-outs for the duration of long projects, while mission-critical programs from the 'real' coders never materialised because the development cycle outlasted the need for the data. I know, "horses for courses" and all that, but a priori knowledge of the lifetime of code is sometimes a luxury, as is the time for data delivery (see the insane 3-month data roll-out schedule that ensEMBL used to follow (or perhaps still does?)).
The biggest improvement (and the area that still needs improvement) is standards. Take gene names or function names, for example. Pre-Gene Ontology, this was a nightmare: so much synonym-comparison code, which was never perfect because the data itself invited mistakes. Now I stress much less about finding the correct gene for a given gene name or function, and I have not reinvented (the code for) that particular wheel again. Now if only chromosome names could also be standardised... (chr4 = 4 = chrIV = IV...)
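In the absence of such a standard, everyone ends up writing their own little normaliser. A hypothetical sketch (the function and its mapping are mine, not from any library), which also shows why a real standard is needed: the Roman numeral X collides with the sex chromosome X, so no local hack can ever be fully safe:

```python
def normalize_chrom(name):
    """Map 'chr4', '4', 'chrIV', 'IV' to a canonical 'chr4' (sketch only).

    Roman numerals beyond V are deliberately omitted: 'X' is ambiguous
    between Roman numeral 10 and the sex chromosome, which is exactly
    the kind of mess a community standard would eliminate.
    """
    roman = {"I": "1", "II": "2", "III": "3", "IV": "4", "V": "5"}
    core = name[3:] if name.lower().startswith("chr") else name
    return "chr" + roman.get(core.upper(), core)
```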
Bioconductor is a very good example of good documentation and reproducibility, as mentioned. I think it depends on the researchers and their background, but at least the big projects are heading in the right direction.
Just take a look at the recent ENCODE project papers. They have even created virtual machine images with every single package and program installed, so you can reproduce exactly the same results and navigate, analyze, and do whatever you want with them. The evolution from crappy Perl scripts a few years ago to this is just mind-blowing.
As noted by others, there are many specific examples of where practice in bioinformatics has improved over time. I'd like to touch on a couple of more general areas where things have improved since the wild-west days of the late 1990s:
The use of community-generated standards for reporting 'omics data, such as MIAME, MIRIAM, etc. (see a full listing at http://mibbi.org/)
The use of standardized ontologies to describe bioinformatics data, e.g. the Gene Ontology
The development and widespread use of systems like Taverna and Galaxy for the production of reproducible bioinformatics workflows.
The development of bioinformatics-specific operating systems like BioLinux and production of easy-to-install software package libraries like Debian Med.
The development of online communities for the exchange of bioinformatics knowledge like BioStars and SeqAnswers
The rise of blogs and Twitter to spread best practice through the global network of bioinformaticians.
The thing to keep in mind is that bioinformatics on the whole is both a young and fast moving field, so it is inevitable that there will be a lot of research grade software systems that get discarded along the path of progress, making the field look underdeveloped. This will be true as long as a big part of the bioinformatics community seeks to support "hot" biological technologies that develop at a fast rate. But this does not mean that the field as a whole has made no progress; far from it, bioinformatics is a much more mature discipline now than it ever has been.
I believe there are several simple steps to overcome bad-coding problems:
Minimize the number of auxiliary scripts; try using community-tested pipelines.
If you're required to write a huge amount of code de novo, try using something like Java, which is easily deployable (handle your dependencies with e.g. Maven), testable, and implicitly nudges you toward good coding practices.
Use a version control system
Cover all your code with unit tests
Use random tests on synthetic data
Ask others to run your code on their own, don't be afraid to show your results to biologists so they can manually check them for inconsistencies
Some tools (and when I say "some tools", I am referring to MY tool) have started to have graphical interfaces. To those with 10 years of Bash experience this may not seem like a major improvement, but for the regular biologist it means a lot.
I am joking... it is not only mine. Some tools have really improved, and probably more and more will have graphical interfaces.