I would say this is not unique to bioinformatics; rather, it is just the direction that software development is heading in general.
Back in the old days when processors only had limited capacity, software developers needed to write their code as efficiently as possible, often optimising it for the CPU architecture it was supposed to run on. Speed of execution was king.
As processors became exponentially more powerful, and compilers became as good as, if not better than, hand-tuned C, speed of execution took a back seat to speed of development. Higher-level languages that abstracted complexity away became (and still are) incredibly popular.
Now that development time for even the most complicated apps is measured in weeks or months rather than years, focus has shifted from development time back to execution time. This hasn't been easy, because it means giving up a lot of the abstraction that current-day developers are used to. For example, it is often 1000x more performant to use typed arrays (Cython/numpy) than whatever 16-byte blobs Python's data objects are typically stored as - but this speed comes with restrictions, problems, and a need for a certain level of technical expertise, all of which cost money for whoever is funding the project.
When internet adoption exploded, Facebook/Twitter/Google, etc. didn't have time to redesign programming paradigms, and 'solved' the problem of scale by parallelization: Hadoop, map/reduce, Google Bigdoc, huge data centres, server clusters, etc. I think it's pretty well established these days among HPC experts that this was a bad trend for everyone else to follow. Often the structure of Hadoop overshadows the fact that the code and messaging that Google uses is, itself, extremely well written to begin with. But other developers picked it up because it allows them to offload their work onto operations - the people who buy/maintain the hardware the code runs on. Code not fast enough? Buy more solid-state drives! I will fix my code as a last resort.
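To make the typed-array point concrete, here is a minimal sketch of the storage difference. It uses the standard-library `array` module as a stand-in for numpy/Cython typed arrays (my substitution, so it runs without extra packages), it assumes CPython, and it only measures memory, not the speedup quoted above.

```python
import sys
from array import array

n = 100_000

# A plain list holds pointers to separately allocated float objects.
boxed = [float(i) for i in range(n)]

# A typed array holds the raw C doubles in one contiguous buffer.
typed = array("d", boxed)

# Size of the list itself plus every float object it points to.
boxed_bytes = sys.getsizeof(boxed) + sum(sys.getsizeof(x) for x in boxed)
# Size of the array object including its buffer.
typed_bytes = sys.getsizeof(typed)

print(f"list of floats: {boxed_bytes / n:.1f} bytes per element")
print(f"array('d'):     {typed_bytes / n:.1f} bytes per element")
```

The restriction shows up immediately: `typed.append("chr1")` raises a `TypeError`, whereas the list accepts anything. That is the trade between abstraction and speed in miniature.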
The conclusion: the language probably made no difference and is not what increased their performance by 3x. Rewriting all their code and reducing it by 40% is what actually made things faster.
My point is, we are heading into a new era of program development, where algorithm design - which has nothing to do with the CPU hardware, parallelization, or the programming language it runs on - is king. This is where the 10x, 100x, or even 1000x speedups can be found.
"Cutting out the unnecessary complexities of processing and managing data will be beneficial to everyone" - exactly right! But I wouldn't say this is going to be a simple task. Cutting out the unnecessary computations in an algorithm requires a near god-like knowledge of all possibilities in input/output/computation. To put it more poetically: the daily activities of a child are extremely simple. Their inputs and outputs are not complex... but it still takes an adult, aware of the entire picture that is a human life, to design this incredibly boring day.
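A toy illustration of what I mean by algorithm design, using hypothetical function names and a made-up task (checking a list of reads for duplicates). Both functions give the same answer on the same hardware in the same language; only the amount of work differs.

```python
def has_duplicate_quadratic(items):
    """Compare every pair of items: O(n^2) comparisons."""
    for i in range(len(items)):
        for j in range(i + 1, len(items)):
            if items[i] == items[j]:
                return True
    return False


def has_duplicate_linear(items):
    """Remember what has been seen in a set: O(n) expected time."""
    seen = set()
    for item in items:
        if item in seen:
            return True
        seen.add(item)
    return False


reads = ["ACGT", "TTGA", "CCAT", "ACGT"]
print(has_duplicate_quadratic(reads), has_duplicate_linear(reads))
```

With a few reads nobody notices the difference; with millions, the first version does on the order of n^2/2 comparisons while the second does about n set lookups, and no amount of extra hardware closes that gap cheaply.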
By the writing on your wall, looks like I'll be out of a job soon if I don't evolve :)
As a scientist, you're always out of a job if you don't evolve. :)
Well, it all depends ;-) but I don't think being out of a job will be a threat. It is more that running cuffdiff is not what bioinformatics is supposed to be.
On the other hand, even small labs are doing dozens of whole genomes, requiring more infrastructure and analysts who are more than technicians.
MinION? Last time I checked it had crazy error rates along with lots of chimeric sequences (we found sequences that were not supposed to be there - with their control lambda DNA!). Has this thing improved since then?
It is an early platform and as such may have many problems. Users report notable improvements in quality with new chemistry releases; I will have the chance to evaluate it myself very soon.
But even today I can see that once we start dealing with 160Kb reads everything that we think we know about bioinformatics needs to be reevaluated - it is like a naked emperor situation - what we call bioinformatics is really working around the limitations of producing billions of very short reads. What we really want is one read that corresponds to the chromosome, the transcript, the bound DNA, the actual unit of DNA under study etc.
Come to think of it, is a BAM file suited to represent a 160Kb alignment? Not so clear. Remember, BAM is supposed to act like an indexed database for querying an interval out of hundreds of millions of entries. But if I have only a few thousand, why bother with that?
It may sound ridiculous to say this, but BAM files did not exist 10 years ago and perhaps will not be used 10 years from now - and I think that will come true.
SAM/BAM was designed with even whole-genome alignment in mind. Multiple segments per read, hard clipping and the R-tree index were particularly geared towards long reads. In the era of 35bp reads, we didn't need these features. SAM has issues with long reads because we lacked use cases at that time, but these issues are hard to solve anyway. If I were to design a new long-read alignment format from scratch, I am not sure I could do much better. That said, I also believe we will be doing alignment less and less in the future. SAM will die ultimately. BAM will die sooner. Personally, I was already looking forward to long reads when I designed SAM/BAM and developed bwa-sw and later bwa-mem. Long reads will rule. It is just a matter of time.
One thing that should/will change IMO is that once we measure (really) long sequences we could end up in situations where line-oriented I/O is not efficient anymore. We can't process files where the entire sequence for chr1 is placed on a single line.
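As a rough sketch of the "why bother" case: with only a few thousand records, an interval query can be a plain scan over an in-memory list, with no index at all. The `Alignment` record and the `overlapping` function are hypothetical names for illustration, not part of any BAM library, and I am assuming 0-based half-open coordinates.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    name: str
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def overlapping(alignments, chrom, start, end):
    """Return alignments on `chrom` that overlap [start, end) by scanning them all."""
    return [
        a for a in alignments
        if a.chrom == chrom and a.start < end and a.end > start
    ]


long_reads = [
    Alignment("read1", "chr1", 0, 160_000),
    Alignment("read2", "chr1", 150_000, 310_000),
    Alignment("read3", "chr2", 0, 160_000),
]
hits = overlapping(long_reads, "chr1", 155_000, 156_000)
print([a.name for a in hits])
```

A scan like this is O(n) per query, which is negligible for thousands of records and hopeless for hundreds of millions, which is exactly the regime the BAM index was built for.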
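One possible alternative to line-oriented reading, sketched under my own assumptions (a hypothetical `gc_count` function and a sequence simulated in memory): read fixed-size chunks, so memory use stays bounded no matter how long the single line is.

```python
import io


def gc_count(handle, chunk_size=1 << 16):
    """Count G and C bases by reading fixed-size chunks instead of whole lines."""
    total = 0
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:  # an empty string signals end of file
            break
        upper = chunk.upper()
        total += upper.count("G") + upper.count("C")
    return total


# A whole "chromosome" placed on a single line, simulated in memory.
single_line = io.StringIO("ACGT" * 250_000 + "\n")
print(gc_count(single_line))
```

This only works so simply because counting bases does not care where the chunk boundaries fall; anything that looks at neighbouring bases (k-mers, motifs) would need to carry an overlap from one chunk to the next.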
Works pretty well for us, see: