Any thoughts? I'm posting here to look for bioinformatics tools that would benefit from a speed up. But if it's more general command-line tools that tend to be bottlenecks, list those too.
If you want to list more or less than three, that's fine too, of course. If possible, prioritize them in order of impact on your daily life. I'm hoping this can be a resource for all those who, like me, might be interested in helping to speed up some tools.
One thing to consider is that for lots of operations on "big data" sets, disk I/O is the limiting factor (or even network speed), so speeding up the code may not help!
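One rough way to check whether a step is I/O bound before trying to optimize its code is to compare wall-clock time with CPU time for a single pass over the input. The sketch below is only an illustration: `profile_read` and the checksum workload are hypothetical stand-ins for a real tool, and a file that is already in the operating system's page cache will look faster than a cold read.

```python
import os
import tempfile
import time
import zlib


def profile_read(path, chunk_size=1 << 20):
    """Return (wall_seconds, cpu_seconds, checksum) for one pass over a file."""
    wall_start = time.perf_counter()   # elapsed real time
    cpu_start = time.process_time()    # CPU time used by this process only
    checksum = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            # Stand-in for "real work" done on each chunk.
            checksum = zlib.crc32(chunk, checksum)
    wall = time.perf_counter() - wall_start
    cpu = time.process_time() - cpu_start
    return wall, cpu, checksum


if __name__ == "__main__":
    # Toy input; in practice point this at a real FASTQ/BAM on the disk in question.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"ACGT" * 1_000_000)
    wall, cpu, _ = profile_read(tmp.name)
    os.remove(tmp.name)
    print(f"wall={wall:.4f}s cpu={cpu:.4f}s")
```

If CPU time is close to wall time, the process is mostly computing and faster code may help. If CPU time is much smaller than wall time, the process is mostly waiting on disk or network.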
Architecture speedups are just as interesting and challenging as code speedups. So maybe I/O bound applications are worth listing as well.
But they can potentially cost significantly more :)
BTW: You must have something on your list. Why not include those programs in your original post?
I don't have anything on my list; that's the problem!
I'm a physicist turned coder (like half the world, it seems). Outside of Rosalind and Coursera, I've never done any bioinformatics. But I've learned that I like speeding things up and I thought it would be fun to find out where I could make a practical contribution using that interest.
However, if it turns out that there aren't any tools that people wish were faster, then that's useful data too.
As @Chris has already indicated, a lot of high-throughput sequence data requires massive I/O, which ultimately turns out to be the bottleneck.
There are many software packages with routines that could probably be optimized by a master coder to run more efficiently (e.g. see this recent call from a fellow BioStar: Call for requirement: a fast all-in-one FASTQ preprocessor (in C++, multi-threaded) ). Variant callers take a long time to process data and may be good candidates (freebayes is one example).
I/O has indeed become one of the biggest holdups. The community has given a lot of attention to CPU parallelisation in recent years, so many functions now have parallelisation enabled, and it's also easy to add to old code written for single-thread/single-CPU processing via forking. It's feasible to analyse multiple whole-genome samples in a single day these days.
That said, it still takes >24 hours to back up 1.5 terabytes of data (my personal and work files since 2000).
As I'm Irish, I'll also throw some humour into the discussion: years ago (2013) I attempted to back up ~2 terabytes of data from a local heavy-duty hard disk (whole genome seq) to an NFS 'storage' drive on the same network. Windows came back with a predicted time of 154 years for it to complete, at which point my own mortality was instantly thrown into question.
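On the point about adding parallelisation to old single-threaded code: a minimal sketch of the idea, assuming the per-sample work is independent, is to leave the original function untouched and map it over samples with a pool of worker processes. Here `gc_fraction` is just a toy stand-in for a real single-threaded routine, and `gc_per_sample` and the sample names are made up for illustration.

```python
from multiprocessing import Pool


def gc_fraction(sequence):
    """Single-threaded worker: fraction of G/C bases in one sequence."""
    if not sequence:
        return 0.0
    upper = sequence.upper()
    return (upper.count("G") + upper.count("C")) / len(upper)


def gc_per_sample(samples, workers=1):
    """Apply gc_fraction to every sample, optionally across worker processes."""
    names = list(samples)
    sequences = [samples[name] for name in names]
    if workers <= 1:
        # Serial path: identical results, useful as a baseline.
        results = [gc_fraction(seq) for seq in sequences]
    else:
        # Each worker process runs the unchanged single-threaded function.
        with Pool(processes=workers) as pool:
            results = pool.map(gc_fraction, sequences)
    return dict(zip(names, results))


if __name__ == "__main__":
    toy = {"sample_a": "ACGTACGT", "sample_b": "GGGCCC", "sample_c": "ATATAT"}
    print(gc_per_sample(toy, workers=2))
```

This only pays off when each sample takes long enough to outweigh the cost of starting processes and shipping data to them, and it does nothing for the I/O side of the problem.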
Thanks for these two leads.
That's just what bad programmers want you to believe. ;)
dhbradshaw: As you can see, we are all over the place with these wish lists.
If you are really a master programmer, then your best bet is to have a developer of one of these tools contact you with very specific request(s) about accelerating parts of their code. Whether that will happen remains to be seen :-) I assume you have a limited amount of time available to dedicate to this sort of thing.
That path makes sense and is interesting.
I haven't claimed and will not claim to be a master developer.
But I don't think that's what's called for. What's called for is someone who will focus on a specific problem, read and get advice, think through the algorithm candidates and the effects of scaling on speed, make measurements, and iterate their way toward greater speed. It's more about work than about being a master developer. It's more about doing than about being.
It's important to realize that, because if you don't, you miss out on opportunities to make contributions. And you don't grow toward "mastery" like you otherwise would, either.
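As a small illustration of the "measure and iterate" loop described above, here is a sketch that times two candidate implementations of the same operation at increasing input sizes. Reverse complement is used only as a toy example; the function names and input sizes are arbitrary choices for illustration, not taken from any particular tool.

```python
import timeit

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}
TABLE = str.maketrans("ACGT", "TGCA")


def revcomp_loop(seq):
    """Candidate 1: per-base dictionary lookup over the reversed sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


def revcomp_translate(seq):
    """Candidate 2: one translate call, then a slice to reverse."""
    return seq.translate(TABLE)[::-1]


def measure(func, seq, repeats=5):
    """Best-of-N wall time in seconds for a single call of func(seq)."""
    return min(timeit.repeat(lambda: func(seq), number=1, repeat=repeats))


if __name__ == "__main__":
    for size in (10_000, 100_000, 1_000_000):
        seq = "ACGT" * (size // 4)
        # Check the candidates agree before comparing their speed.
        assert revcomp_loop(seq) == revcomp_translate(seq)
        print(size, measure(revcomp_loop, seq), measure(revcomp_translate, seq))
```

Checking that the candidates agree first, and watching how the timings grow with input size, matters more than any single number.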
You have clearly done this before and are pragmatic. Hopefully you will be able to find an interesting project to work on among the several mentioned here. If you are going to start something new, then anything to do with variant calling/annotation would be immediately useful, as has been said elsewhere here. Let us know if you need introductory materials on that topic.
Combining your feedback with that of Chris Miller and others, the signal points fairly strongly toward variant calling and annotation. So I'm on board.
And yes, please! I do indeed need introductory materials on everything variant related.
Here is some material to get you started (sorry I missed this post). Video 1 and Video 2.
Do you have a preference for a language? Probably important to mention.
Rust -- I'd probably write anything over from scratch. So maybe smaller is better to start :-D
If I could double or triple the speed or throughput of my workplace coffee machine, this certainly would have the greatest impact on my daily life and productivity.
If you can find the source code for it then dhbradshaw will make it happen :)
I wish I had a 3D printer:
And many more! I wonder if they really work.