Forum:Survey/Vote: If you could double the speed of any three commandline tools, which three would they be?
7
4.0 years ago
dhbradshaw ▴ 130

Any thoughts? I'm posting here to look for bioinformatics tools that would benefit from a speed up. But if it's more general command-line tools that tend to be bottlenecks, list those too.

If you want to list more or less than three, that's fine too, of course. If possible, prioritize them in order of impact on your daily life. I'm hoping this can be a resource for all those who, like me, might be interested in helping to speed up some tools.

RNA-Seq alignment blast next-gen sequencing Forum • 1.9k views
ADD COMMENT
4

One thing to consider is that for lots of operations on "big data" sets, disk I/O is the limiting factor (or even network speed), so speeding up the code may not help!
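A rough way to check this for a given file is to time the reads separately from the work done on each chunk. The sketch below is only illustrative: `profile_read_vs_compute` is a made-up helper, the CRC32 stands in for whatever real processing a tool does, and the operating system's page cache can make a second pass over the same file look much faster than a cold read.

```python
import time
import zlib


def profile_read_vs_compute(path, chunk_size=1 << 20):
    """Return (read_seconds, compute_seconds, n_bytes) for one pass over `path`."""
    read_s = 0.0
    compute_s = 0.0
    n_bytes = 0
    checksum = 0
    with open(path, "rb") as handle:
        while True:
            t0 = time.perf_counter()
            chunk = handle.read(chunk_size)
            t1 = time.perf_counter()
            read_s += t1 - t0
            if not chunk:
                break
            # Stand-in for the real per-chunk work (parsing, alignment, ...).
            checksum = zlib.crc32(chunk, checksum)
            compute_s += time.perf_counter() - t1
            n_bytes += len(chunk)
    return read_s, compute_s, n_bytes
```

If `read_s` dominates, faster code is unlikely to help much; if `compute_s` dominates, it might.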

ADD REPLY
0

Architecture speedups are just as interesting and challenging as code speedups. So maybe I/O bound applications are worth listing as well.

ADD REPLY
0

But they can potentially cost significantly more :)

BTW: You must have something on your list, so why not include those programs in your original post?

ADD REPLY
0

I don't have anything on my list; that's the problem!

I'm a physicist turned coder (like half the world, it seems). Outside of Rosalind and Coursera, I've never done any bioinformatics. But I've learned that I like speeding things up and I thought it would be fun to find out where I could make a practical contribution using that interest.

However, if it turns out that there aren't any tools that people wish were faster, then that's useful data too.

ADD REPLY
2

As @Chris has already indicated, a lot of high-throughput sequence data requires massive I/O, which ultimately turns out to be a bottleneck.

There are probably many software packages with routines that could be optimized by a master coder to run more efficiently (e.g. see this recent call from a fellow BioStar: Call for requirement: a fast all-in-one FASTQ preprocessor (in C++, multi-threaded)). Variant callers take a long time to process data and may be good candidates (freebayes is one example).

ADD REPLY
1

I/O has indeed become one of the biggest holdups. The community gave a lot of attention to CPU parallelisation in recent years and now we have a lot of functions for which parallelisation is enabled, and it's also easy to implement on old code written for single thread/CPU processing via forking. It's feasible to analyse multiple whole genome samples in a single day these days.

That said, it still takes >24 hours to back-up 1.5 terabytes of data (my personal and work files since 2000).
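To put a number on that, here is a back-of-the-envelope sketch. It assumes decimal terabytes (1 TB = 10^12 bytes) and exactly 24 hours; since the backup takes longer than that, the result is an upper bound on the sustained rate.

```python
def throughput_mb_per_s(n_bytes, seconds):
    """Average throughput in MB/s (1 MB = 10**6 bytes)."""
    return n_bytes / seconds / 1e6


# 1.5 TB in 24 hours; both figures come from the post above.
backup = throughput_mb_per_s(1.5e12, 24 * 3600)
print(f"{backup:.1f} MB/s")  # prints "17.4 MB/s"
```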

As I'm Irish, I'll also throw some humour into the discussion: years ago (2013) I attempted to back up ~2 terabytes of data from a local heavy-duty hard disk (whole genome seq) to an NFS 'storage' drive on the same network. Windows came back with a predicted time of 154 years for it to complete, at which point my own mortality was instantly thrown into question.

ADD REPLY
0

Thanks for these two leads.

ADD REPLY
0

disk I/O is the limiting factor

That's just what bad programmers want you to believe. ;)

ADD REPLY
1

dhbradshaw : As you can see we are all over the place with these wish lists.

If you are really a master programmer, then your best bet is to have a developer of one of these tools contact you with very specific request(s) about accelerating parts of their code. Whether that will happen remains to be seen :-) I assume you have a limited amount of time available to dedicate to this sort of thing.

ADD REPLY
0

That path makes sense and is interesting.

ADD REPLY
0

I haven't claimed and will not claim to be a master developer.

But I don't think that's what's called for. What's called for is someone who will focus on a specific problem, read and get advice, think through the algorithm candidates and the effects of scaling on speed, make measurements, and iterate their way toward greater speed. It's more about work than about being a master developer. It's more about doing than about being.

It's important to realize that because if you don't then you miss out on opportunities to make contributions. And you don't grow toward "mastery" like you otherwise would either.

ADD REPLY
1

You have clearly done this before and are pragmatic. Hopefully you will be able to find an interesting project out of the several mentioned here to work on. If you are going to start new then something to do with variant calling/annotation would be immediately useful, as has been said elsewhere here. Let us know if you need introductory materials on that topic.

ADD REPLY
0

Combining your feedback with that of Chris Miller and others points with a fairly high signal toward variant calling and annotation. So I'm on board.

And yes, please! I do indeed need introductory materials on everything variant related.

ADD REPLY
1

Here is some material to get you started (sorry I missed this post). Video 1 and Video 2.

ADD REPLY
0

Do you have a preference for a language? Probably important to mention.

ADD REPLY
0

Rust -- I'd probably write anything over from scratch. So maybe smaller is better to start :-D

ADD REPLY
0

If possible, prioritize them in order of impact on your daily life.

If I could double or triple the speed or throughput of my workplace coffee machine, this certainly would have the greatest impact on my daily life and productivity.

ADD REPLY
0

If you can find the source code for it then dhbradshaw will make it happen :)

ADD REPLY
4
4.0 years ago

In the interest of offering useful leads:

1) The neural networks used for MHC Class I and II epitope binding prediction. (sadly, most are not open source - see http://tools.iedb.org/mhci/ )

2) Variant annotation with VEP. It's an amazing tool, but is relatively slow.

3) Variant calling, especially for indels. Pindel, in particular, seems to often be a bottleneck in our pipelines.

ADD COMMENT
0

Re 2), Illumina recently open-sourced Nirvana, which looks pretty promising. .NET is annoying, but it seems like a solid piece of software (and OSS should always be encouraged)!

ADD REPLY
0

Really interesting code and applications here. The MHC binding application link is broken for me: the URL needs the leading http://.

ADD REPLY
4
4.0 years ago
5heikki 10.0k

In my pipelines, more often than not, GNU sort is the bottleneck.

ADD COMMENT
0

I wondered: would a linear-time sorting algorithm like radix sort improve that, or is the bottleneck the indexing and later random access of the sorted lines? I mean, formats like BED or VCF have nicely defined columns, and those are what need to be sorted most of the time.
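To make the idea concrete, here is a minimal sketch of a stable LSD radix sort applied to BED-like records. It is an illustration only: `radix_sort_by_key` and `sort_bed` are hypothetical helpers, records are assumed to be tuples of `(chrom, start, end)` with non-negative starts below 2^32, and a pure-Python version like this will not outrun GNU sort. Whether a compiled equivalent would is exactly the open question.

```python
def radix_sort_by_key(records, key, key_bytes=4):
    """Stable LSD radix sort by a non-negative integer key (one byte per pass)."""
    for shift in range(0, 8 * key_bytes, 8):
        buckets = [[] for _ in range(256)]
        for rec in records:
            buckets[(key(rec) >> shift) & 0xFF].append(rec)
        # Concatenating buckets in order keeps earlier passes' ordering (stability).
        records = [rec for bucket in buckets for rec in bucket]
    return records


def sort_bed(records, chrom_order):
    """Sort (chrom, start, end) tuples by chromosome rank, then by start."""
    rank = {name: i for i, name in enumerate(chrom_order)}
    by_start = radix_sort_by_key(records, key=lambda r: r[1])
    # Final stable bucket pass on the chromosome column.
    buckets = [[] for _ in chrom_order]
    for rec in by_start:
        buckets[rank[rec[0]]].append(rec)
    return [rec for bucket in buckets for rec in bucket]
```

Each pass is linear in the number of records, so the total cost is O(n) for fixed-width keys, compared with O(n log n) for a comparison sort.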

ADD REPLY
0

I think with large files the main bottleneck is I/O.

ADD REPLY
0

Do you use all the tricks to speed up sort? See discussion on Any Quick Sort Method For Large Bed File As 20G In Size?.
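The linked discussion is not reproduced here, but the usual GNU sort knobs are the C locale, multiple threads, a larger memory buffer, and a fast temporary directory. A hedged sketch of how one might assemble such a call from Python follows; `gnu_sort_bed_command` is a made-up helper, and the thread count, buffer size and temp directory are placeholder values to tune for your machine.

```python
import os


def gnu_sort_bed_command(bed_path, out_path, threads=4, buffer_size="4G", tmp_dir="/tmp"):
    """Build argv and environment for a GNU sort of a BED file by chrom, then start."""
    argv = [
        "sort",
        "-k1,1",                  # chromosome column, compared as bytes
        "-k2,2n",                 # start column, compared numerically
        f"--parallel={threads}",  # number of sorts run concurrently
        "-S", buffer_size,        # main-memory buffer size
        "-T", tmp_dir,            # where temporary files are written
        "-o", out_path,
        bed_path,
    ]
    # LC_ALL=C selects plain byte comparison instead of locale-aware collation.
    env = dict(os.environ, LC_ALL="C")
    return argv, env
```

The result can be passed to `subprocess.run(argv, env=env, check=True)`. These options are GNU coreutils specific, so other `sort` implementations may reject some of them.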

ADD REPLY
2
4.0 years ago
anu014 ▴ 190

I would like to name these:

  1. TrimGalore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore ) - No parallel processing
  2. Trinity (https://github.com/trinityrnaseq/trinityrnaseq/wiki )
  3. MAKER (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014 )

I wish de novo tools such as the last two were faster.

ADD COMMENT
0

Can you annotate this list a little bit for outsiders like me? Also, I'm trying to understand priorities here since it looks like 1 should be the priority but then you seem to call out the last 2. Or are you saying that there are other de novo tools that you would list as well?

ADD REPLY
2

TrimGalore does quality trimming and adapter removal, but there are ultrafast and multithreaded alternatives out there, such as Skewer, so I think this is not the best type of tool to go for.

ADD REPLY
0

Wow! It supports multi-threading. I wasn't aware of this tool. Thanks @ATPoint!

ADD REPLY
0

But doesn't it support only paired-end data? "Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-182)

ADD REPLY
1

No, it has a single-end mode, too. I think the authors wanted "paired" in the title, as they claim that it outperformed existing tools at that time in terms of paired-end accuracy.

ADD REPLY
0

Sure @dhbradshaw, I should have done that. As we now have an alternative to TrimGalore (Skewer), I would like to omit it.

Trinity is for de novo reconstruction of transcriptomes from RNA-seq data. You can specify threads and memory for it. But it still takes too much time, even on a machine with 256 GB of memory and 32 cores: it took around 48-60 hours to process 600 GB of data.

MAKER is basically used for de novo annotation of newly sequenced genomes. It runs HMM modelling in the background. I guess that might be one of the reasons it is slow. Threading is available for this too.

ADD REPLY
2
4.0 years ago
ATpoint 55k

SAMtools mpileup. Takes about 24h for a 50X WGS when directly piped into VarScan using a 2.6GHz Intel Xeon node. Multithreading would be a charm. Same goes for bam-readcount in case one uses the VarScan fpfilter.

ADD COMMENT
0

Out of curiosity, how do you normally run mpileup? - piped into bcftools call? I recently got back to DNA-seq after a sabbatical of 4 years away from it.

ADD REPLY
1

I use the mpileup-Varscan combination:

$SAMTOOLS mpileup -q 20 -Q 25 -B -d 1000 -f hg38.fa normal.bam tumor.bam | $VARSCAN somatic /dev/stdin ${OUTNAME} -mpileup --strand-filter 1 --output-vcf
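
Since mpileup itself is single-threaded, a common workaround is to run one job per chromosome or region and merge the results afterwards. The sketch below only shows the scatter step, under several assumptions: `mpileup_commands` and `run_all` are hypothetical helpers, the filter values are copied from the command above, and each region's output would still have to be piped into VarScan and the per-region VCFs concatenated. Whether VarScan somatic gives identical calls when the input is split this way is not something verified here.

```python
from concurrent.futures import ThreadPoolExecutor


def mpileup_commands(regions, reference, bams, samtools="samtools"):
    """Build one `samtools mpileup` argv per region, restricted with -r."""
    return [
        [samtools, "mpileup", "-q", "20", "-Q", "25", "-B", "-d", "1000",
         "-r", region, "-f", reference, *bams]
        for region in regions
    ]


def run_all(commands, runner, workers=4):
    """Apply `runner` to every command, at most `workers` at a time, keeping order.

    Threads are enough here because the heavy lifting would happen in the
    child processes that `runner` starts (e.g. via subprocess).
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(runner, commands))
```

A caller would supply a `runner` that starts the command with `subprocess`, connects it to the downstream caller, and writes a per-region output file.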
ADD REPLY
1
4.0 years ago

Here's another one: bam-readcount, which is fairly fast until you try to get counts/info for 3M SNPs from a WGS BAM file: https://github.com/genome/bam-readcount
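One workaround people sometimes try for large site lists is to split them by chromosome so that each chunk can go to a separate run. A minimal sketch of that split is below; `split_sites_by_chrom` is a made-up helper, sites are assumed to be `(chrom, pos)` pairs, and whether this actually helps depends on where the time is going, which has not been measured here.

```python
from collections import defaultdict


def split_sites_by_chrom(sites):
    """Group (chrom, pos) sites by chromosome, each group sorted by position."""
    groups = defaultdict(list)
    for chrom, pos in sites:
        groups[chrom].append(pos)
    return {chrom: sorted(positions) for chrom, positions in groups.items()}
```

Sorting within each chromosome keeps accesses to the BAM file roughly sequential, which tends to be kinder to disk I/O than jumping around at random.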

ADD COMMENT
1
4.0 years ago

miRanda for miRNA target prediction! There is a CUDA implementation available, but I don't think it's an option for everyone.

ADD COMMENT
1
4.0 years ago
  1. genome assembly tools (de novo)
  2. sequence alignment tools (such as bowtie)
ADD COMMENT
