Forum: Survey/Vote: If you could double the speed of any three commandline tools, which three would they be?
+7 · dhbradshaw (United States), 13 months ago, wrote:

Any thoughts? I'm posting here to look for bioinformatics tools that would benefit from a speed-up. But if more general command-line tools tend to be your bottlenecks, list those too.

If you want to list more or less than three, that's fine too, of course. If possible, prioritize them in order of impact on your daily life. I'm hoping this can be a resource for all those who, like me, might be interested in helping to speed up some tools.

modified 13 months ago by Puli Chandramouli Reddy (150) • written 13 months ago by dhbradshaw (130)
4

One thing to consider is that for lots of operations on "big data" sets, disk I/O is the limiting factor (or even network speed), so speeding up the code may not help!

written 13 months ago by Chris Miller (20k)
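A rough way to sanity-check that claim on your own machine (a toy sketch; `demo.dat` is a throwaway file created just for the comparison) is to compare the time to merely read a file with the time to read and process it:

```shell
# Create an 8 MB throwaway file for the comparison.
dd if=/dev/zero of=demo.dat bs=1M count=8 2>/dev/null

# Pure read: approximates the I/O floor for this file.
time cat demo.dat > /dev/null

# Read plus CPU work (gzip as a stand-in for a real processing step).
time gzip -c demo.dat > /dev/null
```

If the two wall times are close, the step is I/O-bound and faster code won't help much; if gzip dominates, the CPU is the bottleneck.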

Architecture speedups are just as interesting and challenging as code speedups. So maybe I/O bound applications are worth listing as well.

written 13 months ago by dhbradshaw (130)

But they can potentially cost significantly more :)

BTW: you must have something on your list; why not include those programs in your original post?

modified 13 months ago • written 13 months ago by genomax (59k)

I don't have anything on my list; that's the problem!

I'm a physicist turned coder (like half the world, it seems). Outside of Rosalind and Coursera, I've never done any bioinformatics. But I've learned that I like speeding things up and I thought it would be fun to find out where I could make a practical contribution using that interest.

However, if it turns out that there aren't any tools that people wish were faster, then that's useful data too.

written 13 months ago by dhbradshaw (130)
2

As @Chris has already indicated, a lot of high-throughput sequence data requires massive I/O, which ultimately turns out to be a bottleneck.

Many software packages probably contain routines that a master coder could optimize to run more efficiently (e.g. see this recent call from a fellow BioStar user: Call for requirement: a fast all-in-one FASTQ preprocessor (in C++, multi-threaded)). Variant callers take a long time to process data and may be good candidates (freebayes is one example).

modified 13 months ago • written 13 months ago by genomax (59k)
1

I/O has indeed become one of the biggest holdups. The community has given a lot of attention to CPU parallelisation in recent years, so we now have many functions with parallelisation enabled, and it's also easy to retrofit onto old single-thread/CPU code via forking. These days it's feasible to analyse multiple whole-genome samples in a single day.
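The forking approach can be sketched with plain xargs. This is a toy sketch: the sample names are made up, and echo stands in for the real single-threaded analysis command.

```shell
# Fork up to 3 worker processes, one per sample; 'echo' stands in for the
# actual single-threaded tool being parallelised.
printf 'sampleA\nsampleB\nsampleC\n' \
  | xargs -P 3 -I {} sh -c 'echo "processed {}"' > out.txt

# Completion order is nondeterministic under -P, so sort before displaying.
sort out.txt
```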

That said, it still takes >24 hours to back up 1.5 terabytes of data (my personal and work files since 2000).

As I'm Irish, I'll also throw some humour into the discussion: years ago (2013) I attempted to back up ~2 terabytes of data (whole-genome seq) from a local heavy-duty hard disk to an NFS 'storage' drive on the same network. Windows came back with a predicted completion time of 154 years, at which point my own mortality was instantly thrown into question.

written 13 months ago by Kevin Blighe (33k)

Thanks for these two leads.

written 13 months ago by dhbradshaw (130)

disk I/O is the limiting factor

That's just what bad programmers want you to believe. ;)

written 13 months ago by kloetzl (990)
1

dhbradshaw: As you can see, we are all over the place with these wish lists.

If you really are a master programmer, then your best bet is to have a developer of one of these tools contact you with very specific request(s) about accelerating parts of their code. Whether that will happen remains to be seen :-) I assume you have a limited amount of time to dedicate to this sort of thing.

modified 13 months ago • written 13 months ago by genomax (59k)

That path makes sense and is interesting.

written 13 months ago by dhbradshaw (130)

I haven't claimed and will not claim to be a master developer.

But I don't think that's what's called for. What's called for is someone who will focus on a specific problem, read and get advice, think through the algorithm candidates and the effects of scaling on speed, make measurements, and iterate their way toward greater speed. It's more about work than about being a master developer. It's more about doing than about being.

It's important to realize that, because if you don't, you miss out on opportunities to make contributions. And you don't grow toward "mastery" the way you otherwise would, either.

written 13 months ago by dhbradshaw (130)
1

You have clearly done this before and are pragmatic. Hopefully you will find an interesting project to work on among the several mentioned here. If you are starting fresh, something to do with variant calling/annotation would be immediately useful, as has been said elsewhere in this thread. Let us know if you need introductory materials on that topic.

written 13 months ago by genomax (59k)

Combining your feedback with that of Chris Miller and others points with a fairly high signal toward variant calling and annotation. So I'm on board.

And yes, please! I do indeed need introductory materials on everything variant related.

written 13 months ago by dhbradshaw (130)
1

Here is some material to get you started (sorry I missed this post). Video 1 and Video 2.

written 12 months ago by genomax (59k)

Do you have a preference for a language? Probably important to mention.

written 13 months ago by WouterDeCoster (35k)

Rust -- I'd probably rewrite anything from scratch. So maybe smaller is better to start :-D

written 13 months ago by dhbradshaw (130)

If possible, prioritize them in order of impact on your daily life.

If I could double or triple the speed or throughput of my workplace coffee machine, this certainly would have the greatest impact on my daily life and productivity.

written 12 months ago by h.mon (21k)

If you can find the source code for it then dhbradshaw will make it happen :)

written 12 months ago by genomax (59k)

I wish I had a 3D printer:

https://grabcad.com/library/coffee-machine-for-office-1

https://www.cadblocksfree.com/en/commercial-coffee-machine.html

And many more! I wonder if they really work.

written 12 months ago by h.mon (21k)
+4 · Chris Miller (Washington University in St. Louis, MO), 13 months ago, wrote:

In the interest of offering useful leads:

1) The neural networks used for MHC Class I and II epitope binding prediction. (sadly, most are not open source - see http://tools.iedb.org/mhci/ )

2) Variant annotation with VEP. It's an amazing tool, but is relatively slow.

3) Variant calling, especially for indels. Pindel, in particular, seems to often be a bottleneck in our pipelines.

modified 13 months ago • written 13 months ago by Chris Miller (20k)

Re 2): Illumina recently open-sourced Nirvana, which looks pretty promising. .NET is annoying, but it seems like a solid piece of software (and OSS should always be encouraged)!

written 13 months ago by tpoterba (30)

Really interesting code and applications here. The MHC binding application link is broken for me -- the URL needs the leading http://.

written 13 months ago by dhbradshaw (130)
+4 · 5heikki (Finland), 13 months ago, wrote:

In my pipelines, more often than not, GNU sort is the bottleneck.

written 13 months ago by 5heikki (7.9k)

I wondered whether a linear-time sorting algorithm like radix sort would improve that, or whether indexing and later randomly accessing the sorted lines are the bottleneck. I mean, formats like BED or VCF have nicely defined columns that need to be sorted most of the time.

written 13 months ago by Aerval (280)

I think with large files the main bottleneck is I/O.

written 13 months ago by 5heikki (7.9k)

Do you use all the tricks to speed up sort? See discussion on Any Quick Sort Method For Large Bed File As 20G In Size?.

modified 13 months ago • written 13 months ago by h.mon (21k)
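The usual sort speed-ups are forcing the C locale (which skips locale-aware collation), giving sort a bigger in-memory buffer, and letting it use several threads. A minimal runnable sketch, assuming GNU coreutils sort (--parallel and -S are GNU extensions):

```shell
# Tiny 3-line BED-like file for demonstration.
printf 'chr2\t50\nchr1\t200\nchr1\t30\n' > demo.bed

# C locale + 64 MB buffer + 4 threads; sort by chromosome, then start position.
LC_ALL=C sort --parallel=4 -S 64M -k1,1 -k2,2n demo.bed > demo.sorted.bed
cat demo.sorted.bed
```

On real 20 GB files the locale change alone often makes the biggest difference, since byte-wise comparison is much cheaper than collation.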
+2 · anu014 (India), 13 months ago, wrote:

I would like to name these:

  1. TrimGalore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore ) - No parallel processing
  2. Trinity (https://github.com/trinityrnaseq/trinityrnaseq/wiki )
  3. MAKER (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014 )

I wish de novo tools such as the last two were faster..

modified 13 months ago • written 13 months ago by anu014 (160)

Can you annotate this list a little bit for outsiders like me? Also, I'm trying to understand priorities here since it looks like 1 should be the priority but then you seem to call out the last 2. Or are you saying that there are other de novo tools that you would list as well?

written 13 months ago by dhbradshaw (130)
2

TrimGalore does quality trimming and adapter removal, but there are ultrafast, multithreaded alternatives out there, such as Skewer, so I don't think this is the best type of tool to go for.

written 13 months ago by ATpoint (11k)

Wow! It supports multi-threading. I wasn't aware of this tool. Thanks @ATpoint!

written 13 months ago by anu014 (160)

But doesn't it support only paired-end data? "Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-182)

written 13 months ago by anu014 (160)
1

No, it has a single-end mode, too. I think the authors wanted "paired" in the title because they claim it outperformed existing tools at the time in terms of paired-end accuracy.

written 13 months ago by ATpoint (11k)

Sure @dhbradshaw, I should have done that. Since we have an alternative for TrimGalore (Skewer), I'll omit it now.

Trinity is for de novo reconstruction of transcriptomes from RNA-seq data. You can specify threads and memory, but it still takes too much time even on a machine with 256 GB of memory and 32 cores: it took around 48-60 hours to process 600 GB of data.

MAKER is basically used for de novo annotation of newly sequenced genomes. It runs HMM modelling in the background; I guess that might be one of the reasons it is slow. Threading is available for it too..

written 13 months ago by anu014 (160)
+2 · ATpoint (Germany), 13 months ago, wrote:

SAMtools mpileup. It takes about 24 h for a 50X WGS when piped directly into VarScan on a 2.6 GHz Intel Xeon node. Multithreading would be a charm. The same goes for bam-readcount, in case one uses the VarScan fpfilter.
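A common workaround, since mpileup itself is single-threaded, is to run one process per chromosome (samtools mpileup accepts a region via -r) and merge the outputs afterwards. The sketch below shows only the forking pattern: a shell function stands in for the real samtools call so the pattern itself is runnable.

```shell
# Stand-in for 'samtools mpileup -r "$chr" -f ref.fa sample.bam'.
mpileup_stub() { echo "pileup for $1"; }

# One background process per chromosome.
for chr in chr1 chr2 chr3; do
  mpileup_stub "$chr" > "$chr.pileup" &
done
wait

# Merge the per-chromosome outputs in a fixed order.
cat chr1.pileup chr2.pileup chr3.pileup
```

Note this only helps if the node isn't already I/O-saturated, and downstream tools must be happy with per-region inputs or a concatenated result.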

written 13 months ago by ATpoint (11k)

Out of curiosity, how do you normally run mpileup? Piped into bcftools call? I recently got back to DNA-seq after four years away from it.

modified 13 months ago • written 13 months ago by Kevin Blighe (33k)
1

I use the mpileup-Varscan combination:

$SAMTOOLS mpileup -q 20 -Q 25 -B -d 1000 -f hg38.fa normal.bam tumor.bam | $VARSCAN somatic /dev/stdin ${OUTNAME} -mpileup --strand-filter 1 --output-vcf
written 13 months ago by ATpoint (11k)
+1 · Chris Miller (Washington University in St. Louis, MO), 13 months ago, wrote:

Here's another one: bam-readcount, which is fairly fast until you try to get counts/info for 3M SNPs from a WGS BAM file: https://github.com/genome/bam-readcount

written 13 months ago by Chris Miller (20k)
+1 · Vijay Lakhujani (India), 13 months ago, wrote:

miRanda, for miRNA target prediction! There is a CUDA implementation available, but I don't think everyone has access to it.

modified 13 months ago • written 13 months ago by Vijay Lakhujani (3.4k)
+1 · Puli Chandramouli Reddy (Pune, India), 13 months ago, wrote:
  1. genome assembly tools (de novo)
  2. sequence alignment tools (such as bowtie)
written 13 months ago by Puli Chandramouli Reddy (150)
Powered by Biostar version 2.3.0