Forum: Survey/Vote: If you could double the speed of any three commandline tools, which three would they be?
5
gravatar for dhbradshaw
14 days ago by
dhbradshaw110
United States
dhbradshaw110 wrote:

Any thoughts? I'm posting here to look for bioinformatics tools that would benefit from a speed up. But if it's more general command-line tools that tend to be bottlenecks, list those too.

If you want to list more or less than three, that's fine too, of course. If possible, prioritize them in order of impact on your daily life. I'm hoping this can be a resource for all those who, like me, might be interested in helping to speed up some tools.

ADD COMMENTlink modified 13 days ago by Puli Chandramouli Reddy120 • written 14 days ago by dhbradshaw110
4

One thing to consider is that for lots of operations on "big data" sets, disk I/O is the limiting factor (or even network speed), so speeding up the code may not help!
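A quick generic check (a sketch, not specific to any tool here): compare wall-clock time to CPU time with `time`. If `real` far exceeds `user` + `sys`, the process spent most of its run waiting on I/O, and faster code won't help much.

```shell
# Toy illustration: 'sleep' stands in for I/O wait. 'real' will be ~1s
# while 'user' and 'sys' stay near zero -- the signature of an I/O-bound job.
time sh -c 'sleep 1; echo done'
```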

ADD REPLYlink written 14 days ago by Chris Miller19k

Architecture speedups are just as interesting and challenging as code speedups. So maybe I/O bound applications are worth listing as well.

ADD REPLYlink written 14 days ago by dhbradshaw110

But they can potentially cost significantly more :)

BTW: you must have something on your list already; why not include those programs in your original post?

ADD REPLYlink modified 14 days ago • written 14 days ago by genomax37k

I don't have anything on my list; that's the problem!

I'm a physicist turned coder (like half the world, it seems). Outside of Rosalind and Coursera, I've never done any bioinformatics. But I've learned that I like speeding things up and I thought it would be fun to find out where I could make a practical contribution using that interest.

However, if it turns out that there aren't any tools that people wish were faster, then that's useful data too.

ADD REPLYlink written 14 days ago by dhbradshaw110
2

As @Chris has already indicated, a lot of high-throughput sequence data requires massive I/O, which ultimately turns out to be a bottleneck.

There are many software packages with routines that could benefit from optimization by a master coder to make them run more efficiently (e.g. see this recent call from a fellow BioStar user: Call for requirement: a fast all-in-one FASTQ preprocessor (in C++, multi-threaded)). Variant callers take a long time to process data and may be good candidates (FreeBayes is one example).

ADD REPLYlink modified 14 days ago • written 14 days ago by genomax37k
1

I/O has indeed become one of the biggest holdups. The community has given a lot of attention to CPU parallelisation in recent years, so we now have many functions with parallelisation enabled, and it's also easy to retrofit onto old single-thread/CPU code via forking. These days it's feasible to analyse multiple whole-genome samples in a single day.

That said, it still takes >24 hours to back up 1.5 terabytes of data (my personal and work files since 2000).

As I'm Irish, I'll also throw some humour into the discussion: years ago (2013) I attempted to back up ~2 terabytes of data from a local heavy-duty hard disk (whole-genome seq) to an NFS 'storage' drive on the same network. Windows came back with a predicted time of 154 years for it to complete, at which point my own mortality was instantly thrown into question.

ADD REPLYlink written 14 days ago by Kevin Blighe7.2k

Thanks for these two leads.

ADD REPLYlink written 14 days ago by dhbradshaw110

disk I/O is the limiting factor

That's just what bad programmers want you to believe. ;)

ADD REPLYlink written 13 days ago by kloetzl700
1

dhbradshaw : As you can see we are all over the place with these wish lists.

If you are really a master programmer, then your best bet is to have a developer of one of these tools contact you with very specific request(s) about accelerating parts of their code. Whether that will happen remains to be seen :-) I assume you have a limited amount of time available to dedicate to this sort of thing.

ADD REPLYlink modified 13 days ago • written 13 days ago by genomax37k

That path makes sense and is interesting.

ADD REPLYlink written 13 days ago by dhbradshaw110

I haven't claimed and will not claim to be a master developer.

But I don't think that's what's called for. What's called for is someone who will focus on a specific problem, read and get advice, think through the algorithm candidates and the effects of scaling on speed, make measurements, and iterate their way toward greater speed. It's more about work than about being a master developer. It's more about doing than about being.

It's important to realize that, because if you don't, you miss out on opportunities to make contributions. And you don't grow toward "mastery" like you otherwise would, either.

ADD REPLYlink written 13 days ago by dhbradshaw110
1

You have clearly done this before and are pragmatic. Hopefully you will be able to find an interesting project out of the several mentioned here to work on. If you are going to start new then something to do with variant calling/annotation would be immediately useful, as has been said elsewhere here. Let us know if you need introductory materials on that topic.

ADD REPLYlink written 13 days ago by genomax37k

Combining your feedback with that of Chris Miller and others, the signal points fairly strongly toward variant calling and annotation. So I'm on board.

And yes, please! I do indeed need introductory materials on everything variant related.

ADD REPLYlink written 8 days ago by dhbradshaw110

Do you have a preference for a language? Probably important to mention.

ADD REPLYlink written 14 days ago by WouterDeCoster23k

Rust -- I'd probably rewrite anything from scratch. So maybe smaller is better to start :-D

ADD REPLYlink written 14 days ago by dhbradshaw110
4
gravatar for Chris Miller
14 days ago by
Chris Miller19k
Washington University in St. Louis, MO
Chris Miller19k wrote:

In the interest of offering useful leads:

1) The neural networks used for MHC Class I and II epitope binding prediction. (sadly, most are not open source - see http://tools.iedb.org/mhci/ )

2) Variant annotation with VEP. It's an amazing tool, but is relatively slow.

3) Variant calling, especially for indels. Pindel, in particular, seems to often be a bottleneck in our pipelines.

ADD COMMENTlink modified 14 days ago • written 14 days ago by Chris Miller19k

re 2), Illumina recently open-sourced Nirvana, which looks pretty promising. .NET is annoying, but it seems like a solid piece of software (and OSS should always be encouraged)!

ADD REPLYlink written 14 days ago by tpoterba20

Really interesting code and applications here. The MHC binding application link is broken for me -- the URL needs the leading http://.

ADD REPLYlink written 14 days ago by dhbradshaw110
4
gravatar for 5heikki
14 days ago by
5heikki6.9k
Finland
5heikki6.9k wrote:

In my pipelines, more often than not, GNU sort is the bottleneck.

ADD COMMENTlink written 14 days ago by 5heikki6.9k

I wondered: would a linear-time sorting algorithm like radix sort improve that, or are indexing and later randomly accessing the sorted lines the bottleneck? I mean, formats like BED or VCF have nicely defined columns that need to be sorted most of the time.

ADD REPLYlink written 14 days ago by Aerval240

I think with large files the main bottleneck is I/O.

ADD REPLYlink written 13 days ago by 5heikki6.9k

Do you use all the tricks to speed up sort? See discussion on Any Quick Sort Method For Large Bed File As 20G In Size?.
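For reference, a minimal sketch of the usual GNU sort speedups, on a toy BED-like file (the flags are standard GNU coreutils options; buffer size and thread count here are arbitrary):

```shell
# Toy BED-like file: sort by chromosome, then numerically by start position.
printf 'chr2\t100\nchr1\t200\nchr1\t50\n' > toy.bed

# LC_ALL=C     : byte-wise comparison instead of slow locale-aware collation
# --parallel=4 : use up to 4 sorting threads (GNU coreutils)
# -S 64M       : larger in-memory buffer, fewer temp-file merge passes
LC_ALL=C sort --parallel=4 -S 64M -k1,1 -k2,2n toy.bed
# -> chr1 50, chr1 200, chr2 100
```

On real 20G+ files, `LC_ALL=C` alone often gives the largest single win.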

ADD REPLYlink modified 13 days ago • written 13 days ago by h.mon9.2k
2
gravatar for anu014
14 days ago by
anu014120
India
anu014120 wrote:

I would like to name these:

  1. TrimGalore (https://www.bioinformatics.babraham.ac.uk/projects/trim_galore ) - No parallel processing
  2. Trinity (https://github.com/trinityrnaseq/trinityrnaseq/wiki )
  3. MAKER (http://weatherby.genetics.utah.edu/MAKER/wiki/index.php/MAKER_Tutorial_for_GMOD_Online_Training_2014 )

I wish de novo tools such as the last two were faster.

ADD COMMENTlink modified 14 days ago • written 14 days ago by anu014120

Can you annotate this list a little bit for outsiders like me? Also, I'm trying to understand priorities here since it looks like 1 should be the priority but then you seem to call out the last 2. Or are you saying that there are other de novo tools that you would list as well?

ADD REPLYlink written 14 days ago by dhbradshaw110
2

TrimGalore does quality trimming and adapter removal, but there are ultrafast and multithreaded alternatives out there, such as Skewer, so I think this is not the best type of tool to go for.

ADD REPLYlink written 13 days ago by ATPoint2.4k

Wow! It supports multi-threading. I wasn't aware of this tool. Thanks @ATPoint!

ADD REPLYlink written 13 days ago by anu014120

But doesn't it support only paired-end data? "Skewer: a fast and accurate adapter trimmer for next-generation sequencing paired-end reads" (https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-15-182)

ADD REPLYlink written 13 days ago by anu014120
1

No, it has a single-end mode, too. I think the authors wanted "paired" in the title, as they claim that it outperformed existing tools at that time in terms of paired-end accuracy.

ADD REPLYlink written 13 days ago by ATPoint2.4k

Sure @dhbradshaw, I should have done that. Since we now have an alternative for TrimGalore (Skewer), I would like to omit it.

Trinity is for de novo reconstruction of transcriptomes from RNA-seq data. We can specify threads & memory for it, but it still takes too much time even on a machine with 256GB memory & 32 cores: around 48-60 hrs to process 600GB of data.

MAKER is basically used for de novo annotation of newly sequenced genomes. It runs HMM modelling in the background; I guess that might be one of the reasons it is slow. Threading is available for this too.

ADD REPLYlink written 13 days ago by anu014120
2
gravatar for ATPoint
13 days ago by
ATPoint2.4k
Germany
ATPoint2.4k wrote:

SAMtools mpileup. It takes about 24h for a 50X WGS when piped directly into VarScan on a 2.6GHz Intel Xeon node. Multithreading would work a charm. The same goes for bam-readcount, in case one uses the VarScan fpfilter.

ADD COMMENTlink written 13 days ago by ATPoint2.4k

Out of curiosity, how do you normally run mpileup - piped into bcftools call? I recently got back to DNA-seq after 4 years away from it.

ADD REPLYlink modified 13 days ago • written 13 days ago by Kevin Blighe7.2k
1

I use the mpileup-Varscan combination:

$SAMTOOLS mpileup -q 20 -Q 25 -B -d 1000 -f hg38.fa normal.bam tumor.bam | $VARSCAN somatic /dev/stdin ${OUTNAME} -mpileup --strand-filter 1 --output-vcf
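Since pileup is independent per region, one common workaround for mpileup being single-threaded (a sketch only: the region list, file names, and job count are assumptions, and the real per-region pipeline appears only in the comment) is to fan out one job per chromosome with `xargs -P`:

```shell
# Fan-out pattern: one job per region, up to 4 running at a time (-P 4).
# With real data, the stand-in 'echo' job for region {} would be e.g.:
#   samtools mpileup -q 20 -Q 25 -B -f hg38.fa -r {} normal.bam tumor.bam > {}.pileup
printf 'chr%s\n' 1 2 3 X \
  | xargs -P 4 -I{} sh -c 'echo "pileup for {} done"' \
  | sort
```

The per-region pileups (or VarScan outputs) then need a final merge/concatenation step.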
ADD REPLYlink written 13 days ago by ATPoint2.4k
1
gravatar for Chris Miller
13 days ago by
Chris Miller19k
Washington University in St. Louis, MO
Chris Miller19k wrote:

Here's another one: bam-readcount, which is fairly fast until you try to get counts/info for 3M SNPs from a WGS BAM file: https://github.com/genome/bam-readcount

ADD COMMENTlink written 13 days ago by Chris Miller19k
1
gravatar for Vijay Lakhujani
13 days ago by
Vijay Lakhujani1.3k
India
Vijay Lakhujani1.3k wrote:

miRanda for miRNA target prediction! There is a CUDA implementation available, but that's not an option for everyone.

ADD COMMENTlink modified 13 days ago • written 13 days ago by Vijay Lakhujani1.3k
1
gravatar for Puli Chandramouli Reddy
13 days ago by
Pune, India
Puli Chandramouli Reddy120 wrote:
  1. genome assembly tools (de novo)
  2. sequence alignment tools (such as bowtie)
ADD COMMENTlink written 13 days ago by Puli Chandramouli Reddy120
Powered by Biostar version 2.3.0