Algorithm Accuracy vs. Compute Requirements
13.3 years ago
Monzoor ▴ 300

A lot of algorithms developed for NGS data analysis (especially metagenomics) need huge compute power. Moreover, there seems to be a race to develop algorithms for clusters, parallel versions, etc.

Though I agree that compute power is no longer astronomically expensive, is it not true that there is minimal effort to develop algorithms that can analyze millions of sequences on a modest desktop (say, a system with 2 GB of RAM)?

Where have all the computer scientists (classic algorithm developers) gone? I remember a client of mine saying "Forget the compute requirements, just ensure that the results are fine".

algorithm hardware • 3.7k views

As far as I know, TCS (theoretical computer science), and research in algorithms especially, is as healthy as ever. Whether people in industry care about its results, or whether researchers themselves are sufficiently inclined to bridge the gap between theory and practice, is another question.


Sure, Eric. Got your message. Will be careful next time. Anyway, do you know of any nice fora where I can ask about, discuss, and get perspectives on such topics? Thanks in advance.


From my conversations with others, people who generate results for industry do not care too much about results, but those (e.g. service providers) who generate results for academia do care. In addition, it is worth noting that most successful/popular programs have bridged the gap between theory and practice.

13.3 years ago
Gww ★ 2.7k

There are definitely efforts out there to build efficient algorithms with minimal hardware requirements. One great example is Bowtie: Langmead et al. put a lot of effort into optimizing their alignment algorithm for a minimal memory footprint. The paper really doesn't do justice to the amount of optimization they put in (which is very unfortunate); just browse through the source code a bit and you'll see. In fact, just look at the evolution of next-generation sequence aligners: they went from quite slow hash-based designs to the Burrows-Wheeler transform (drawing on some very interesting theoretical computer science).
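
To give a concrete taste of why the BWT matters here, below is a minimal sketch of FM-index backward search, the exact-matching core that Bowtie-style aligners build on. The toy genome, helper names, and table layout are illustrative only; a real aligner adds compressed rank structures, a sampled suffix array to recover positions, and mismatch handling on top of this.

```python
def bwt(text):
    """Burrows-Wheeler transform via a full rotation sort (fine for toy input)."""
    text += "$"  # unique sentinel, lexicographically smallest symbol
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

def fm_tables(bwt_str):
    """Precompute C[c] (count of symbols < c) and Occ(c, i) prefix counts."""
    occ = {c: [0] for c in set(bwt_str)}
    for ch in bwt_str:
        for c in occ:
            occ[c].append(occ[c][-1] + (1 if c == ch else 0))
    c_table, running = {}, 0
    for c in sorted(occ):
        c_table[c] = running
        running += occ[c][-1]
    return c_table, occ

def count_matches(pattern, bwt_str):
    """Backward search: one interval update per pattern symbol."""
    c_table, occ = fm_tables(bwt_str)
    lo, hi = 0, len(bwt_str)
    for ch in reversed(pattern):
        if ch not in c_table:
            return 0
        lo = c_table[ch] + occ[ch][lo]
        hi = c_table[ch] + occ[ch][hi]
        if lo >= hi:
            return 0
    return hi - lo  # size of the suffix-array interval = number of hits

print(count_matches("GATTA", bwt("GATTACAGATTACA")))  # -> 2
```

The toy Occ table here is O(n·σ), but the whole point of production FM-indexes is that rank queries over the BWT can be answered from a structure taking only a small fraction of the text size, which is roughly how a whole human genome index fits in a couple of gigabytes of RAM.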

Another great example is the HMMER tool from the Eddy lab. HMMER3 is very fast compared to previous versions, and I would assume it took quite a while to get their heuristic optimizations tested and functioning properly.

Perhaps the biggest problem is that you are being impatient. Next-generation sequencing focused on metagenomics is a new field, and I'm sure innovative algorithms will be developed in the future. Optimization can be very difficult to figure out; then there is the testing and all of the proofs needed to show that an optimization is indeed useful, plus the manuscript has to be prepared, submitted, reviewed, and so forth. So give it time: research can take a while, but things will improve.


Actually, HMMER3 and Bowtie are opposite examples: HMMER3 is both more accurate and faster than HMMER2, while Bowtie trades accuracy for speed. In addition, BWT-based algorithms are not inherently superior to hash-table-based ones. Stampy, Novoalign, GSNAP, MOSAIK, and even the old MAQ and ELAND are slower than Bowtie because they are more accurate or have additional features (e.g. gapped alignment).
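
For contrast with the BWT sketch above, here is a minimal sketch of the seed-lookup core that hash-table-based mappers are built around: index every k-mer of the reference, then use a read's leading k-mer to get candidate positions to verify. The seed length, function names, and toy reference are mine; real mappers like MAQ or ELAND add spaced seeds, seed extension, and mismatch tolerance.

```python
from collections import defaultdict

K = 4  # toy seed length; real mappers use much longer (often spaced) seeds

def build_index(reference):
    """Hash every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - K + 1):
        index[reference[i:i + K]].append(i)
    return index

def seed_hits(read, index):
    """Candidate mapping positions from the read's first k-mer; each hit
    anchors an alignment that a real mapper would verify by extension."""
    return index.get(read[:K], [])

ref = "GATTACAGATTACA"
print(seed_hits("GATTAC", build_index(ref)))  # -> [0, 7]
```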


Sorry, I should be more specific. When I say HMMER3 is both faster and more accurate, I mean two things: a) without heuristics, HMMER3 is faster and just as accurate; b) the default setting of HMMER3 is more accurate because of the forward algorithm. It is possible that a calibrated HMMER2 using the forward algorithm is more accurate, although I have not seen a formal benchmark.
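
For readers unfamiliar with the distinction: the forward algorithm sums probability over all state paths, whereas Viterbi keeps only the single best path, which is why the forward score can be more sensitive. A minimal sketch with a made-up two-state HMM follows; the parameters are arbitrary and have nothing to do with HMMER's actual profile models.

```python
import numpy as np

start = np.array([0.5, 0.5])                # initial state distribution
trans = np.array([[0.9, 0.1],               # trans[i][j] = P(next=j | now=i)
                  [0.2, 0.8]])
emit = np.array([[0.7, 0.1, 0.1, 0.1],      # emit[i][s] = P(symbol s | state i)
                 [0.25, 0.25, 0.25, 0.25]]) # symbol coding: A=0, C=1, G=2, T=3

def forward(obs):
    """P(obs | model), summed over every possible state path."""
    alpha = start * emit[:, obs[0]]          # alpha[i] = P(prefix, ends in i)
    for sym in obs[1:]:
        alpha = (alpha @ trans) * emit[:, sym]
    return alpha.sum()

print(forward([0, 0, 2, 3]))  # P("AAGT" | model)
# Production code works in log space (or rescales alpha) to avoid underflow.
```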


I disagree that HMMER3 is faster and more accurate than HMMER2. Also, I wasn't commenting on Bowtie's accuracy; the poster was asking about speed and about algorithm research on running these tools with minimal hardware requirements.


A Google search brings me to this page. Nearly two years ago, HMMER3 was "much more sensitive than HMMER2" overall, although this does not address specificity.

13.3 years ago
lh3 33k

In my experience, algorithm development is mostly about choosing a balance between features, accuracy, speed, and memory: for every gain in one of these aspects, you lose in another. It is nearly impossible for a program to beat the others on all metrics, so all you should do is choose the balance point you feel comfortable with. For example, for ChIP-seq and CNV discovery you do not need great accuracy, so Bowtie would be the best choice; for the discovery of structural variations, you have to use a slower mapper that is very accurate.

However, there are exceptions. Very occasionally, smart scientists achieve breakthroughs where a new algorithm excels in all aspects: features, accuracy, speed, and memory. There are few such breakthroughs in the area I am familiar with, sequence alignment; all I can think of are sse2-sw, HMMER3, and BWT-SW.
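
For context on sse2-sw, which I take to mean SSE2-vectorized Smith-Waterman implementations such as Farrar's striped method: what gets vectorized is the scalar local-alignment dynamic program sketched below. The scoring parameters here are arbitrary, and real tools use affine gap penalties rather than this flat gap cost.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b: the O(mn) dynamic
    program that SIMD implementations accelerate. Cell scores are
    clamped at 0, which is what makes the alignment local."""
    H = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    best = 0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman("GATTACA", "GCATGCA"))
```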

As for metagenomics, my feeling is that it has not attracted enough top programmers (I saw your previous question, but I agree with others that this is not true). Unless you have a really talented scientist in the field, competition between multiple groups is necessary; the boom in short-read mappers and the recent improvements to JavaScript engines are among the best examples.

In addition, you should not always look for programs that can run on a desktop computer. As I said above, you often pay with something else, mostly accuracy, as the trade-off. Most successful programs choose a balance point that fits a moderate workstation or server.

EDIT: If I had to work on a metagenomics project, I would have two choices: a) write a quick-and-dirty pipeline integrating existing software developed for other purposes; or b) spend most of my time on a new package tuned for the project. If I felt metagenomics was the future and everyone would go for it within a year, I would choose b); if I felt this was only a small field, I would choose a), because b) would not be worth the time. I believe many developers/programmers think the same way.

13.3 years ago
Bilouweb ★ 1.1k

The question is: why do such algorithms need huge compute power?

If it is just a question of data size, then you can focus on accuracy.

But when the problem itself is NP-complete (or NP-hard), and I think that is the case for some NGS problems, then the complexity of an exact solution is more than huge!

If your dataset is of size N, the number of operations needed to find the exact solution (with accuracy as the focus) can be on the order of N!. That is far too many operations, unless you have billions of years.
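
To put numbers on that growth, a quick check in Python:

```python
import math

# Factorial growth makes exhaustive search hopeless almost immediately.
for n in (10, 20, 30):
    print(n, math.factorial(n))
# 10 ->            3628800   (fine)
# 20 -> ~2.4 * 10**18        (already absurd)
# 30 -> ~2.7 * 10**32        (hopeless on any hardware)
```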

In my work, I had a problem with 10^77 possible solutions to explore and no simple way to cut the search space. Someone said to me: "but on a big cluster, that is not a problem". In fact, it is a problem: even with a million computers, each one is still left with about 10^71 solutions. That is still far too much.

That is why we use heuristic algorithms to find near-optimal solutions.
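
As a toy illustration of that trade-off, the sketch below compares exhaustive search over all orderings with a simple nearest-neighbor heuristic on a small traveling-salesman-style instance. The instance and the heuristic are purely illustrative.

```python
import itertools, math, random

random.seed(0)
cities = [(random.random(), random.random()) for _ in range(8)]

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def tour_length(order):
    return sum(dist(cities[order[i]], cities[order[(i + 1) % len(order)]])
               for i in range(len(order)))

# Exact: fix city 0 and try all 7! = 5040 remaining orderings.
# Feasible at N = 8; utterly hopeless for the 10^77-sized spaces above.
exact = min(tour_length((0,) + p)
            for p in itertools.permutations(range(1, len(cities))))

# Heuristic: always hop to the nearest unvisited city, in O(N^2) total.
unvisited, order = set(range(1, len(cities))), [0]
while unvisited:
    nxt = min(unvisited, key=lambda c: dist(cities[order[-1]], cities[c]))
    order.append(nxt)
    unvisited.remove(nxt)

print(f"exact optimum: {exact:.3f}  heuristic tour: {tour_length(order):.3f}")
```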


An algorithm is not NP-complete; a problem is.


I agree. Nevertheless, I still find that scientists today are not interested in developing algorithms (especially for NGS data) that can smartly sift through thousands of candidate solutions and give the end user a near-optimal solution using minimal compute power.


Perhaps it is because when you come from the biology side you are interested in the result, and when you come from the computer science side you are interested in the algorithms... The key is somewhere in the middle ;) But I am interested in exactly this subject: solving hard computational problems in bioinformatics using the fastest algorithms.


That is true... I have edited my message ;)
