Estimate running time of a orthoFinder
2
0
Entering edit mode
6 weeks ago
dago ★ 2.7k

Hi All, I would like to run orthoFInder on ~2000 bacterial genomes. The authors report in the paper an estimated run-time for up to 256 fungal genomes. So, I believe this is not really applicable to my case. Hence, I was wondering whether someone has a good tip to estimate how long would this program (or more in general any proram) run.

Thank you

gene bacteria genome • 247 views
0
Entering edit mode

As programs might not scale linearily you could test by running with 1,2,3,4, etc genomes and then derive a simple equation to predict runtime.

0
Entering edit mode

This will highly likely depend a lot on the computer infrastructure that you have access to, and the system of the authors. RAM, CPU, disk speed, maybe network...

0
Entering edit mode

A: OrthoFinder running time one of the developer mentions that ability to use DIAMOND speeds things up significantly.

3
Entering edit mode
6 weeks ago

Some theoretical considerations will help to clarify this problem.

It is relatively straight forward to deduce that the problem complexity class is not linear but quadratic over the number of genomes n, or with some simplifications: Given g is proportional to genome size, and T_p(g) is the complexity of an algorithm comparing each pair of genomes, and all n genomes are of equal size, then an algorithm must compare all genomes against all other genomes, or T_o(n) = O(n^2) * T_p(g). Because a single pairwise comparison costs the same and genome sizes are fixed for our set, we can think of it as T_p(g) as a constant and forget about it for now: T(n) = O(n^2) Of course, this constant factor is what makes the main difference in runtimes and there is some large variability from implementation etc. We also abstract from the possibility that a heuristic could not need to look at all pairs completely.

Figure 4 M in the orthoPaper shows this quite well, note that both axes are of logarithmic scale. For example, if the problem is quadratic, for every doubling of n, the run-time quadruples or T(2n) ~ 4T(n). This is of course very approximate, whatever one algorithm has as an advantage might be gained by its heuristics, and in reality, not all genomes have equal size.

So, what does that mean for your example? 2000 ~ 2048 . For 2048 genomes, the total time could be about 64 times that of 256 genomes. With the running time of the paper that would give ~ 115 days. You might gain a little on the genome size dependent factor. Therefore it is important to measure run time by always doubling the input size (2,4,8,16) in the same way as in Figure 4 M. In fact you could simply do linear regression on the log-transformed values and extrapolate run-times using both log-scales.

0
Entering edit mode

It is surprising that no information about the hardware used is included in the paper. Only thing I could find was standard computer hardware without any specifics.

If money was no object then one could think of doing this analysis in parallel in cloud to speed things up substantially. Would require some benchmarking since the original paper is thin on hardware specifics.

0
Entering edit mode

The authors talk about using 16 parallel threads repeatedly. So I am guessing that they had a single 8-core CPU like an Intel Core i7 or i9. If the tasks were perfectly parallelizable, then still because of the quadratic nature of the problem one needs to overspend on the CPU side: one would need 64 times the CPU resources (8*64=512) from the paper to manage the only 8-fold larger problem in the same time. So it is essential to bring n down also, if possible.

To cluster the 2000 genomes and to combine very similar sequences could help.

2
Entering edit mode
6 weeks ago
Mensur Dlakic ★ 10k

The estimate for 256 fungal genomes is not good not only because the number of genomes and their sizes don't match to yours, but also because you most likely have a different computer configuration. Still, I think that as a ballpark value you can safely go with 5-10x of what they determined.

As @ATpoint told you, the way to do this properly is to take a fraction of data, and try to scale it to the whole dataset. The problem is that just any number of genomes will not do, especially if they have significantly different sizes. This is to say that you should try to select a good representatives of the whole dataset - not five smallest or five largest, but a number that is closer to the median size. I would not go with anything less than a 100 to estimate a dataset that has ~2000 genomes. Maybe try 10, 20, 50 and 100 that have good size representation, and see whether the scaling is linear. At that point you would have done less than 10% of all genomes as it relates to time investment, which I think is acceptable for a job of this size. You may realize that a faster computer or indeed a computer cluster is the only way to get this done reasonably fast.

Separately, I would guess that many of the genomes are similar, and some of them likely to be near-identical. You would considerably lighten the computational burden by clustering the genomes beforehand into a smaller subset. Chances are that whatever results you get for any given genome, it can be safely assumed that a different genome with similar size and 99% identity would yield the same result.