Question: Benchmarking bioinformatics tools
3.1 years ago by
h.mon19k
Brazil
h.mon19k wrote:

I want to profile / benchmark bioinformatics tools in general (CPU time, memory usage, disk usage), and specifically to do so accurately for pipelines, where a "master" script makes many calls to other programs. Currently I am using GNU time, but I do not know how reliable it is in these cases.

Are there other tools / procedures suggestions?

P.S.: I know this is not strictly speaking a bioinformatics question, but I believe it is of general interest to the community. If moderators disagree, close the question without mercy.

edit: removed Profiling from title and text, it is not (upon learning what "profiling a software" means) what I want.


Valgrind massif can analyze heap and stack memory usage:

$ valgrind --tool=massif --stacks=yes some-binary ...

It is also useful for debugging memory leaks in C applications.

It can slow down the application runtime considerably, however, so you would not want to run this in conjunction with time benchmark tests.

ADD REPLYlink written 3.1 years ago by Alex Reynolds25k

I plan to perform the benchmarks on a regular basis, so Valgrind isn't a good choice due to the slowdown. In addition, it seems very complex, and I just need some quick and reasonable benchmarks.

I actually learnt (thanks to a great response which however is not showing here) that I probably misused "profiling", I intend to just benchmark stuff.

ADD REPLYlink written 3.1 years ago by h.mon19k

I use memtime to capture memory and time information. It appears to capture time correctly even when local subprocesses are launched, but memory does not appear to be captured correctly in that case, so a shell-script launcher just shows negligible memory usage. Also, with Java programs, it's really hard to figure out how much memory they actually *need* except by trial and error, since they use however much you tell the JVM to use. For that reason I've modified some of my programs to print how much memory the JVM thinks it's using, but that's still not quite correct, and hardly valid in a benchmark anyway.

And disk I/O is really difficult, since for example all I really care about are reads/writes to the network FS, but most tools I've seen that try to track I/O lump that together with cached reads, local disk, and even with pipes (e.g. "cat x | gzip > y" would measure at least double the I/O that it should).

So if anyone has a good universal solution, I'm interested as well.
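
Not a universal solution, but on Linux the per-process counters in /proc/<pid>/io at least separate the two kinds of I/O mentioned above: rchar and wchar count every byte passed through read/write system calls (pipes and cached reads included), while read_bytes and write_bytes count bytes the process caused to be fetched from or sent to the storage layer. A minimal sketch, assuming a kernel with I/O accounting enabled and permission to read the file; io_field and io_report are hypothetical helper names, and how a network FS shows up in these counters is something to verify on your own system:

```shell
#!/bin/bash
# io_field: hypothetical helper. Reads /proc/<pid>/io-style text on stdin
# and prints the value of one named counter (e.g. read_bytes).
io_field() {
    awk -v key="$1:" '$1 == key { print $2 }'
}

# io_report: hypothetical helper. Prints syscall-level (rchar/wchar) next to
# storage-level (read_bytes/write_bytes) counters for a running process.
io_report() {
    local pid=$1 io
    io=$(cat "/proc/$pid/io") || return 1
    echo "rchar=$(io_field rchar <<< "$io") read_bytes=$(io_field read_bytes <<< "$io")"
    echo "wchar=$(io_field wchar <<< "$io") write_bytes=$(io_field write_bytes <<< "$io")"
}
```

Calling io_report with the PID of a running job (for the current shell, io_report $$) and comparing the two columns shows how much of the apparent I/O never reached a disk.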

ADD REPLYlink modified 3.1 years ago • written 3.1 years ago by Brian Bushnell15k

If you test anything involving disk I/O, clear caches in between tests. On Linux, for instance:

http://unix.stackexchange.com/a/87909

Failing to clear caches is a common way for developers to claim (dishonestly) that their binaries run faster, when in fact they don't.
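
As a rough sketch of what clearing caches between runs can look like on Linux: flush dirty pages with sync, then write 3 to /proc/sys/vm/drop_caches to drop the page cache plus dentries and inodes. This needs root. The function name drop_caches and the DRY_RUN switch are my own additions for illustration, not part of any tool:

```shell
#!/bin/bash
# drop_caches: hypothetical helper for Linux. Flushes dirty pages to disk, then
# asks the kernel to drop the page cache, dentries and inodes. Needs root.
# With DRY_RUN=1 it only prints the command it would run.
drop_caches() {
    local target=/proc/sys/vm/drop_caches
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "sync && echo 3 > $target"
        return 0
    fi
    sync && echo 3 > "$target"
}
```

Run it as root before each timed run (DRY_RUN=1 drop_caches shows the command without executing it), so that every run starts with a cold cache.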

ADD REPLYlink written 3.1 years ago by Alex Reynolds25k

Which memtime do you use? I found three: this very old one; this git which seems to be a continuation of the first link; and this git, a perl script.

Yes, benchmarking Java applications is a pain. I recently benchmarked ECC (using GNU time) with -Xmx4g, -Xmx8g, -Xmx16g, -Xmx32g and -Xmx64g. In all cases the benchmark reported that all memory given to the JVM was used (at least at some point), but running times were very similar. I wonder: will the JVM release some memory if a competing process needs it? In this particular case, 4 GB seems to be enough for ECC and more is just waste, so would the JVM be kind and use only what is necessary?

ADD REPLYlink written 3.1 years ago by h.mon19k

We use the version at the first link, http://www.update.uu.se/~johanb/memtime/.

As for ECC...  it's a bit unusual.  Most Java programs will immediately grab virtual memory as specified by -Xmx but only use physical memory as needed.  But ECC stores kmer counts in a count-min sketch, which is inefficient to resize dynamically, so that program allocates all possible physical memory immediately (or if using a prefilter, eventually) even if the input is only 1 read.  ECC will never run out of memory, rather the accuracy is reduced as the structures become increasingly full, so a higher value of -Xmx allows increased accuracy.  As a result, I recommend running it with all available memory.  But for a bacterial isolate, -Xmx1g is probably sufficient for good accuracy.  High accuracy is important for error-correction, but not very important for normalization, which the tool also does.

Tadpole, on the other hand, also does error-correction.  It stores kmer counts exactly and as a result can run out of memory if the input is too big and -Xmx is too low.  But, it uses resizable hashtables, and as a result the physical memory consumption will grow over time, only as needed.

ADD REPLYlink written 3.1 years ago by Brian Bushnell15k
3.1 years ago by
kloetzl970
European Union
kloetzl970 wrote:

I also use GNU time with just a small wrapper script for convenience.

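#!/bin/bash
# Run a command N times under GNU time; report mean ± sample standard deviation.
N=10
if [ "$1" = "-n" ]; then
    N=$2
    shift 2
fi

for ((a = 1; a <= N; a++))
do
    # %e = elapsed seconds, %M = maximum resident set size (KB), %P = CPU percentage
    (/usr/bin/time -f "%e %M %P" "$@" > /dev/null) 2>&1 | tail -n 1
done | awk '{
    at += $1; t[NR] = $1    # elapsed time
    as += $2; s[NR] = $2    # peak memory
    ap += $3; p[NR] = $3    # CPU usage; awk reads "97%" as 97
} END {
    at /= NR; as /= NR; ap /= NR
    for (i = 1; i <= NR; i++) {
        b = t[i] - at; sdt += b * b
        b = s[i] - as; sds += b * b
        b = p[i] - ap; sdp += b * b
    }
    d = (NR > 1) ? NR - 1 : 1    # avoid division by zero when N is 1
    print "time (seconds):", at, "±", sqrt(sdt / d)
    print "mem (kbyte):", as, "±", sqrt(sds / d)
    print "cpu usage:", ap, "±", sqrt(sdp / d)
}'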
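
For a quick cross-check of the timing numbers on a machine without GNU time, a minimal sketch using bash's built-in time keyword could look like the following. bench_cpu is a hypothetical helper name. The user and sys figures that bash reports include the CPU time of child processes the command waited for, which is the relevant case for a "master" script; it does not report memory.

```shell
#!/bin/bash
# bench_cpu: hypothetical helper. Prints "real user sys" (seconds) for a command,
# including CPU time of the child processes that the command waited for.
bench_cpu() {
    local TIMEFORMAT='%R %U %S'
    # The time keyword writes its report to the stderr of the enclosing group;
    # the output of the command itself is discarded.
    { time "$@" > /dev/null 2>&1; } 2>&1
}
```

Usage follows the wrapper above: put bench_cpu in front of the command line to be measured, for example bench_cpu bash pipeline.sh, where pipeline.sh stands in for your own script.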
3.1 years ago by
Ying W3.8k
South San Francisco, CA
Ying W3.8k wrote:

If you are able to dedicate an entire node to the task, run nmon for CPU and memory logging.

Disk is very difficult to log, see: https://helgeklein.com/blog/2013/03/the-impossibility-of-measuring-iops-correctly/

3.1 years ago by
h.mon19k
Brazil
h.mon19k wrote:

Thanks for all the answers. In addition to the alternatives pointed out, I found pyrunlim, which also seems to be a good option. It took me a while to get it to work, as it uses an old psutil API. The psutil API changes are well documented and very simple, though, so I was able to get it working (of course, using an old psutil, e.g. 2.1.1, with the original pyrunlim script will also work).

I am currently testing the alternatives to see which one fits my needs better, and possibly to validate the results against each other.
