I want to benchmark bioinformatics tools in general (CPU time, memory usage, disk usage), and specifically to do so accurately for pipelines, where a "master" script makes lots of calls to other programs. Currently I am using GNU time, but I do not know how reliable it is for these cases.
Are there other tools / procedures suggestions?
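For reference, this is roughly how I am using it now (the `gzip` step and the `/tmp` paths are just placeholders for a real pipeline). As far as I understand, GNU time sums CPU time over all children, but its "Maximum resident set size" is the peak of the largest single waited-for process, not the total across the pipeline:

```shell
# Wrap the whole pipeline in one shell so a single GNU time call sees every
# child process. -v is GNU time's verbose flag (the bash builtin "time" does
# not accept it); the gzip step is just a stand-in for a real pipeline.
# "|| true" keeps this sketch harmless on systems without GNU time installed.
/usr/bin/time -v sh -c 'gzip -c /etc/hosts > /tmp/bench.out.gz' \
    2> /tmp/bench.time.log || true
```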
P.S.: I know this is not strictly speaking a bioinformatics question, but I believe it is of general interest to the community. If moderators disagree, close the question without mercy.
edit: removed Profiling from title and text, it is not (upon learning what "profiling a software" means) what I want.
Valgrind's massif tool can analyze heap and stack memory usage.
It is also useful for debugging memory leaks in C applications.
It can slow down the application runtime considerably, however, so you would not want to run this in conjunction with time benchmark tests.
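A minimal sketch of a massif run (here `ls` stands in for the tool you actually want to measure; `--stacks=yes` enables stack profiling at extra cost, and `ms_print` renders the recorded snapshots as a text graph):

```shell
# Guarded so the sketch is a no-op where valgrind isn't installed.
if command -v valgrind >/dev/null 2>&1; then
    # Writes massif.out.<pid> while running the target (ls is a placeholder).
    valgrind --tool=massif --stacks=yes ls >/dev/null
    # Render the heap/stack snapshots as a text graph plus allocation trees.
    ms_print massif.out.* | head -n 30
fi
```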
I plan to perform the benchmarks on a regular basis, so Valgrind isn't a good choice due to the slowdown. In addition, it seems very complex; I just need some quick and reasonable benchmarks.
I actually learnt (thanks to a great response which, however, is no longer showing here) that I probably misused "profiling"; I just intend to benchmark stuff.
I use memtime to capture memory and time information. It appears to capture time correctly even when local subprocesses are launched, but memory does not appear to be captured correctly in that case, so a shell-script launcher just shows negligible memory usage. Also, with Java programs it's really hard to figure out how much memory they actually need except by trial and error, since they use however much you tell the JVM to use. For that reason I've modified some of my programs to print how much memory the JVM thinks it's using, but that's still not quite correct, and hardly valid in a benchmark anyway.
And disk I/O is really difficult, since for example all I really care about are reads/writes to the network FS, but most tools I've seen that try to track I/O lump that together with cached reads, local disk, and even pipes (e.g. `cat x | gzip > y` would measure at least double the I/O that it should). So if anyone has a good universal solution, I'm interested as well.
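For what it's worth, on Linux the per-process counters in `/proc/<pid>/io` draw part of this distinction: `rchar`/`wchar` count every byte passed through read()/write() (pipes and cache hits included), while `read_bytes`/`write_bytes` count only traffic that actually reached the storage layer. The big caveat for pipelines is that the counters are per process and are not summed over children:

```shell
# Linux only (needs CONFIG_TASK_IO_ACCOUNTING); print this shell's own counters.
# rchar/wchar: all bytes through read()/write(), pipes and cache hits included.
# read_bytes/write_bytes: only bytes that actually hit the storage layer.
[ -r /proc/self/io ] && cat /proc/self/io || true
```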
If you test anything involving disk I/O, clear caches in between tests. On Linux, for instance:
http://unix.stackexchange.com/a/87909
Failing to clear caches is a common way for developers to claim (dishonestly) that their binaries run faster, when in fact they don't.
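The linked recipe boils down to something like the following sketch (assuming Linux; writing 1 drops only the page cache, 3 also drops dentries and inodes):

```shell
# Flush dirty pages to disk first so dropping caches doesn't lose writes.
sync
# Writing 3 drops page cache + dentries + inodes; requires root.
# The else branch just keeps this sketch safe to run as a normal user.
if [ "$(id -u)" -eq 0 ]; then
    echo 3 > /proc/sys/vm/drop_caches
else
    echo "re-run as root (e.g. via sudo) to actually drop caches" >&2
fi
```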
Which memtime do you use? I found three: this very old one; this git repository, which seems to be a continuation of the first link; and this git repository, a Perl script.
Yes, benchmarking Java applications is a pain. I recently benchmarked ECC (using GNU time) with -Xmx4g, -Xmx8g, -Xmx16g, -Xmx32g and -Xmx64g. In every case the benchmark reported that all memory given to the JVM was used (at least at some point), but running times were very similar. I wonder, will the JVM release some memory if a competing process needs it? In this particular case 4 GB seems to be enough for ECC and more is just waste, so would the JVM be kind and use just what is necessary?
We use the version at the first link, http://www.update.uu.se/~johanb/memtime/.
As for ECC... it's a bit unusual. Most Java programs will immediately grab virtual memory as specified by `-Xmx`, but only use physical memory as needed. But ECC stores kmer counts in a count-min sketch, which is inefficient to resize dynamically, so that program allocates all possible physical memory immediately (or, if using a prefilter, eventually) even if the input is only 1 read. ECC will never run out of memory; rather, the accuracy is reduced as the structures become increasingly full, so a higher value of `-Xmx` allows increased accuracy. As a result, I recommend running it with all available memory. But for a bacterial isolate, `-Xmx1g` is probably sufficient for good accuracy. High accuracy is important for error-correction, but not very important for normalization, which the tool also does.

Tadpole, on the other hand, also does error-correction. It stores kmer counts exactly, and as a result can run out of memory if the input is too big and `-Xmx` is too low. But it uses resizable hashtables, and as a result physical memory consumption grows over time, only as needed.
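One way to watch that reserve-now/use-later behavior from the outside is to compare a process's virtual size against its resident set. A rough sketch (here `$$`, the current shell, stands in for a real JVM pid, e.g. from `pgrep java`):

```shell
# VSZ (virtual size) reflects the -Xmx reservation almost immediately;
# RSS (resident set size) grows only as heap pages are actually touched.
# Both are reported in KiB; "$$" is a placeholder for the JVM's pid.
ps -o vsz= -o rss= -p "$$"
```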