The venerable FastQC is many scientists' first choice for generating that first look at sequencing data. The software, created in 2010, runs as a simple command-line or GUI tool and generates various statistical plots. It is simple and fast, and the plots look good, although many of them show quantities that are easy to misunderstand.
For example, read qualities are grouped into bins for longer reads, and values are mysteriously normalized to 100%, leading to wrong conclusions from those who don't notice the finer details. Moreover, the software is not well suited for paired-end read analysis, and the reporting mode is unwieldy when running on dozens of samples.
So naturally I was intrigued when I noticed two new approaches published recently:
- HTQC: a fast quality control toolkit for Illumina sequencing data, published in BMC Bioinformatics, 2013
- NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets, published in Bioinformatics, 2012
This post is an evaluation of how they work, based on first-hand experience from this morning, starting with a basic task: evaluating a dataset of 24 million reads. Let me note that these happen to be really short reads by today's standards, at only 40 bases long.
My benchmark is what I would normally do:
$ time fastqc data/28824.fq
... removed ...
It seems our guess for the total number of records wasn't very good. Sorry about that.
... removed ...
real 2m25.152s
user 2m27.437s
sys 0m2.023s
Let me get something off my chest here. I have seen that apology above so many times - I can't recall the last time I did not see it. The only effect it has on me is to make me wonder: just why exactly is it so difficult to guess the number of lines? After all, each line has the exact same length and is composed of ASCII letters.
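The point really is simple arithmetic: in a fixed-length, uncompressed FASTQ file every record is four lines of identical byte length, so the read count follows from the line count, or even from the file size alone. A minimal sketch on a throwaway toy file (a real run would of course point at something like data/28824.fq):

```shell
# Build a throwaway FASTQ with three identical fixed-length records.
printf '@r\nACGTACGT\n+\nIIIIIIII\n' > /tmp/toy.fq
printf '@r\nACGTACGT\n+\nIIIIIIII\n' >> /tmp/toy.fq
printf '@r\nACGTACGT\n+\nIIIIIIII\n' >> /tmp/toy.fq

# Exact count: a FASTQ record is always four lines.
reads=$(( $(wc -l < /tmp/toy.fq) / 4 ))

# Size-based estimate: total bytes divided by the bytes in one record --
# exact here because every record has the same length.
rec_bytes=$(printf '@r\nACGTACGT\n+\nIIIIIIII\n' | wc -c)
est=$(( $(wc -c < /tmp/toy.fq) / rec_bytes ))

echo "$reads $est"   # 3 3
```

With variable-length reads the size-based estimate becomes approximate, but for data like this it would be dead on.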
Oh well, why am I even mentioning that ... FastQC runs well, finishing in 2 minutes 25 seconds; it generates quite a few plots, and the maximum amount of memory used was about 180MB.
Now onto bigger and better tools that will blow this old whale out of the water.
HTQC: turns out it needs CMake to install. Well, CMake is supposed to make (pun intended) life easier; alas, in my experience it only means more trouble. Sure enough, my current version of CMake is not good enough, it needs a higher version. Now I need to download and install that. Manually, of course, since the package manager for the so-called Scientific Linux does not have it. Thanks a bunch. Ok, done. After that it compiles fine.
NGSUtils: I do a git clone and type make; it goes to town furiously downloading and compiling a lot of resources that already exist on my computer. But it does finish the job, although it leaves me with an uncertain feeling as to whether or not it has modified anything that is already there.
The HTQC paper claims it to be three times FASTER than FastQC while using a lot less memory as well.
Let's run it. Turns out you really need the -q option, otherwise a message is printed every 5000 lines. I must say, default behavior like this is not all that reassuring.
$ time ~/src/htqc-0.11.1/build/ht_stat -q --out report data/28824.fq
real 5m58.007s
user 5m48.289s
sys 3m16.106s
The observed runtime is more than two times SLOWER!!! and, while running, the program used 1.9GB!!! of memory.
Alas, there is more: this does not actually generate plots, only datasets. To get the plots one needs to run a separate program that invokes gnuplot. Great, I already have that installed. Running the tool fails with a mysterious error: "font not a valid variable". Internet sleuthing indicates that this error occurs when making use of features that are only available in the latest gnuplot version, 4.6. My package manager does not have this version (of course), so it needs to be installed manually. Oh well, I did that too, but my patience is running thin. Run already:
$ time ~/src/htqc-0.11.1/ht_stat_draw.pl --dir report
???
The process does not seem to finish! Some plots are generated, but the command does not return. The plots it does generate are very ugly and look wrong: the bases extend to the 100 range even though the reads are only 40 bases long. So far it does not look good at all.
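When a plot looks this wrong, it is worth checking the raw data directly. A quick sanity check, sketched here on a throwaway toy file (on the real data you would point awk at data/28824.fq, perhaps behind a head to keep it fast): if it prints a single value, the reads are fixed-length, and any axis running past that value is a plotting artifact.

```shell
# Throwaway FASTQ standing in for the real data; both reads are 8 bases long.
printf '@r1\nACGTACGT\n+\nIIIIIIII\n@r2\nACGTACGT\n+\nIIIIIIII\n' > /tmp/toy.fq

# In a FASTQ file the sequence is every fourth line, starting at line 2.
lengths=$(awk 'NR % 4 == 2 { print length($0) }' /tmp/toy.fq | sort -u)
echo "$lengths"   # 8 -> a single distinct value means fixed-length reads
```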
Let's try the other contender: NGSUtils has a command called fastqutils stats; we invoke it like so:
~/src/ngsutils/bin/fastqutils stats data/28824.fq
A facepalm moment ensues: this script prints information to the standard output for every single read that it investigates. The interface is made to look slick: there is a little rotating pipe showing that the program is running, the name of each read is printed, and it continuously computes information such as the ETA and percent done. It is also insanely slow, because the speed at which this tool runs equals the speed of writing characters to the screen. No wonder the ETA indicates 24 minutes.
So there you have it: two recently published tools, each claiming to do something better, whereas in practice they are immensely inferior to much older tools and techniques. Perhaps Fred was onto something in A farewell to bioinformatics.