Tutorial: Fastq Quality Control And Reporting - Aka Fastqc Versus The New Contenders
21
gravatar for Istvan Albert
4.6 years ago by
Istvan Albert ♦♦ 73k
University Park, USA
Istvan Albert ♦♦ 73k wrote:

The venerable FastQC is many scientists' first choice to generate that first look of a sequencing data. The software created in 2010 runs as a simple command line or GUI tool and generates various statistics plots. It is simple and fast, the plots look good although many plots show quantities that are easy to misunderstand.

For example read qualities are grouped for longer reads, values are mysteriously normalized to 100% leading to wrong conclusions by those that don't notice the finer details. Moreover the software is not suited for paired end read analysis and the reporting mode is unwieldy when running on dozens of samples.

So naturally I was intrigued when noticing two new approaches published recently:

This post is an evaluation of how they work based on first hand experiences performed this morning, starting from a basic task of having to evaluate a dataset with 24 million reads. Let me note that these happen to be really short reads by today's standards at only 40 bases long.

My benchmark is what I would normally do:

$ time fastqc data/28824.fq 
... removed ...    
It seems our guess for the total number of records wasn't very good.  Sorry about that.
... removed ...
real    2m25.152s
user   2m27.437s
sys   0m2.023s

Let me get something off my chest here. I have seen that apology above so many times - I can't recall the last time I did not see it. The only effect it has on me is to make me wonder: just why exactly is it so difficult in guess the number of lines? After all each line has the exact same length and is composed of ASCII letters.

Oh well, why am I even mentioning that ... fastqc runs well in 2 minutes 25 seconds, generates quite a few plots and the maximum amount of memory used was about 180MB

Now onto bigger and better tools that will blow this old whale out of the water.

Installation

HTQC: turns out it needs CMake to install. Well CMake is supposed to make (pun intended) life easier, alas in my experience in only means more trouble. Sure enough the version of my current CMake is not good enough, it needs a higher version. Now I need to download and install that. Manually of course since the package manager for the so called Scientific Linux does not have that. Thanks a bunch. Ok done. After that it compiles fine.

NGSUtils: I do a git clone and type make, it goes to town furiously downloading and compiling a lot of resources that already exist on my computer. But it does finish its job although leaves me with an uncertain feeling as to whether or not it has modified anything that is already there.

Runtimes

The paper for HTQC claims it to be three times FASTER than FastQC and claims to use a lot less memory as well.

Let's run it. Turns out you really need the -q option otherwise there is a message printed every 5000 lines. I must say default behavior like is not all that reassuring.

time ~/src/htqc-0.11.1/build/ht_stat -q --out report data/28824.fq
real    5m58.007s
user    5m48.289s
sys    3m16.106s

The observed runtime is more than two times SLOWER !!!, and while running the program used 1.9GB !!! of memory.

Alas there is more, this does not actually generate plots only datasets. To get the plots one needs to run a separate program that invokes gnuplot. Great I have that already installed. Running the tool fails with a mysterious error "font not a valid variable". Internet sleuthing indicates that this error occurs when making use of features that are only available in the latest GnuPlot version 2.6. My package manager does not have this version (of course) of so it needs to be installed manualy. Oh well, I did that too but my patience is running thin. Run already:

time ~/src/htqc-0.11.1/ht_stat_draw.pl --dir report
???

The process does not seem to finish! Some plots are generated but the command does not return. The plots that the command generates are very ugly and look wrong, the bases extend to the 100 range even though the reads are only 40 bases long. So far it does not look good at all.

Let's try to other contender: NGSUtils has a command called fastqutils stats we invoke it like so:

~/src/ngsutils/bin/fastqutils stats data/28824.fq

Facepalm moment ensues, this script prints information to the standard output for every single read that it investigates. The interface is made to look slick. There is a little rotating pipe that show that the program runs, the name of each read is printed and it continuously computes information such as the ETA and percent done. It is also insane slow because the speed at which this tool runs is equal to the speed of writing characters to the screen. No wonder the ETA indicates 24 minutes.

So there you have: it two recently published tools each claiming to do something better whereas in practice they are immensely inferior to much older tools and techniques. Perhaps Fred was onto something A farewell to bioinformatics

tutorial fastqc qualitycontrol • 7.6k views
ADD COMMENTlink modified 4.6 years ago by assafgordon140 • written 4.6 years ago by Istvan Albert ♦♦ 73k
6

This is also informative about the absurdly low bar set for tool papers. This makes me question the viability of tool papers altogether.

ADD REPLYlink written 4.6 years ago by Jeremy Leipzig17k
1

My first tool-related manuscript, which was rejected, was even worse than HTQC and ngsutils. I had invested quite a lot of time and thought I was doing for good, but when I looked back, my that tool is just something that I criticize every day. You are right that we should be cautious of tool papers, but most of these authors are trying to write something useful and more importantly they need a publication to graduate.

ADD REPLYlink written 4.6 years ago by lh330k
2

One could argue that the responsibility is on the reviewers - but then they are not directly rewarded for a job well done and the incentives to do be thorough are lacking. So in the end it is the system that is to blame - too many people need to write too many papers to move ahead on the pre-determined social ladder.

ADD REPLYlink written 4.6 years ago by Istvan Albert ♦♦ 73k

agreed, and sometimes a tool I've written makes complete sense to me and runs flawlessly, but others seem to find a way to use it in a way that I haven't foreseen (or cared about) so to them it is frustrating and useless. (try installing scipy if you're not a seasoned linux user)

ADD REPLYlink written 4.6 years ago by brentp22k

I do agree. How many times (I am not going to cite the tools) in my team we have been lost into tool "bugs". We have spent a incredible amount of time running a tool to see it crash in the end, and for some tools it could fail after one week. You also sometimes report this to the developer and you suddenly learn that there are quite a few things to improve to make it work really properly. Well, thank you very much. But sure, the paper claimed that it was going to be the inevitable tool of the year.

ADD REPLYlink written 4.6 years ago by toni2.1k
2

I do not like CMake, either. The right building system should not require end users to install anything but the basic unix tools. Also, developers should provide Linux binary for tools that depends on non-standard libraries or is complex to build.

ADD REPLYlink written 4.6 years ago by lh330k

Thanks for the checking them out. Another way to look at raw data tileQC. It produces tiles and one can quickly look for the spots which are of bad quality. This is just useful to have a glimpse of data as a whole (good/bad). Its available as R and Python version (python is supposedly faster)

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by Sukhdeep Singh9.0k

have you run this yourself? seems to create a mysql database for each dataset - I don't see that as feasible for today high throughput

ADD REPLYlink written 4.6 years ago by Istvan Albert ♦♦ 73k

Yes, but I have ran the R version, Didn't created that there.

ADD REPLYlink written 4.6 years ago by Sukhdeep Singh9.0k

Thanks for the post - very informative and good to see that sticking with FastQC out of convenience is also the smarter choice :-)

ADD REPLYlink written 4.6 years ago by Sebastian Kurscheid300
13
gravatar for assafgordon
4.6 years ago by
assafgordon140
assafgordon140 wrote:

Regarding I/O performance in FASTQ (and other) tools

My name was (wrongly) mentioned, so I feel fine to chime in.

First - many thanks to Aaron for his kind words, but FastQC was written by Simon Andrews from The Babraham institute. I developed other tools called "fastx-toolkit" long ago, but those were not mentioned in the above post (they are pretty fast, though :) ).

Now, when it comes to processing FASTQ files:

  1. They are usually huge - meaning lots of I/O
  2. The are sometimes compressed, which complicates matters.
  3. Their format is (almost) fixed simple, except when it's not.

On top of that, there's the usual problem that when you write code, it's better if you actually know how to program, and understand the intricate details of what's going on. Sadly, many "bioinformaticians" are more "informaticians" and less programmers.

With the two programs that you've mentioned:

"htqc" is written in CPP, trying pretty OO design, but falls short when it does I/O. The main I/O loop in "htqc-0.12.0-source/htio/PlainFileHandle.cpp" is this:

while ((curr_char = fgetc(handle)) != EOF)
{
     if (curr_char == LINE_SEPARATOR) break;
     buffer.push_back(curr_char);
}

Meaning it reads the (huge) FASTQ file character-by-character, and adds it to std::vector. Sorry - this will never perform well.

"NGSUtils" is written in Python (yuck) which already means performance will not be great. I'm a Perl person, but I don't claim Perl will do better. With raw I/O and simple logic (as in most simlpe Fastq stats), you'll want to do it in a more low-level programming language.

But regardless of Python, the main I/O loop again falls short. In "./ngsutils-ngsutils-0.5.0c/ngsutils/fastq/__init__.py", the loop is:

while True:
    try:
        name = self.fileobj.next().strip()[1:]

        spl = re.split(r'[ \t]', name, maxsplit=1)
        name = spl[0]
        if len(spl) > 1:
            comment = spl[1]
        else:
            comment = ''

        seq = self.fileobj.next().strip()
        self.fileobj.next()
        qual = self.fileobj.next().strip()

        If eta:
            eta.print_status(name)
        yield FASTQRead(name, comment, seq, qual)

        except:
            break

Which means that on top of Python's (slow) I/O, it does string.strip() on three lines and a regular-expression string.split() every four lines. This again - will not perform well, ever. Also checking "eta" and printing the progress report on EVERY read is wasteful.

As a programmer, I understand the temptation to use high-level constructs (e.g. regexp split) to do something useful in very few lines, but one has to understand that it comes with a price. No doubt using something like C's "strpos()" to find the space in the ID string is more cumbersome, but it's WAY faster.

Sadly, Peer-reviewed bioinformatic journals don't do proper code-review - so sub-par code is regularly published.

Regarding your (justified) rant: "just why exactly is it so difficult in guess the number of lines? After all each line has the exact same length and is composed of ASCII letters" ?

1. Not all lines have the same length: The ID line is subtly changed, because they lane/tile/X/Y values are reported as numbers. e.g.:

@HWI-M1:36:000000000-A2R1H:1:1101:14988:1368 1:N:0:NTCAC
@HWI-M1:36:000000000-A2R1H:1:109:99:7 1:N:0:NTCAC
@HWI-M1:36:000000000-A2R1H:1:432:9:1111 1:N:0:NTCAC

So while you can approximate the size based on the average size of the first 4 lines (or the first 1000 lines) and divide by the file size, it won't be accurate.

2. Most FASTQ files are stored compressed. Many tools can either read them and automatically uncompress them, or used with pipes - so they don't know the FASTQ file size in advance.

3. Even now and them you encounter a corrupt file - trucated during upload/download, or bad disk, or whatever - in those cases it's better to have a tool that scans the entire file and verifies that everything is OK (I got many "bug report" emails that turned out to be bad input files).

So while a tool that does guestimate the number of reads is possible - I guess nobody bothered to write one.

As a shameless plug, I wrote a tool that exactly that (just count reads in FASTQ files), and it's fast enough for me (not necessarily faster than other tools): https://github.com/agordon/fastx_read_count

Regards,

-gordon

ADD COMMENTlink modified 4.6 years ago • written 4.6 years ago by assafgordon140
1

Using fgetc() is bad, but it is not that bad. On my machine, reading a 4-million-line fastq (only reading without parsing) with fgets() takes 0.19 sec and with fgetc() takes 1.48 sec; fastq_to_fasta takes 7.60 sec. The bottleneck is not at fgetc/fgets - even if we use fgetc(), the difference would be 8.88 vs. 7.60 sec, which is not much. I do not have a good profiler installed on that machine. I guess fastq_to_fasta's bottleneck lies in chomp(), quality conversion and error checking. For efficient fasta/q processing, every loop through sequences/qualities, even memcpy(), hurts performance. Probably when you remove those loops, fastx_to_fasta will be much faster. For example, seqtk, which is less robust but support multi-line fasta/q, converts the same fastq to fasta in 0.40 sec. EDIT: also about the I/O speed of perl/python. Without error checking, a minimal fastq2fasta converter in perl take 1.37 sec. Perl/python I/O itself is not slow, but improper OOP layer may be slow as you said, and loops are particularly hurting.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by lh330k

It's not that bad, but it could be better :)

However, you are refering to two diffrent issues:

  1. fgetc() by itself is not that bad, but even with your profling: 0.19s vs 1.48s is almost x7 slower - true that 1sec of real-world-time is not alot - but it's still slower. But worse - every step of the above loop does a "push_back" on an std::vector - that's really wasteful.

  2. fastq_to_fasta not only does "chomp" (which is also wasteful), but also verify character-by-character of every read - the ID, the nucleotide string, and the quality scores - this is the real source of the slowness - but for me it's acceptable, because I detected many corrupted FASTQ files right at the beggining, instead of using them with other programs.

I guess it's a balancing act: If you want something that's easy to program (e.g. Perl/Python) - you'll get slower performance. If you want something that's fast - you'll need to do more coding. But in both cases, better code would lead to better performances.

For the ultimate performance, one actually has to take it a step further, and read an entire block (e.g. 4KB), and do the line parsing in the program, instead of using stdio to read line-by-line (for example, GNU sort does this). But the amount of code required is big and error prone. I wouldn't want to go that way in my programs, as the benefits are low (IMHO).

Update: I just noticed you actually implemented a whole block-like reading framework in seqtk - kudos! and that definitely shows great performance. But you'll agree that your ks_getc() and kseq_read() macros is orders-of-magnitude more advanced than a naive implementation of fgetc() - a programmer who uses fgetc() will not get as good as performance as seqtk (I think).

ADD REPLYlink modified 4.6 years ago by Istvan Albert ♦♦ 73k • written 4.6 years ago by assafgordon140

fgets() has an internal buffer, which defaults to at least 4kb. It reaches nearly the optimal speed for line reading. You can hardly write something much faster. I didn't use fgets() because I need gzip support and more control when parsing multi-line fastq - not because it is slower. For 4-line fastq, fgets() works well. As to fgetc(), it uses the same buffer. My ks_getc() won't be faster. In all, my point is we cannot say a program is inefficient just because it uses fgetc() and push_back(). The bottleneck is frequently not at fgetc(). I can write a fastq-to-fasta converter with fgetc() and push_back() that will not be as fast as seqtk, but will be faster (without sophisticated error checking of course) than many existing tools.

ADD REPLYlink modified 4.6 years ago • written 4.6 years ago by lh330k

my bad regarding the line sizes, as you point it out more information is embedded and that varies the size of each line. Thanks for the great overall analysis.

ADD REPLYlink written 4.6 years ago by Istvan Albert ♦♦ 73k

My apologies for the toolname confusion...

ADD REPLYlink written 4.6 years ago by Aaronquinlan9.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1193 users visited in the last hour