Tutorial: FASTQ Quality Control Shootout
8.9 years ago

This tutorial also serves as the supporting information for Lecture 9 of the course titled Analyzing High Throughput Sequencing Data offered at Penn State.

Within this tutorial we compare the efficiency and characteristics of several software tools designed to filter FASTQ files. We choose two simple but common operations and perform them with different tools. For some tasks we provide what we call a naive implementation in various languages. These are implementations that would take a programmer about 5-10 minutes to create, and they are for illustrative purposes only.

We want to make a note that the Fastx Toolkit exhibits a strange behavior in that the binary code downloaded for a Mac runs more than five times slower than the binary code compiled locally on the same computer. The times are displayed for the faster version of the Fastx Toolkit. For the NGS Toolkit we have disabled the (otherwise quite helpful) FASTQ format detection subroutines.

The test data s1.fq contains 1 million FASTQ records.

All times are reported in seconds (s). Tools are ordered by runtime.

Clipping Sequences

Clipping removes parts of each FASTQ record. For this test we'll remove 10 bases from the start and 10 from the end.
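To make the operation concrete, here is a minimal sketch of the same clipping applied to a single in-memory FASTQ record (the read and the helper function are made up for illustration; the real benchmarks stream data/s1.fq):

```python
def clip_record(rec, left=10, right=10):
    # Remove `left` bases from the start and `right` bases from the end
    # of the sequence and quality lines of a 4-line FASTQ record.
    head, seq, plus, qual = rec
    return [head, seq[left:-right], plus, qual[left:-right]]

# A made-up 32 bp record; real reads come from data/s1.fq.
record = ["@read1", "ACGTACGT" * 4, "+", "I" * 32]

clipped = clip_record(record)
print(clipped[1])  # 32 - 10 - 10 = 12 bases remain: GTACGTACGTAC
```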

2.5s with Seqtk:

seqtk trimfq -b 10 -e 10 data/s1.fq > data/tmp.fq


3.7s with the Fastx Toolkit:

fastx_trimmer -Q33 -f 10 -l 80 -i data/s1.fq -o data/tmp.fq


4.3s with Trimmomatic

java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE data/s1.fq data/tmp.fq HEADCROP:10 CROP:90


6.4s with the naive Python (no considerable runtime difference when running via the PyPy JIT compiler):

python naive_clip.py < data/s1.fq > data/tmp.fq
pypy naive_clip.py < data/s1.fq > data/tmp.fq


10.3s with NGS Toolkit:

perl TrimmingReads.pl -i data/s1.fq -l 10 -r 10 -o data/tmp.fq


19.3s with Prinseq:

perl prinseq-lite.pl -fastq data/s1.fq -trim_left 10 -trim_right 10 -out_good data/tmp.fq


Trimming by Quality

For this task sequences are removed or shortened to meet a quality criterion.

We attempted to perform an operation where we either remove reads whose average quality is below 30, or we trim reads back, removing bases with quality lower than 30. In the latter case we limit the trimming so that reads maintain a length of at least 50 bp.

2.1s with Seqtk. Seqtk uses the so-called modified Mott algorithm for trimming (trimming stops when it hits the length limit, and no reads are ever discarded):

seqtk trimfq -l 50 data/s1.fq > data/tmp.fq
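As a rough illustration of the idea behind Mott-style trimming (a simplified sketch under our own assumptions, not seqtk's actual code): each base contributes a score of the error-probability limit minus its own error probability, and the highest-scoring contiguous segment of the read is kept.

```python
def mott_trim(quals, limit=0.05):
    # Simplified Mott-style trimming sketch (NOT seqtk's exact algorithm):
    # score each base as (limit - error_probability) and keep the
    # maximal-sum contiguous segment, found with a Kadane-style scan.
    best = (0.0, 0, 0)            # (score, start, end)
    running, start = 0.0, 0
    for i, ch in enumerate(quals):
        p_err = 10 ** (-(ord(ch) - 33) / 10.0)   # Phred+33 decoding
        running += limit - p_err
        if running < 0:
            running, start = 0.0, i + 1          # restart the segment
        elif running > best[0]:
            best = (running, start, i + 1)
    return best[1], best[2]       # slice bounds of the kept segment

# High-quality middle ('I' = Q40), poor ends ('#' = Q2):
print(mott_trim("##IIIIII##"))  # -> (2, 8): the six good bases are kept
```

seqtk additionally enforces the `-l` length floor and, as noted above, never discards reads outright; that logic is omitted here.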


3.1s with a custom Lua program running under the LuaJIT compiler (see comments for more info):

luajit lua/trim.lua 30 50 < data/s1.fq  > data/tmp.fq


4.8s with Trimmomatic, trimming back trailing bases with quality under 30 (removes 0.37% of reads):

java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 data/s1.fq data/tmp.fq TRAILING:30 MINLEN:50


6.8s with the Fastx Toolkit; this operation keeps only reads in which at least 50% of bases have quality above 30:

fastq_quality_filter -Q33 -q 30 -p 50 -i data/s1.fq -o data/tmp.fq


13.3s with the naive Python that removes reads whose average quality is lower than 30, run via the PyPy JIT compiler:

pypy naive_trim.py < data/s1.fq > data/tmp.fq


20.0s with cutadapt, which also implements Mott-type trimming:

cutadapt -q 30 -m 50 data/s1.fq > data/tmp.fq


20.5s with the NGS Toolkit, trimming back bases with quality below 30 from the end (removes 0.37% of reads):

perl TrimmingReads.pl -i data/s1.fq -q 30 -n 50 -o data/tmp.fq


22.3s with the naive Python version that removes reads whose average quality is lower than 30:

python naive_trim.py < data/s1.fq > data/tmp.fq


37.8s with Biopieces, trimming back bases until it reaches a stretch of 3 bases of quality 30 (removes 0.56% of reads):

read_fastq -i data/s1.fq | trim_seq -m 30 | grab -e 'SEQ_LEN>=50' | write_fastq -o data/tmp.fq -x
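A sketch of what that end-trimming rule might look like (our reading of the described behaviour, not Biopieces' actual trim_seq implementation): scan inward from the 3' end and keep everything up to the first run of 3 bases at or above the cutoff.

```python
def trim_until_good_stretch(qual, cutoff=30, stretch=3):
    # Trim from the 3' end until `stretch` consecutive bases reach
    # `cutoff` (Phred+33). Returns how many bases to keep. A simplified
    # sketch of the described behaviour, not Biopieces' trim_seq code.
    phred = [ord(c) - 33 for c in qual]
    run = 0
    for i in range(len(phred) - 1, -1, -1):
        run = run + 1 if phred[i] >= cutoff else 0
        if run == stretch:
            return i + stretch   # keep through the end of the good run
    return 0                     # no good stretch: trim everything

# 'I' = Q40 (good), '#' = Q2 (bad):
print(trim_until_good_stretch("IIIIII###"))  # -> 6: the bad tail is cut
```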


85.6s with Prinseq removing bases with a quality lower than 30 (removes 0.37% of reads):

perl prinseq-lite.pl -fastq data/s1.fq -trim_qual_right 30 -min_len 50 -out_good data/tmp.fq


Closing thoughts

Each tool also provides other functions that we have not discussed at all. What we present here is a benchmark to give newcomers an idea of the relative strengths and weaknesses of the various tools. When running a large number of tools there is always the risk of misusing one or more of them and thus concluding that the tool is inefficient or faulty. Please comment below with any corrections.

We can only marvel at the efficiency with which Seqtk reads and processes data. When trimming it blazes through 1 million reads in a mere 2.1 seconds. At the same time, the simplicity of its invocation and documentation is nothing short of astounding.

Trimmomatic is also a noteworthy competitor with uncommon speed and efficiency. Alas the complexity of understanding the command line presents a heavy and unnecessary burden. The choice of ignoring the many decades of tradition when it comes to passing arguments to a program is also unfortunate. Should any program ever be called org.usadellab.trimmomatic.TrimmomaticSE? The answer is a resounding no.

At the other extreme, we were very surprised by what we consider the subpar performance of the Prinseq tool. This software tool has a particularly pleasing webpage and documentation that extol qualities that do not seem to be reflected in the performance of the tool itself.


TRUTH: "Should any program ever be called org.usadellab.trimmomatic.TrimmomaticSE? The answer is a resounding no."


A comparison of code written in different languages parsing FASTQ files can be found here: How To Efficiently Parse A Huge Fastq File?


Very good and straightforward tutorial! There is also SolexaQA for trimming: http://solexaqa.sourceforge.net/


Added the tool - it takes about 20s, which is pretty good considering that it has a far more sophisticated trimming algorithm than most.


Thanks! Adding some characters to get the comment accepted...


I wrote a simple one in Lua here that does either a moving window or a hard trim: https://gist.github.com/1420220


Freakishly fast with LuaJIT: just 3.11 seconds!


Here is another one I just came across: reaper http://www.ebi.ac.uk/~stijn/reaper/reaper.html


Hi there,

In Trimmomatic, when they say LEADING or TRAILING bases under a certain quality score, how many leading or trailing bases are they referring to? Is it just one?


I have just been using Trimmomatic and the command line execution is now much simpler than in the example. Someone has gone to the effort of making it much easier to run!


Hi, I was using the NGS QC Toolkit (IlluQC.pl), and it keeps throwing this error:

Use of uninitialized value $avgQRangeF2 in split at /leofs/zhangz_group/xial/NGSQCToolkit_v2.3.3/QC/lib/html.pl line 25. Use of uninitialized value $seqFormatName in concatenation (.) or string at /leofs/zhangz_group/xial/NGSQCToolkit_v2.3.3/QC/lib/html.pl line 142.


I suggest you make a new post with this issue.  This isn't an answer to this original thread and people won't be looking here to help you.


OK thank you anyway !!!


Very informative! Sorry to comment on an aging post - but I think I might see an error:

Specifically re. "we attempted to perform an operation where we either remove reads that have an average quality of 30 or we trim back reads removing bases that have quality lower than 30": the call to seqtk trimfq doesn't seem to specify the appropriate error-probability threshold. Wouldn't it be necessary to set "-q 0.001"? (Since the default is 0.05, which, if my understanding is correct, would correspond to a quality score of ~13?)
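For reference, the quality-score arithmetic in that question follows from the standard Phred relation Q = -10 · log10(p); a quick check:

```python
import math

def phred_to_error(q):
    # Error probability for a Phred score: p = 10^(-Q/10)
    return 10 ** (-q / 10.0)

def error_to_phred(p):
    # Phred score for an error probability: Q = -10 * log10(p)
    return -10 * math.log10(p)

print(round(error_to_phred(0.05), 1))  # seqtk's default -q 0.05 is ~Q13
print(phred_to_error(30))              # Q30 corresponds to p = 0.001
```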


Yes, there is a little inconsistency there; the way seqtk works, the desired operation cannot easily be translated into parameters. I forget the exact details, but I believe it keeps a running sum of the error rather than considering each base individually, so the concept of a sharp cutoff does not apply.


Hi, can you make s1.fq available via FTP? Thanks.

8.5 years ago

Trim with Biopieces can be done like this:

read_fastq -i in.fq | extract_seq -b 10 | reverse_seq | extract_seq -b 10 | reverse_seq | write_fastq -o out.fq -x

or if you are working with fixed length reads simply:

read_fastq -i in.fq | extract_seq -b 10 -e 80 | write_fastq -o out.fq -x

(I really should upgrade extract_seq to use negative values to offset from the end).

I guess both will be slowish. However, Biopieces and GNU parallel work wonders. It would be interesting if you'd add benchmarks for that combination: http://code.google.com/p/biopieces/wiki/HowTo#Example_processing_a_FASTQ_file

8.9 years ago

The code for the "naive" implementations used above. This code requires that the sequence and quality appear on a single line:

Naive clipping with Python

import sys, string, itertools

stream = itertools.imap(string.strip, sys.stdin)

for id in stream:
    print id
    print stream.next()[10:-10]
    print stream.next()
    print stream.next()[10:-10]


Naive trimming with Python

import sys, string, itertools

def decode(x):
    return ord(x) - 33

def average(text):
    vals = map(decode, text)
    return sum(vals)/float(len(text))

stream = itertools.imap(string.strip, sys.stdin)

for id in stream:
    s, p, q = stream.next(), stream.next(), stream.next()
    avg = average(q)
    if avg > 30:
        print "%s\n%s\n%s\n%s" % (id, s, p, q)


I'm aware this thread has just been resurrected, but I figured I'd just comment that these snippets are perhaps a bit unfair on 'naive python'. I haven't bothered with the second, but just a couple of simple tweaks make the first one run in under half the time on a random 1M-record FASTQ file:

from sys import stdin, stdout
# from itertools import izip as zip # Uncomment for Py2

for sid, seq, qid, quals in zip(*[iter(stdin)]*4):
    stdout.write('{}{}\n{}{}\n'.format(
        sid, seq.rstrip()[10:-10],
        qid, quals.rstrip()[10:-10]))

// EDIT // For the sake of completeness, here's the second:

from sys import stdin, stdout
# from itertools import izip as zip # Uncomment for Py2

for sid, seq, qid, quals in zip(*[iter(stdin)]*4):
    q = quals.rstrip()
    if sum(ord(c) - 33 for c in q) / len(q) > 30:
        stdout.write('{}{}\n{}{}\n'.format(
            sid, seq.rstrip(), qid, q))

Interestingly it is faster indeed.

I would have thought that all that work of creating four copies of the input stream then expanding into a parameter list to zip on each iteration would be slower than requesting the next item four times. I guess not, Python can work in mysterious ways.

(just to mince words, it is not a naive solution though ;-), it is one optimized around some internals)


Yeah, perhaps it's not so naive :p

The trick behind that zip statement is how Python references the iter objects in memory ...

>>> [iter('foobar')]*2
[<str_iterator object at 0x7f9d68e5ba58>, <str_iterator object at 0x7f9d68e5ba58>]

... they actually all point to the same object in memory, rather than copies.
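That aliasing is what makes the record chunking work: each zip step advances the same underlying iterator four times, so the line stream falls neatly into 4-line FASTQ records. A small demonstration (with made-up lines):

```python
# Four references to ONE iterator: each zip step advances the same
# underlying iterator four times, chunking the stream into records.
lines = ["@r1", "ACGT", "+", "IIII", "@r2", "TTTT", "+", "JJJJ"]
records = list(zip(*[iter(lines)] * 4))
print(records)
# [('@r1', 'ACGT', '+', 'IIII'), ('@r2', 'TTTT', '+', 'JJJJ')]
```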