Tutorial: Fastq Quality Control Shootout
Istvan Albert (University Park, USA) wrote, 5.8 years ago:

This tutorial also serves as the supporting information for Lecture 9 of the course titled Analyzing High Throughput Sequencing Data offered at Penn State.

Within this tutorial we compare the efficiency and characteristics of several software tools designed to filter FASTQ files. We choose two simple but common operations and perform them with different tools. For some tasks we also provide what we call a naive implementation in various languages. These are implementations that would take a programmer about 5-10 minutes to create, and they are for illustrative purposes only.

We want to note that the Fastx Toolkit exhibits a strange behavior: the binary downloaded for a Mac runs more than five times slower than the binary compiled locally on the same computer. The times displayed are for the faster version of the Fastx Toolkit. For the NGS Toolkit we have disabled the (otherwise quite helpful) FastQ format detection subroutines.

The test data s1.fq contains 1 million FASTQ records.

All times are reported in seconds (s). Tools are ordered by runtime.

Clipping Sequences

Clipping means removing parts of each FASTQ record. For this test we'll remove 10 bases from the start and the end of each read.

2.5s with Seqtk:

seqtk trimfq -b 10 -e 10 data/s1.fq > data/tmp.fq

3.7s with the Fastx Toolkit:

fastx_trimmer -Q33 -f 10 -l 80 -i data/s1.fq -o data/tmp.fq

4.3s with Trimmomatic

java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE data/s1.fq data/tmp.fq HEADCROP:10 CROP:90

6.4s with the naive Python implementation (no considerable runtime difference when running via the PyPy JIT compiler):

python naive_clip.py < data/s1.fq > data/tmp.fq
pypy naive_clip.py < data/s1.fq > data/tmp.fq

10.3s with NGS Toolkit:

perl TrimmingReads.pl -i data/s1.fq -l 10 -r 10 -o data/tmp.fq

19.3s with Prinseq:

perl prinseq-lite.pl -fastq data/s1.fq -trim_left 10 -trim_right 10 -out_good data/tmp.fq

Trimming by Quality

For this task sequences are removed or shortened to meet a quality criterion.

We attempted to perform an operation where we either remove reads that have an average quality below 30, or we trim back reads, removing bases that have a quality lower than 30. In the latter case we limit the trimming so that reads maintain a length of at least 50 bp.

2.1s with Seqtk. Seqtk uses the so-called modified Mott algorithm for trimming (trimming stops when it hits the limit, and no reads are ever discarded).

seqtk trimfq -l 50 data/s1.fq > data/tmp.fq
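For the curious, the Mott-style trimming idea can be sketched in a few lines of Python: each quality score is converted to an error probability, each base scores `limit - error`, and the read is cut down to the contiguous run with the highest cumulative score. This is a simplified illustration only, not seqtk's actual implementation (whose scoring details differ):

```python
def mott_trim(qual, limit=0.05, offset=33):
    """Return (start, end) of the highest-scoring region of a read.

    Each base scores `limit - error_prob`; the best contiguous run is
    found with a Kadane-style scan. A simplified sketch of Mott-style
    trimming, not seqtk's actual algorithm.
    """
    best, best_score = (0, 0), 0.0
    start, score = 0, 0.0
    for i, ch in enumerate(qual):
        err = 10.0 ** -((ord(ch) - offset) / 10.0)  # Phred+33 -> P(error)
        score += limit - err
        if score < 0:
            score, start = 0.0, i + 1  # restart the candidate run
        elif score > best_score:
            best_score, best = score, (start, i + 1)
    return best
```

For example, `mott_trim('!!IIII!!')` keeps the high-quality middle of the read and returns `(2, 6)`.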

3.1s with a custom Lua program running under the LuaJIT compiler (see comments for more info):

luajit lua/trim.lua 30 50 < data/s1.fq  > data/tmp.fq

4.8s with Trimmomatic, trimming back trailing bases with qualities under 30 (removes 0.37% of reads):

java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 data/s1.fq data/tmp.fq TRAILING:30 MINLEN:50

6.8s with the Fastx Toolkit; this operation only keeps reads in which at least 50% of the bases have a quality of at least 30:

fastq_quality_filter -Q33 -q 30 -p 50 -i data/s1.fq -o data/tmp.fq

13.3s with the naive Python implementation that removes reads with an average quality lower than 30, run via the PyPy JIT compiler:

pypy naive_trim.py < data/s1.fq > data/tmp.fq

20.0s with cutadapt, which also implements a Mott-type trimming:

cutadapt -q 30 -m 50 data/s1.fq > data/tmp.fq

20.5s with the NGS Toolkit, trimming back bases with quality under 30 from the end (removes 0.37% of reads):

perl TrimmingReads.pl -i data/s1.fq -q 30 -n 50 -o data/tmp.fq

22.3s with the naive Python version that removes reads with an average quality lower than 30:

python naive_trim.py < data/s1.fq > data/tmp.fq

37.8s with Biopieces, trimming back bases until it reaches a stretch of 3 bases of quality 30 (removes 0.56% of reads):

read_fastq -i data/s1.fq | trim_seq -m 30 | grab -e 'SEQ_LEN>=50' | write_fastq -o data/tmp.fq -x

85.6s with Prinseq, removing bases with a quality lower than 30 (removes 0.37% of reads):

perl prinseq-lite.pl -fastq data/s1.fq -trim_qual_right 30 -min_len 50 -out_good data/tmp.fq

Closing thoughts

Each tool also provides other functions that we have not discussed at all. What we present here is a benchmark to give newcomers an idea about the relative strengths and weaknesses of the various tools. When running a large number of tools there is always the risk of misusing one or more of them and thus concluding that a tool is inefficient or faulty. Please add any comments or corrections below.

We can only marvel at the efficiency with which Seqtk reads and processes data. When trimming, it blazes through 1 million reads in a mere 2.1 seconds. At the same time, the simplicity of its invocation and documentation is nothing short of astounding.

Trimmomatic is also a noteworthy competitor with uncommon speed and efficiency. Alas, the complexity of its command line presents a heavy and unnecessary burden. The choice of ignoring many decades of tradition when it comes to passing arguments to a program is also unfortunate. Should any program ever be called org.usadellab.trimmomatic.TrimmomaticSE? The answer is a resounding no.

At the other extreme, we were very surprised by what we consider the subpar performance of the Prinseq tool. This software tool has a particularly pleasing webpage and documentation that extol qualities that do not seem to be reflected in the performance of the tool itself.

Tool List

modified 3.5 years ago by caelyn5201314 • written 5.8 years ago by Istvan Albert

TRUTH: "Should any program ever be called org.usadellab.trimmomatic.TrimmomaticSE? The answer is a resounding no."

written 5.7 years ago by Madelaine Gogol

A comparison of code written in different languages parsing FASTQ files can be found here: How To Efficiently Parse A Huge Fastq File?

written 5.4 years ago by Ying W

Very good and straightforward tutorial! There is also SolexaQA for trimming: http://solexaqa.sourceforge.net/

modified 5.4 years ago • written 5.4 years ago by Pasta

Could you add cutadapt? It seems a useful tool as well. http://code.google.com/p/cutadapt/

written 5.7 years ago by lexnederbragt

Added the tool - it takes about 20s, which is pretty good considering that it has a far more sophisticated trimming algorithm than most.

written 5.7 years ago by Istvan Albert

Thanks! Adding some characters to get the comment accepted...

written 5.7 years ago by lexnederbragt

I wrote a simple one in lua here that does either a moving window or a hard trim: https://gist.github.com/1420220

written 5.7 years ago by brentp

Freakishly fast with LuaJIT: just 3.11 seconds!

modified 5.7 years ago • written 5.7 years ago by Istvan Albert

Here is another one I just came across: reaper http://www.ebi.ac.uk/~stijn/reaper/reaper.html

written 5.7 years ago by lexnederbragt

Hi there,

In Trimmomatic, when they say LEADING or TRAILING bases under a certain quality score, how many leading or trailing bases are they referring to? Is it just one?

written 5.3 years ago by alabri.mohammed

Please don't add the above as a new answer to this thread. Ask a new question or add a new comment.

written 5.3 years ago by Istvan Albert

I have just been using Trimmomatic and the command line execution is now much simpler than in the example. Someone has gone to the effort of making it much easier to run!

written 5.0 years ago by Ian

Is this the link to the whole course? http://www.personal.psu.edu/users//i/u/iua1/courses/2011-BMMB-597D.html

written 4.7 years ago by Medhat

Course links are occasionally "rearranged", but you can always navigate there via my home page: http://www.personal.psu.edu/iua1/index.html

written 4.7 years ago by Istvan Albert

Hi, I was using the NGS QC Toolkit (IlluQC.pl), and it keeps throwing the error:

Use of uninitialized value $avgQRangeF2 in split at /leofs/zhangz_group/xial/NGSQCToolkit_v2.3.3/QC/lib/html.pl line 25.

Use of uninitialized value $seqFormatName in concatenation (.) or string at /leofs/zhangz_group/xial/NGSQCToolkit_v2.3.3/QC/lib/html.pl line 142.

Please help me! Thank you.

written 3.5 years ago by caelyn5201314

I suggest you make a new post with this issue. This isn't an answer to the original thread, and people won't be looking here to help you.

written 3.5 years ago by george.ry

OK, thank you anyway!

written 3.5 years ago by caelyn5201314

Very informative! Sorry to comment on an aging post - but I think I might see an error:

Specifically, regarding "we attempted to perform an operation where we either remove reads that have an average quality of 30 or we trim back reads removing bases that have quality lower than 30": the call to seqtk trimfq doesn't seem to specify the appropriate error probability threshold. Wouldn't it be necessary to set "-q 0.001"? (Since the default is 0.05, which, if my understanding is correct, would correspond to a quality score of ~13?)

modified 3.1 years ago • written 3.1 years ago by daniel.delbalso

Yes, there is a little inconsistency there; the way seqtk works, the desired operation cannot be easily translated into parameters. I forget the exact details, but I believe it keeps a running sum of the error rather than considering each base individually, so the concept of a sharp cutoff does not apply.

written 3.1 years ago by Istvan Albert
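To put numbers on the conversion discussed in the two comments above: the Phred scale defines Q = -10 * log10(p), so an error probability threshold of 0.05 does indeed correspond to a quality of about 13, while quality 30 corresponds to an error probability of 0.001. A quick sketch of the conversion:

```python
import math

def phred_to_error(q):
    """Error probability implied by a Phred quality score."""
    return 10.0 ** (-q / 10.0)

def error_to_phred(p):
    """Phred quality score implied by an error probability."""
    return -10.0 * math.log10(p)

# seqtk's default error threshold of 0.05 corresponds to quality ~13;
# quality 30 corresponds to an error probability of 0.001
```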

Hi, can you make s1.fq available via FTP? Thanks.

written 2.1 years ago by Dan480
Martin A Hansen (Denmark) wrote, 5.4 years ago:

Trim with Biopieces can be done like this:

read_fastq -i in.fq | extract_seq -b 10 | reverse_seq | extract_seq -b 10 | reverse_seq | write_fastq -o out.fq -x

or if you are working with fixed length reads simply:

read_fastq -i in.fq | extract_seq -b 10 -e 80 | write_fastq -o out.fq -x

(I really should upgrade extract_seq to use negative values to offset from the end).

I guess both will be slowish. However, Biopieces and GNU parallel work wonders. It would be interesting if you'd add benchmarks for that combination: http://code.google.com/p/biopieces/wiki/HowTo#Example_processing_a_FASTQ_file

written 5.4 years ago by Martin A Hansen
Istvan Albert (University Park, USA) wrote, 5.7 years ago:

Below is the code for the "naive" implementations used above. This code requires that the sequence and quality each appear on a single line:

Naive clipping with Python

import sys, string, itertools

# Python 2: lazily strip newlines from stdin
stream = itertools.imap(string.strip, sys.stdin)

# each FASTQ record spans four lines: @id, sequence, separator, quality
for id in stream:
    print id                     # @id line
    print stream.next()[10:-10]  # sequence, clipped
    print stream.next()          # '+' separator
    print stream.next()[10:-10]  # quality, clipped

Naive trimming with Python

import sys, string, itertools

def decode(x):
    # Phred+33 encoding: quality = ASCII code - 33
    return ord(x) - 33

def average(text):
    vals = map(decode, text)
    return sum(vals) / float(len(text))

# Python 2: lazily strip newlines from stdin
stream = itertools.imap(string.strip, sys.stdin)

for id in stream:
    s, p, q = stream.next(), stream.next(), stream.next()
    # keep only reads whose average quality exceeds 30
    avg = average(q)
    if avg > 30:
        print "%s\n%s\n%s\n%s" % (id, s, p, q)

written 5.7 years ago by Istvan Albert

I'm aware this thread has just been resurrected, but I figured I'd comment that these snippets are perhaps a bit unfair on 'naive Python'. I haven't bothered with the second, but just a couple of simple tweaks make the first one run in under half the time on a random 1M-record FASTQ file:

from sys import stdin, stdout
# from itertools import izip as zip # Uncomment for Py2

for sid, seq, qid, quals in zip(*[iter(stdin)]*4):
    stdout.write('{}{}\n{}{}\n'.format(
        sid, seq.rstrip()[10:-10],
        qid, quals.rstrip()[10:-10]))

// EDIT // For the sake of completeness, here's the second:

from sys import stdin, stdout
# from itertools import izip as zip # Uncomment for Py2

for sid, seq, qid, quals in zip(*[iter(stdin)]*4):
    q = quals.rstrip()  # exclude the newline from the average
    if sum(ord(c) - 33 for c in q) / len(q) > 30:
        # each line still carries its own newline, so write them as-is
        stdout.write('{}{}{}{}'.format(sid, seq, qid, quals))

modified 3.5 years ago • written 3.5 years ago by george.ry

Interestingly, it is faster indeed.

I would have thought that all the work of creating four copies of the input stream and then expanding them into a parameter list to zip on each iteration would be slower than requesting the next item four times. I guess not; Python can work in mysterious ways.

(Just to mince words: it is not a naive solution though ;-), it is one optimized for some internals.)

modified 3.5 years ago • written 3.5 years ago by Istvan Albert

Yeah, perhaps it's not so naive :p

The trick behind that zip statement is how Python references the iter objects in memory ...

>>> [iter('foobar')]*2
[<str_iterator object at 0x7f9d68e5ba58>, <str_iterator object at 0x7f9d68e5ba58>]

... they actually all point to the same object in memory, rather than copies.

written 3.5 years ago by george.ry
Powered by Biostar version 2.3.0