This tutorial also serves as the supporting information for Lecture 9 of the course titled Analyzing High Throughput Sequencing Data offered at Penn State.
In this tutorial we compare the efficiency and characteristics of several software tools designed to filter FASTQ files. We chose two simple but common operations and performed them with different tools. For some tasks we also provide what we call a naive implementation in various languages. These are implementations that would take a programmer about 5-10 minutes to create; they are for illustrative purposes only.
Note that the Fastx Toolkit exhibits a strange behavior: the binary downloaded for a Mac runs more than five times slower than the same code compiled locally on the same computer. The times displayed are for the faster, locally compiled version of the Fastx Toolkit. For the NGS Toolkit we have disabled the (otherwise quite helpful) FASTQ format detection subroutines.
The test data
s1.fq contains 1 million FASTQ records.
All times are reported in seconds (s). Tools are ordered by runtime.
Clipping means removing parts of each FASTQ record. For this test we remove 10 bases from the start and 10 bases from the end of each read.
2.5s with Seqtk:
seqtk trimfq -b 10 -e 10 data/s1.fq > data/tmp.fq
3.7s with the Fastx Toolkit:
fastx_trimmer -Q33 -f 10 -l 80 -i data/s1.fq -o data/tmp.fq
4.3s with Trimmomatic:
java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE data/s1.fq data/tmp.fq HEADCROP:10 CROP:90
6.4s with the naive Python implementation (no appreciable runtime difference when running via the PyPy JIT compiler):
python naive_clip.py < data/s1.fq > data/tmp.fq
pypy naive_clip.py < data/s1.fq > data/tmp.fq
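The naive_clip.py script itself is not listed in the post; a 5-10 minute implementation of this kind might look roughly like the following sketch (function names and details are assumptions, not the original script):

```python
import io

def clip_record(seq, qual, left=10, right=10):
    # Drop `left` bases from the start and `right` bases from the end
    # of both the sequence and its quality string.
    return seq[left:len(seq) - right], qual[left:len(qual) - right]

def clip_stream(infile, outfile, left=10, right=10):
    # A FASTQ record is 4 lines: header, sequence, separator, qualities.
    while True:
        header = infile.readline()
        if not header:
            break
        seq = infile.readline().rstrip("\n")
        plus = infile.readline()
        qual = infile.readline().rstrip("\n")
        seq, qual = clip_record(seq, qual, left, right)
        outfile.write(header)
        outfile.write(seq + "\n")
        outfile.write(plus)
        outfile.write(qual + "\n")

# toy demonstration on a single 30 bp record
out = io.StringIO()
clip_stream(io.StringIO("@read1\n" + "A" * 30 + "\n+\n" + "I" * 30 + "\n"), out)
```

Wiring `clip_stream(sys.stdin, sys.stdout)` into a script reproduces the `python naive_clip.py < in > out` style of invocation used above.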
10.3s with NGS Toolkit:
perl TrimmingReads.pl -i data/s1.fq -l 10 -r 10 -o data/tmp.fq
19.3s with Prinseq:
perl prinseq-lite.pl -fastq data/s1.fq -trim_left 10 -trim_right 10 -out_good data/tmp.fq
Trimming by Quality
For this task sequences are removed or shortened to meet a quality criterion.
We attempted to perform an operation where we either remove reads with an average quality below 30, or we trim back reads by removing bases with quality lower than 30. In the latter case we limit the trimming so that each read keeps a length of at least 50 bp.
2.1s with Seqtk. Seqtk uses the so-called modified Mott algorithm for trimming (trimming stops when it hits the limit, and no reads are ever discarded):
seqtk trimfq -l 50 data/s1.fq > data/tmp.fq
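Seqtk's actual implementation is in C and differs in its details, but the Mott-style idea can be sketched as a maximum-scoring interval search where each base contributes (error cutoff − error probability); the min_len fallback below is a simplification of the "never discard" behavior, and all names are assumptions:

```python
def mott_trim(qual, cutoff=0.05, phred_offset=33, min_len=30):
    """Return (start, end) of the best-scoring interval of the read.

    Each base scores (cutoff - error_probability), so high-quality
    bases contribute positive scores and low-quality ones negative.
    """
    scores = [cutoff - 10 ** -((ord(c) - phred_offset) / 10.0) for c in qual]
    best_score, best_lo, best_hi = 0.0, 0, 0
    running, start = 0.0, 0
    for i, s in enumerate(scores):
        running += s
        if running < 0:
            # the candidate interval went negative: restart after this base
            running, start = 0.0, i + 1
        elif running > best_score:
            best_score, best_lo, best_hi = running, start, i + 1
    if best_hi - best_lo < min_len:
        # simplification of "no reads are ever discarded": keep the whole read
        return 0, len(qual)
    return best_lo, best_hi
```

Given `lo, hi = mott_trim(qual)`, the trimmed read is `seq[lo:hi]` with qualities `qual[lo:hi]`.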
3.1s with a custom Lua program running under the LuaJIT compiler (see comments for more info):
luajit lua/trim.lua 30 50 < data/s1.fq > data/tmp.fq
4.8s with Trimmomatic, trimming back trailing bases with quality under 30 (removes 0.37% of reads):
java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 data/s1.fq data/tmp.fq TRAILING:30 MINLEN:50
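As a rough illustration of what a TRAILING:30 plus MINLEN:50 combination does, here is a sketch of the idea (this is not Trimmomatic's actual code; the function name and the None-on-discard convention are assumptions):

```python
def trim_trailing(seq, qual, min_q=30, min_len=50, phred_offset=33):
    # Remove bases from the 3' end, one at a time, while the last
    # remaining base has Phred quality below min_q (TRAILING:30-style).
    end = len(qual)
    while end > 0 and ord(qual[end - 1]) - phred_offset < min_q:
        end -= 1
    if end < min_len:
        return None  # too short after trimming: drop the read (MINLEN:50-style)
    return seq[:end], qual[:end]
```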
6.8s with the Fastx Toolkit; this operation keeps only reads in which at least 50% of the bases have quality 30 or above:
fastq_quality_filter -Q33 -q 30 -p 50 -i data/s1.fq -o data/tmp.fq
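The percent-of-bases criterion is straightforward to express; a hedged sketch of the idea behind `-q 30 -p 50` (not the Fastx Toolkit's actual code):

```python
def passes_quality_filter(qual, min_q=30, min_percent=50, phred_offset=33):
    # Keep the read only if at least min_percent of its bases have
    # Phred quality >= min_q.
    n_good = sum(1 for c in qual if ord(c) - phred_offset >= min_q)
    return 100.0 * n_good / len(qual) >= min_percent
```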
13.3s with the naive Python implementation that removes reads with an average quality lower than 30, run via the PyPy JIT compiler:
pypy naive_trim.py < data/s1.fq > data/tmp.fq
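The naive_trim.py script is likewise not listed in the post; a minimal sketch of such an average-quality filter might look like this (function names are assumptions):

```python
import io

def mean_quality(qual, phred_offset=33):
    # Average Phred score over the quality string.
    return sum(ord(c) - phred_offset for c in qual) / float(len(qual))

def filter_stream(infile, outfile, min_mean=30):
    # Copy through only the 4-line records whose mean quality
    # reaches min_mean; all other reads are dropped whole.
    while True:
        record = [infile.readline() for _ in range(4)]
        if not record[0]:
            break
        if mean_quality(record[3].rstrip("\n")) >= min_mean:
            outfile.writelines(record)
```

Hooking `filter_stream(sys.stdin, sys.stdout)` into a script gives the stdin/stdout invocation style used above.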
20.0s with cutadapt, which also implements a Mott-type trimming:
cutadapt -q 30 -m 50 data/s1.fq > data/tmp.fq
20.5s with the NGS Toolkit, trimming back bases with quality below 30 from the end (removes 0.37% of reads):
perl TrimmingReads.pl -i data/s1.fq -q 30 -n 50 -o data/tmp.fq
22.3s with the naive Python version that removes reads with an average quality lower than 30:
python naive_trim.py < data/s1.fq > data/tmp.fq
37.8s with Biopieces, trimming back bases until it reaches a stretch of 3 bases of quality 30 (removes 0.56% of reads):
read_fastq -i data/s1.fq | trim_seq -m 30 | grab -e 'SEQ_LEN>=50' | write_fastq -o data/tmp.fq -x
85.6s with Prinseq removing bases with a quality lower than 30 (removes 0.37% of reads):
perl prinseq-lite.pl -fastq data/s1.fq -trim_qual_right 30 -min_len 50 -out_good data/tmp.fq
Each tool also provides other functions that we have not discussed here. What we present is a benchmark meant to give newcomers an idea of the relative strengths and weaknesses of the various tools. When running a large number of tools there is always the risk of misusing one or more of them and thus concluding that the tool is inefficient or faulty. Please comment below with any corrections.
We can only marvel at the efficiency with which Seqtk reads and processes data. When trimming, it blazes through 1 million reads in a mere 2.1 seconds. At the same time, the simplicity of its invocation and documentation is nothing short of astounding.
Trimmomatic is also a noteworthy competitor, with uncommon speed and efficiency. Alas, the complexity of its command line presents a heavy and unnecessary burden, and the choice to ignore many decades of tradition in passing arguments to a program is also unfortunate. Should any program ever be called org.usadellab.trimmomatic.TrimmomaticSE? The answer is a resounding no.
At the other extreme, we were very surprised by what we consider the subpar performance of the Prinseq tool. This software tool has a particularly pleasing webpage and documentation that extol qualities that do not seem to be reflected in the performance of the tool itself.
- Fastx Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
- Seqtk: https://github.com/lh3/seqtk
- PrinSeq: http://prinseq.sourceforge.net/
- NGS QC Toolkit: http://www.nipgr.res.in/ngsqctoolkit.html
- Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
- BioPieces: http://code.google.com/p/biopieces/
- Cutadapt: http://code.google.com/p/cutadapt/
A comparison of code written in different languages parsing FASTQ files can be found here: How To Efficiently Parse A Huge Fastq File?
Very good and straightforward tutorial! There is also SolexaQA for trimming: http://solexaqa.sourceforge.net/
Could you add cutadapt? It seems a useful tool as well. http://code.google.com/p/cutadapt/
Added the tool - it takes about 20s, which is pretty good considering that it has a far more sophisticated trimming algorithm than most.
Thanks! Adding some characters to get the comment accepted...
I wrote a simple one in lua here that does either a moving window or a hard trim: https://gist.github.com/1420220
Freakishly fast with luajit: just 3.11 seconds!
Here is another one I just came across: reaper http://www.ebi.ac.uk/~stijn/reaper/reaper.html
In Trimmomatic, when they say LEADING or TRAILING bases under a certain quality score, how many leading or trailing bases are they referring to? Is it just one?
Please don't add the above as a new answer to this thread. Ask a new question or add a new comment.
I have just been using Trimmomatic and the command line execution is now much simpler than in the example. Someone has gone to the effort of making it much easier to run!
is this the link to all the whole course? http://www.personal.psu.edu/users//i/u/iua1/courses/2011-BMMB-597D.html
course links are occasionally "rearranged" but you can always navigate there via my home page http://www.personal.psu.edu/iua1/index.html
I was using the NGS QC Toolkit (IlluQC.pl), and it keeps throwing an error:
Please help me! Thank you.
I suggest you make a new post with this issue. This isn't an answer to this original thread and people won't be looking here to help you.
OK, thank you anyway!
Very informative! Sorry to comment on an aging post - but I think I might see an error:
Specifically, regarding "we attempted to perform an operation where we either remove reads that have an average quality of 30 or we trim back reads removing bases that have quality lower than 30": the call to seqtk trimfq doesn't seem to specify the appropriate error probability threshold. Wouldn't it be necessary to set "-q 0.001"? (The default is 0.05, which, if my understanding is correct, corresponds to a quality score of ~13.)
Yes, there is a little inconsistency there; the way seqtk works, the desired operation cannot be easily translated into parameters. I forget the exact details, but I believe it keeps a running sum of the error rather than considering each base individually, so the concept of a sharp cutoff does not apply.
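For reference, the Phred relation underlying this exchange is Q = -10 * log10(p); a quick check of the numbers mentioned above:

```python
import math

def phred_to_error(q):
    # Error probability for Phred quality Q: p = 10^(-Q/10)
    return 10 ** (-q / 10.0)

def error_to_phred(p):
    # Phred quality for error probability p: Q = -10 * log10(p)
    return -10 * math.log10(p)

print(round(error_to_phred(0.05)))  # seqtk's default -q 0.05 is roughly Q13
print(phred_to_error(30))           # Q30 corresponds to p = 0.001
```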
Hi, can you make s1.fq available via FTP? Thanks.