This tutorial also serves as the supporting information for Lecture 9 of the course titled Analyzing High Throughput Sequencing Data offered at Penn State.
In this tutorial we compare the efficiency and characteristics of several software tools designed to filter FASTQ files. We chose two simple but common operations and performed them with different tools. For some tasks we also provide what we call a naive implementation in various languages.
These are implementations that would take a programmer about 5-10 minutes to create, and they are included for illustrative purposes only.
We want to note that the Fastx Toolkit exhibits strange behavior: the binary downloaded for a Mac runs more than five times slower than a binary compiled locally on the same computer. The times displayed are for the faster version of the Fastx Toolkit. For the NGS Toolkit we disabled the (otherwise quite helpful) FASTQ format detection subroutines.
The test data
s1.fq contains 1 million FASTQ records.
All times are reported in seconds (s). Tools are ordered by runtime.
Clipping
Clipping removes parts of each FASTQ record. For this test we remove 10 bases from the start and the end of each read.
2.5s with Seqtk:
seqtk trimfq -b 10 -e 10 data/s1.fq > data/tmp.fq
3.7s with the Fastx Toolkit:
fastx_trimmer -Q33 -f 10 -l 80 -i data/s1.fq -o data/tmp.fq
4.3s with Trimmomatic:
java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE data/s1.fq data/tmp.fq HEADCROP:10 CROP:90
6.4s with the naive Python implementation (running it through the PyPy JIT made no appreciable difference):
python naive_clip.py < data/s1.fq > data/tmp.fq
pypy naive_clip.py < data/s1.fq > data/tmp.fq
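The naive_clip.py used for the timings is not reproduced in this post; a sketch of what such a naive clipper might look like follows. The function name and the hard-coded clip width of 10 are our own choices, not taken from the benchmarked script:

```python
def clip_fastq(stream, out, left=10, right=10):
    # Read FASTQ records four lines at a time and trim the
    # sequence and quality strings by the same amounts.
    while True:
        header = stream.readline()
        if not header:
            break
        seq = stream.readline().rstrip("\n")
        plus = stream.readline()
        qual = stream.readline().rstrip("\n")
        end = len(seq) - right
        out.write(header)
        out.write(seq[left:end] + "\n")
        out.write(plus)
        out.write(qual[left:end] + "\n")

# Intended usage, mirroring the commands above:
#   python naive_clip.py < data/s1.fq > data/tmp.fq
# i.e. clip_fastq(sys.stdin, sys.stdout)
```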
10.3s with NGS Toolkit:
perl TrimmingReads.pl -i data/s1.fq -l 10 -r 10 -o data/tmp.fq
19.3s with Prinseq:
perl prinseq-lite.pl -fastq data/s1.fq -trim_left 10 -trim_right 10 -out_good data/tmp.fq
Trimming by Quality
For this task sequences are removed or shortened to meet a quality criterion.
We attempted to perform an operation where we either remove reads that have an average quality of 30 or we trim back reads removing bases that have quality lower than 30. In this latter case we limit the trimming to maintain a length of at least 50 bp.
2.1s with Seqtk. Seqtk uses the so-called modified Mott algorithm for trimming (trimming stops when it hits the limit, and no reads are ever discarded):
seqtk trimfq -l 50 data/s1.fq > data/tmp.fq
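For the curious, the idea behind Mott-style trimming can be sketched as below. This is a simplified illustration, not seqtk's actual implementation (seqtk scores bases by error probability rather than raw quality, and the names here are invented): score each base by how far its quality exceeds a cutoff, then keep the contiguous stretch with the maximal score sum, which is the classic maximum-subarray problem:

```python
def mott_trim(quals, cutoff=30):
    # quals: list of Phred quality scores for one read.
    # Returns the half-open interval [start, end) to keep:
    # the contiguous segment maximizing sum(q - cutoff),
    # found with Kadane's maximum-subarray algorithm.
    best_sum = best_start = best_end = 0
    run_sum = run_start = 0
    for i, q in enumerate(quals):
        run_sum += q - cutoff
        if run_sum < 0:
            # A negative running score can never help; restart here.
            run_sum = 0
            run_start = i + 1
        elif run_sum > best_sum:
            best_sum = run_sum
            best_start, best_end = run_start, i + 1
    return best_start, best_end
```

Note that the segment boundaries fall at the low-quality edges of the read, so in the common case only the ends are trimmed and the read is never discarded outright, matching the behavior described above.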
3.1s with a custom Lua program running under the LuaJIT compiler (see comments for more info):
luajit lua/trim.lua 30 50 < data/s1.fq > data/tmp.fq
4.8s with Trimmomatic, trimming back trailing bases with qualities under 30 (removes 0.37% of reads):
java -classpath trimmomatic-0.22.jar org.usadellab.trimmomatic.TrimmomaticSE -phred33 data/s1.fq data/tmp.fq TRAILING:30 MINLEN:50
6.8s with the Fastx Toolkit; this operation keeps only reads in which at least 50% of bases have a quality of 30 or higher:
fastq_quality_filter -Q33 -q 30 -p 50 -i data/s1.fq -o data/tmp.fq
13.3s with the naive Python implementation that removes reads whose average quality is lower than 30, run via the PyPy JIT compiler:
pypy naive_trim.py < data/s1.fq > data/temp.fq
20.0s with cutadapt, which also implements Mott-style trimming:
cutadapt -q 30 -m 50 data/s1.fq > data/tmp.fq
20.5s with the NGS Toolkit, trimming back bases with quality below 30 from the end (removes 0.37% of reads):
perl TrimmingReads.pl -i data/s1.fq -q 30 -n 50 -o data/tmp.fq
22.3s with the naive Python version that removes reads whose average quality is lower than 30:
python naive_trim.py < data/s1.fq > data/temp.fq
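The naive_trim.py itself is not shown in this post; a filter in the same spirit might look like the following sketch. The function names are ours, and Phred+33 quality encoding is assumed:

```python
def mean_quality(qual):
    # Decode Phred+33 ASCII qualities and return the average score.
    scores = [ord(c) - 33 for c in qual]
    return sum(scores) / len(scores)

def filter_fastq(stream, out, min_avg=30):
    # Keep only FASTQ records whose mean base quality is >= min_avg.
    while True:
        rec = [stream.readline() for _ in range(4)]
        if not rec[0]:
            break
        if mean_quality(rec[3].rstrip("\n")) >= min_avg:
            out.writelines(rec)

# Intended usage: filter_fastq(sys.stdin, sys.stdout)
```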
37.8s with Biopieces, trimming back bases until it reaches a stretch of 3 bases of quality 30 (removes 0.56% of reads):
read_fastq -i data/s1.fq | trim_seq -m 30 | grab -e 'SEQ_LEN>=50' | write_fastq -o data/tmp.fq -x
85.6s with Prinseq, removing bases with a quality lower than 30 (removes 0.37% of reads):
perl prinseq-lite.pl -fastq data/s1.fq -trim_qual_right 30 -min_len 50 -out_good data/tmp.fq
Each tool also provides other functions that we have not discussed here. What we present is a benchmark meant to give newcomers an idea of the relative strengths and weaknesses of the various tools. When running a large number of tools there is always the risk of misusing one or more of them and thus wrongly concluding that a tool is inefficient or faulty. Please leave a comment below with any corrections.
We can only marvel at the efficiency with which Seqtk reads and processes data. When trimming, it blazes through 1 million reads in a mere 2.1 seconds. At the same time, the simplicity of its invocation and documentation is nothing short of astounding.
Trimmomatic is also a noteworthy competitor with uncommon speed and efficiency. Alas, the complexity of its command line presents a heavy and unnecessary burden, and the choice to ignore many decades of tradition in how arguments are passed to a program is also unfortunate. Should any program ever be called
org.usadellab.trimmomatic.TrimmomaticSE? The answer is a resounding no.
At the other extreme, we were very surprised by what we consider the subpar performance of the Prinseq tool. This software has a particularly pleasing webpage and documentation that extol qualities that do not seem to be reflected in the performance of the tool itself.
- Fastx Toolkit: http://hannonlab.cshl.edu/fastx_toolkit/
- Seqtk: https://github.com/lh3/seqtk
- PrinSeq: http://prinseq.sourceforge.net/
- NGS QC Toolkit: http://www.nipgr.res.in/ngsqctoolkit.html
- Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
- BioPieces: http://code.google.com/p/biopieces/
- Cutadapt: http://code.google.com/p/cutadapt/