Filtering sanger_fastq, illumina_fastq files using Fastx-Toolkit
7.5 years ago
bioinfo ▴ 830

I am quality-filtering a large batch of mixed fastq files produced by multiple versions of Illumina platforms. The quality scores are therefore in sanger_fastq format (quality ASCII offset 33) for some files, in illumina1.3+_fastq format (quality ASCII offset 64) for others, and so on.

Case 1: If you run sanger_quality format files without the -Q33 parameter, you get the error message fastq_quality_filter: Invalid quality score value...

Case 2: But if you wrongly use -Q33 for reads in illumina_quality format, you get errors such as segmentation fault (core dumped) or

$ fastq_quality_filter -i file.fastq -o OUT -v -q 20 -p 50 -Q33
fastq_quality_filter: bug: got empty array at fastq_quality_filter.c:97


Is there any trick in fastx-toolkit that automatically detects the quality format (ASCII offset 33 or 64) and then does the quality filtering accordingly, without my having to separate the mixed fastq files?

S - Sanger        Phred+33,  raw reads typically (0, 40)
X - Solexa        Solexa+64, raw reads typically (-5, 40)
I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
    with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold)
    (Note: See discussion above).
L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

7.5 years ago
SES 8.5k

I don't think this exists in fastx-toolkit, but fastqc will report the encoding of the data and you can run it from the command line for processing many files. That may be the easiest, though I know there are also some standalone scripts for detecting the encoding, so that would be another option for building a trimming pipeline.
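For what it's worth, such a detection script can be quite short. Below is a minimal Python sketch, not any particular published script: it assumes plain 4-line FASTQ records (no wrapped sequence lines) and uses cutoffs taken from the typical ranges in the table quoted in the question. The function name `guess_phred_offset` and the `max_reads` parameter are made up for illustration.

```python
import gzip
from itertools import islice

def guess_phred_offset(path, max_reads=10000):
    """Guess the quality offset (33 or 64) of a FASTQ file; None if ambiguous."""
    opener = gzip.open if path.endswith(".gz") else open
    lo, hi = 255, 0
    with opener(path, "rt") as handle:
        # Assumption: 4 lines per record, so quality strings are lines 4, 8, 12, ...
        for qual in islice(handle, 3, max_reads * 4, 4):
            codes = [ord(c) for c in qual.rstrip("\r\n")]
            if codes:
                lo, hi = min(lo, min(codes)), max(hi, max(codes))
    if lo < 59:   # below ';' (Solexa+64 score -5): only an offset of 33 fits
        return 33
    if hi > 74:   # above 'J' (Phred+33 score 41): only an offset of 64 fits
        return 64
    return None   # every character seen lies in the overlapping range
```

Note that the second cutoff assumes Phred+33 scores do not exceed 41, which matches the table but may not hold for data from other platforms. A `None` result means the sampled reads cannot be told apart and the file needs a manual look.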

7.5 years ago
bioinfo ▴ 830

I am actually a fastqc fan, but this time I am running an in-house software pipeline where only certain tools such as Seqtk, Fastx-Toolkit and usearch are preinstalled. The software will use fastq_quality_filter (from fastx-toolkit) at a certain step before any downstream analysis. The only option I have is to fix the encoding by converting all (over 500!) fastq files to one particular quality-score encoding before putting them into the pipeline. You mentioned standalone scripts; do you have any links to scripts that do that?

I saw one option on the wiki page for the fastq format to convert illumina1.3 (phred64) to 1.8 (phred33). But to do that I first have to detect which files are in phred64 format, separate them out, and convert them to phred33 so that everything ends up in phred33!
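The conversion step itself is just a shift of each quality character by 31 (64 minus 33). A minimal Python sketch of that idea, assuming 4-line FASTQ records and true Phred+64 input (Solexa+64 uses a different scale and is rejected here if it contains negative scores); the function names are made up for illustration:

```python
def phred64_to_phred33(qual):
    """Re-encode one Phred+64 quality string as Phred+33."""
    scores = [ord(c) - 64 for c in qual]
    if min(scores, default=0) < 0:
        # A character below '@' (ASCII 64) cannot be a Phred+64 score.
        raise ValueError("quality string is not Phred+64")
    return "".join(chr(s + 33) for s in scores)

def convert_fastq(src, dst):
    """Copy a FASTQ file, re-encoding every 4th line (the quality line)."""
    with open(src) as fin, open(dst, "w") as fout:
        for i, line in enumerate(fin):
            if i % 4 == 3:
                line = phred64_to_phred33(line.rstrip("\r\n")) + "\n"
            fout.write(line)

print(phred64_to_phred33("hhB@"))  # scores 40, 40, 2, 0 -> "II#!"
```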

reformat.sh in BBTools will autodetect and convert qualities:

reformat.sh in=file.fq out=fixed.fq qout=33


If you do that, the files that were already ASCII-33 will be unchanged. It can also do quality-filtering and trimming (with the trimq and maq flags), and unlike fastx-toolkit can handle paired reads. Overall I'd recommend abandoning fastx-toolkit.

Note, by the way, that it is not possible to autodetect quality encoding with 100% confidence because ASCII-33 and ASCII-64/ASCII-66 can have values in the same range.
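To make the overlap concrete, here is a small Python illustration (the quality string is invented for the example): the same characters decode to plausible scores under both offsets, so nothing in the string itself reveals the encoding.

```python
def decode(qual, offset):
    """Turn a quality string into integer scores for a given ASCII offset."""
    return [ord(c) - offset for c in qual]

qual = "@ABCDEFGHIJ"               # ASCII codes 64 through 74
as_phred33 = decode(qual, 33)      # scores 31..41: fine for Illumina 1.8+
as_phred64 = decode(qual, 64)      # scores 0..10: fine for Illumina 1.3+

print(as_phred33)
print(as_phred64)
```

A file made up only of high-quality Phred+33 reads, or only of very low-quality Phred+64 reads, falls entirely in this range, which is why any detector has to return "unknown" in some cases.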

Wow.. every time I look for a solution to a problem, I end up finding very efficient new tools and packages that I didn't know about before. BBTools is a nice package with many features. Thanks Brian.

Ideally you could capture the output of a script and pass it to fastx-trimmer, or follow Brian's suggestion.
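As a rough Python sketch of the first option: once a script has reported the offset for a file, the filter command can be assembled to match. The flags are the ones from the command in the question. Leaving out -Q33 for offset-64 files is an assumption based on Case 1 above, where the tool rejects Sanger files unless -Q33 is given. The function name `filter_command` is made up.

```python
import shlex

def filter_command(fastq, out, offset, min_qual=20, min_percent=50):
    """Build a fastq_quality_filter call for a file with a known offset."""
    cmd = ["fastq_quality_filter", "-i", fastq, "-o", out, "-v",
           "-q", str(min_qual), "-p", str(min_percent)]
    if offset == 33:
        cmd.append("-Q33")  # assumption: only offset-33 files need this flag
    return cmd

# Requires Python 3.8+ for shlex.join.
print(shlex.join(filter_command("file.fastq", "OUT", 33)))
```

The returned list can be handed to `subprocess.run` directly, one call per file, so the mixed files never need to be sorted by hand.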