Question

Filtering sanger_fastq, illumina_fastq files using Fastx-Toolkit

1

10.3 years ago

bioinfo ▴ 840

I am quality-filtering a large batch of mixed fastq files produced by multiple versions of the Illumina platform. As a result, the quality scores are in sanger_fastq format (quality ASCII offset 33) for some files and in illumina1.3+_fastq format (quality ASCII offset 64) for others, and so on.

Case 1: If you use sanger_quality format files without the -Q33 parameter, you get the error message fastq_quality_filter: Invalid quality score value...

Case 2: If you wrongly use -Q33 for reads in illumina_quality format, you get errors such as segmentation fault (core dumped) or:

$ fastq_quality_filter -i file.fastq -o OUT -v -q 20 -p 50 -Q33
fastq_quality_filter: bug: got empty array at fastq_quality_filter.c:97

Is there any trick in fastx-toolkit that automatically detects the quality format (ASCII offset 33 or 64) and then does the quality filtering accordingly, without having to separate the mixed fastq files?

EDIT: http://en.wikipedia.org/wiki/FASTQ_format (Some reading)

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)

illumina fastx fastq filtering • 6.2k views

updated 3.1 years ago by Ram 45k • written 10.3 years ago by bioinfo ▴ 840

Ram · Answer 1 · 2015-04-04

Is there any trick in fastx-toolkit that automatically detects the quality format (ASCII offset 33 or 64) and then does the quality filtering accordingly, without having to separate the mixed fastq files?

I don't think this exists in fastx-toolkit, but fastqc will report the encoding of the data and you can run it from the command line for processing many files. That may be the easiest, though I know there are also some standalone scripts for detecting the encoding, so that would be another option for building a trimming pipeline.

Ram · Answer 2 · 2015-04-04

0

10.3 years ago

bioinfo ▴ 840

I am actually a fastqc fan, but this time I am running an in-house software pipeline where only certain tools, such as Seqtk, Fastx-Toolkit and usearch, are preinstalled. The pipeline runs its own fastq_quality_filter (from fastx-toolkit) at a certain step before any downstream analysis. The only option I have is to fix the encoding and convert all of the fastq files (over 500!) to ONE quality-score encoding before feeding them into the pipeline. You mentioned standalone scripts; do you have any links to scripts that do that?

I saw one option on the wiki page for the fastq format to convert illumina1.3 (phred64) to 1.8 (phred33). But to do that, I first have to detect which files use the phred64 quality-score format, then separate them out and convert them to phred33 so that everything is in phred33!
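For readers following along, the phred64-to-phred33 conversion mentioned above amounts to shifting every quality character down by 31 (64 minus 33). The following is a minimal sketch, not the method from the wiki page: the function names `phred64_to_phred33` and `convert_fastq` are hypothetical, and it assumes uncompressed four-line FASTQ records in Illumina 1.3+ (Phred+64) encoding. Old Solexa+64 scores are not Phred scores and would need a different conversion, so the sketch refuses them rather than guessing.

```python
def phred64_to_phred33(quality):
    """Shift one quality string from ASCII offset 64 to ASCII offset 33."""
    converted = []
    for ch in quality:
        score = ord(ch) - 64  # Phred score under the offset-64 assumption
        if score < 0:
            # Characters below '@' cannot be Phred+64 (they may be Solexa+64
            # or data that is already offset 33), so stop instead of guessing.
            raise ValueError("character %r is not valid Phred+64" % ch)
        converted.append(chr(score + 33))
    return "".join(converted)


def convert_fastq(in_path, out_path):
    """Rewrite a FASTQ file, converting only the quality line of each record.

    Assumes strict four-line records (no wrapped sequence or quality lines).
    """
    with open(in_path) as src, open(out_path, "w") as dst:
        for index, line in enumerate(src):
            if index % 4 == 3:  # the fourth line of each record holds qualities
                dst.write(phred64_to_phred33(line.rstrip("\r\n")) + "\n")
            else:
                dst.write(line)
```

Running this on a file that is already offset 33 would usually raise an error on the first low-quality character, which is a useful safety check but not a substitute for detecting the encoding first.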

updated 3.1 years ago by Ram 45k • written 10.3 years ago by bioinfo ▴ 840

1

reformat.sh in BBTools will autodetect and convert qualities:

reformat.sh in=file.fq out=fixed.fq qout=33

If you do that, the files that were already ASCII-33 will be unchanged. It can also do quality-filtering and trimming (with the trimq and maq flags), and unlike fastx-toolkit can handle paired reads. Overall I'd recommend abandoning fastx-toolkit.

Note, by the way, that it is not possible to autodetect quality encoding with 100% confidence because ASCII-33 and ASCII-64/ASCII-66 can have values in the same range.
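To illustrate why detection is only a heuristic, here is a minimal sketch of range-based guessing. It is not how reformat.sh works internally; the function name `guess_phred_offset` and the thresholds are assumptions derived from the score ranges in the table in the question (raw Phred+33 characters span ASCII 33 to 74, Solexa+64 starts at ASCII 59). It returns None when every observed character falls in the overlapping range, which is exactly the ambiguous case described above.

```python
import gzip
from itertools import islice


def guess_phred_offset(path, max_reads=10000):
    """Guess the quality offset (33 or 64) of a FASTQ file, or None if unclear.

    Assumes four-line records and raw Illumina-style scores (at most 41).
    """
    opener = gzip.open if path.endswith(".gz") else open
    lowest, highest = 255, 0
    with opener(path, "rt") as handle:
        # Lines 3, 7, 11, ... (0-based) are the quality strings.
        for qual in islice(handle, 3, max_reads * 4, 4):
            codes = qual.rstrip("\r\n").encode("ascii")
            if codes:
                lowest = min(lowest, min(codes))
                highest = max(highest, max(codes))
    if highest == 0:
        return None  # no quality characters were read
    if lowest < 59:
        return 33  # below ';' is impossible for Solexa+64 or Phred+64
    if highest > 74:
        return 64  # above 'J' is beyond Phred+33 for scores up to 41
    return None  # all characters lie in the overlap of the two encodings
```

Files for which this returns None (for example, a small sample of uniformly high-quality offset-33 reads) would need a larger sample or manual inspection.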

updated 3.1 years ago by Ram 45k • written 10.3 years ago by Brian Bushnell 20k

1

Wow, every time I look for a solution to a problem, I end up finding very efficient new tools and packages that I didn't know about before. BBTools is a nice package with many features. Thanks, Brian.

10.3 years ago by bioinfo ▴ 840

0

Here are some links to discussions about standalone scripts: Tool To Find Out If Fastq Is In Sanger Or Phred64 Encoding?, http://seqanswers.com/forums/showthread.php?t=16562

Ideally you could capture the output of a script and pass it to fastx-trimmer, or follow Brian's suggestion.
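As a rough sketch of that idea, the snippet below builds a per-file fastq_quality_filter command from a detected offset. The function name `build_filter_command` is hypothetical, and the offset is assumed to come from whichever detection script you choose. The flags are the ones shown in the question; per the behaviour described there, -Q33 is added only for offset-33 files, since the tool treats input as offset 64 when the flag is omitted.

```python
import shlex


def build_filter_command(in_path, out_path, offset, min_quality=20, min_percent=50):
    """Return the fastq_quality_filter argument list for one file.

    `offset` must be 33 or 64 and is assumed to come from a separate
    encoding-detection step.
    """
    command = [
        "fastq_quality_filter",
        "-i", in_path,
        "-o", out_path,
        "-v",
        "-q", str(min_quality),
        "-p", str(min_percent),
    ]
    if offset == 33:
        command.append("-Q33")  # Sanger / Illumina 1.8+ input
    elif offset != 64:
        raise ValueError("offset must be 33 or 64, got %r" % (offset,))
    return command


if __name__ == "__main__":
    # Print the command for inspection; it could instead be passed to
    # subprocess.run(command, check=True) on a machine with fastx-toolkit.
    print(shlex.join(build_filter_command("file.fastq", "OUT", 33)))
```

Keeping the command as a list rather than a single string avoids shell-quoting problems with unusual file names.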

10.3 years ago by SES 8.6k