Question: Filtering sanger_fastq, illumina_fastq files using Fastx-Toolkit
bioinfo wrote (4.7 years ago):

I am quality-filtering a large batch of mixed fastq files produced by multiple versions of Illumina platforms. The quality scores are therefore in sanger_fastq format (quality ASCII offset 33) for some files and in illumina1.3+_fastq format (quality ASCII offset 64) for others, and so on.

Case 1: If you run sanger_quality format files without the -Q33 parameter, you get the error message "fastq_quality_filter: Invalid quality score value...".

Case 2: If you wrongly use -Q33 for reads in illumina_quality format, you get error messages such as
segmentation fault (core dumped) or

>fastq_quality_filter -i file.fastq -o OUT -v -q 20 -p 50 -Q33
fastq_quality_filter: bug: got empty array at fastq_quality_filter.c:97

Is there any trick in fastx-toolkit that automatically detects the quality format (ASCII offset 33 or 64) and then does the quality filtering accordingly, without separating the mixed fastq files?

EDIT: http://en.wikipedia.org/wiki/FASTQ_format (some reading)

 S - Sanger        Phred+33,  raw reads typically (0, 40)
 X - Solexa        Solexa+64, raw reads typically (-5, 40)
 I - Illumina 1.3+ Phred+64,  raw reads typically (0, 40)
 J - Illumina 1.5+ Phred+64,  raw reads typically (3, 40)
     with 0=unused, 1=unused, 2=Read Segment Quality Control Indicator (bold) 
     (Note: See discussion above).
 L - Illumina 1.8+ Phred+33,  raw reads typically (0, 41)
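To make the overlap between these encodings concrete, here is a minimal Python sketch that turns each offset and typical score range from the table into the first and last quality character you would expect to see. The dictionary and the function name ascii_range are made up for illustration; the numbers are just the ones listed above.

```python
# Offsets and typical raw-read score ranges, copied from the table above.
ENCODINGS = {
    "Sanger": (33, 0, 40),
    "Solexa": (64, -5, 40),
    "Illumina 1.3+": (64, 0, 40),
    "Illumina 1.5+": (64, 3, 40),
    "Illumina 1.8+": (33, 0, 41),
}


def ascii_range(offset, low, high):
    """Return the (lowest, highest) quality character for one encoding."""
    return chr(offset + low), chr(offset + high)


for name, (offset, low, high) in ENCODINGS.items():
    first, last = ascii_range(offset, low, high)
    print(f"{name:14s} {first!r} .. {last!r}")
```

Characters from ';' (ASCII 59) up to 'J' (ASCII 74) can occur under either offset, which is why a file containing only those characters cannot be classified from its quality strings alone.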

 

 

SES wrote (4.7 years ago):

Is there any trick in fastx-toolkit that automatically detects the quality format (ASCII offset 33 or 64) and then does the quality filtering accordingly, without separating the mixed fastq files?

I don't think this exists in fastx-toolkit, but FastQC reports the encoding of the data, and you can run it from the command line to process many files. That may be the easiest route. There are also standalone scripts for detecting the encoding, which would be another option for building a trimming pipeline.

bioinfo wrote (4.7 years ago):

I am actually a FastQC fan, but this time I am running an in-house software pipeline where only certain tools such as Seqtk, Fastx-Toolkit and usearch are preinstalled. The pipeline runs its own fastq_quality_filter (from fastx-toolkit) at a certain step before any downstream analysis. So the only option I have is to fix the encoding and convert all (over 500!) fastq files to one quality-score encoding before feeding them into the pipeline. As you mentioned standalone scripts, do you have any links to scripts that do that?

I saw one option on the wiki page for the fastq format to convert Illumina 1.3 (phred64) to 1.8 (phred33). But to do that I first have to detect which files are in phred64 quality-score format, separate them out, and then convert them to phred33 so that everything is in phred33.
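For reference, the phred64 to phred33 conversion itself is just a shift of every quality character by 64 - 33 = 31. A rough Python sketch follows; the function names are hypothetical, and it assumes plain 4-line fastq records (no wrapped sequence lines) and input that really is Phred+64. The older Solexa+64 scores use a different scale and would need more than a shift.

```python
def phred64_to_phred33(quality):
    """Re-encode one Phred+64 quality string as Phred+33."""
    # Same Phred score, different ASCII offset: subtract 64 - 33 = 31.
    return "".join(chr(ord(c) - 31) for c in quality)


def convert_fastq(lines):
    """Yield fastq lines with every 4th line (the quality line) shifted.

    Assumes strict 4-line records and genuine Phred+64 input.
    """
    for i, line in enumerate(lines):
        line = line.rstrip("\n")
        yield phred64_to_phred33(line) if i % 4 == 3 else line


record = ["@read1\n", "ACGT\n", "+\n", "hhhh\n"]
print("\n".join(convert_fastq(record)))
```

Running this on a file that is already phred33 would corrupt it, so the detection step still has to come first.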


reformat.sh in BBTools will autodetect and convert qualities:

reformat.sh in=file.fq out=fixed.fq qout=33

If you do that, the files that were already ASCII-33 will be unchanged.  It can also do quality-filtering and trimming (with the trimq and maq flags), and unlike fastx-toolkit can handle paired reads.  Overall I'd recommend abandoning fastx-toolkit.

Note, by the way, that it is not possible to autodetect quality encoding with 100% confidence because ASCII-33 and ASCII-64/ASCII-66 can have values in the same range.
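To illustrate why detection can only be a best guess, here is a small Python heuristic. This is not how reformat.sh does it, just a sketch with a made-up function name, using the typical raw-read ranges from the FASTQ wiki table: characters below ';' (ASCII 59) only occur with offset 33, characters above 'J' (ASCII 74) normally only occur with offset 64, and anything in between is ambiguous.

```python
def guess_offset(quality_strings):
    """Return 33, 64, or None when the quality characters are ambiguous.

    Heuristic only: thresholds come from typical raw-read ranges and can be
    wrong for processed reads or for files with uniformly high qualities.
    """
    codes = [ord(c) for q in quality_strings for c in q]
    if not codes:
        return None
    if min(codes) < 59:   # below ';', the lowest Solexa+64 character
        return 33
    if max(codes) > 74:   # above 'J', the highest typical Phred+33 character
        return 64
    return None           # every character is valid under both offsets


print(guess_offset(["IIII#"]))       # '#' is below ';'
print(guess_offset(["hhhhB"]))       # 'h' is above 'J'
print(guess_offset(["@ABCDEFGHI"]))  # all characters fall in the overlap
```

In practice you would feed it the quality lines of the first few thousand reads of each file and treat a None result as "check this file by hand".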

written 4.7 years ago by Brian Bushnell

Wow, every time I look for a solution to a problem, I end up with very efficient new tools and packages that I didn't know about before. BBTools is a nice package and has many features. Thanks Brian.

written 4.7 years ago by bioinfo

Here are some links to discussions about standalone scripts: "Tool To Find Out If Fastq Is In Sanger Or Phred64 Encoding?" and http://seqanswers.com/forums/showthread.php?t=16562

Ideally you could capture the output of a script and pass it to fastx-trimmer, or follow Brian's suggestion.
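As a rough idea of what "capture the output and pass it on" could look like, here is a Python sketch that builds the fastq_quality_filter call from the question, given an offset that some detection script has already reported. The function name is hypothetical. The flags are the ones used in the question; leaving out -Q for offset 64 is an assumption based on Case 1 there, where sanger files fail without -Q33.

```python
import shlex


def build_filter_command(infile, outfile, offset):
    """Build the fastq_quality_filter call for a file with a known offset."""
    cmd = ["fastq_quality_filter", "-i", infile, "-o", outfile,
           "-v", "-q", "20", "-p", "50"]
    if offset == 33:
        cmd.append("-Q33")  # sanger / Illumina 1.8+ files need this flag
    elif offset != 64:      # assumption: no -Q flag means offset 64
        raise ValueError(f"unsupported quality offset: {offset}")
    return cmd


# Print the command instead of running it, so it can be inspected first.
print(shlex.join(build_filter_command("file.fastq", "OUT", 33)))
```

Once the printed commands look right, the list can be handed to subprocess.run(cmd, check=True) in a loop over all files.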

written 4.7 years ago by SES