Question: Fastq Quality Check
toshnam (Seoul, Republic of Korea) wrote 7.5 years ago:

Hi all,

I'm trying to check the sequencing quality of a FASTQ file from a HiSeq2000. I used the fastx_quality_stats tool from FASTX-Toolkit (version 0.0.13) for this, but I got the following error:

$ fastx_quality_stats -i 6_1.fastq -o 6_1.stats
fastx_quality_stats: Invalid quality score value (char '#' ord 35 quality value -29) on line 4

The FASTQ file really does contain the "#" character in its quality line:

@HWI-ST621:210:C03D4ACXX:4:1101:1475:1957 1:N:0:ATCACG
NACTACAATTTACAGATAACTTTAAATTAAATTTTGGAATCAAATATAAAGATTGAAAATGAATTTTGAATATATGAAAATCCATTTAAAGAGTTTGGTAC
+
#1=DDDFFHHDHHIIIJJEHIJJJJJIIIJFIGGJJJFICGIGGGIIJIEIIIIJIJIIIIHIIIJIGGIJIIIJGHIEHJJJHHHHHHHFFF;B@CA;;@

Is the "#" character an invalid quality score value? I was told this FASTQ file had been checked with the quality trim program of the NGS Cell package from CLCBio, and that the sequencing quality was good. So is the "#" character invalid only for FASTX-Toolkit?

I also used the Popoolation toolbox (version 1.2.2) to quality-trim the FASTQ files, and got the following results:

$ trim-fastq.pl --input1 6_1.fastq --input2 6_2.fastq --output trimmed

......................................................

FINISHED: end statistics
Read-pairs processed: 53675033
Read-pairs trimmed in pairs: 0
Read-pairs trimmed as singles: 0


FIRST READ STATISTICS
First reads passing: 0
5p poly-N sequences trimmed: 632578
3p poly-N sequences trimmed: 0
Reads discarded during 'remaining N filtering': 0
Reads discarded during length filtering: 53675033
Count sequences trimed during quality filtering: 53675033

Read length distribution first read
length  count


SECOND READ STATISTICS
Second reads passing: 0
5p poly-N sequences trimmed: 628623
3p poly-N sequences trimmed: 801
Reads discarded during 'remaining N filtering': 0
Reads discarded during length filtering: 53675033
Count sequences trimed during quality filtering: 53675033

Read length distribution second read
length  count

As you can see, all of the reads were discarded during quality trimming.
I've worked with GAII and HiSeq2000 sequence data before, but this is the first time I've seen this. I wonder whether the problem is caused by bad sequencing quality or by a mistake on my part.

I appreciate any help.
Thanks.


Solution 1. Use an alternative program such as FastQC. Solution 2. Use the -Q33 option with FASTX-Toolkit. Thanks, guys :-)

written 7.5 years ago by toshnam
Rm (Danville, PA) wrote 7.5 years ago:

Try adding the -Q33 option to the fastx command and running it again:

fastx_quality_stats -Q33 -i 6_1.fastq -o 6_1.stats

toni (Lyon) wrote 7.5 years ago:

It seems to be a problem of quality encoding in your file.

Apparently fastx toolkit assumes that your file uses the Illumina 1.3+ encoding (offset 64), which is why "#" (ASCII 35) is decoded as 35 - 64 = -29. Your file seems to use Sanger encoding, which has an offset of 33 instead of 64.
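
To make the arithmetic concrete, here is a small Python sketch (not part of fastx toolkit; `decode_quality` is just a hypothetical helper name) that decodes the first ten quality characters of the read in the question with both offsets:

```python
def decode_quality(quality_string, offset):
    """Convert an ASCII-encoded FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


# First ten characters of the quality line posted in the question.
qual = "#1=DDDFFHH"

# Offset 64 (Illumina 1.3+): the first characters come out negative, hence the error.
print(decode_quality(qual, 64))  # [-29, -15, -3, 4, 4, 4, 6, 6, 8, 8]

# Offset 33 (Sanger): every score is a plausible Phred value.
print(decode_quality(qual, 33))  # [2, 16, 28, 35, 35, 35, 37, 37, 39, 39]
```

With offset 33 the leading "#" is simply a quality of 2, which fits the "N" base call at the start of that read.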

Read this for further information on quality score encodings:

http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2847217/

There may be options in fastx-toolkit to handle this.


Thank you for your comment. As far as I know, the latest fastx-toolkit can in principle read both FASTQ types, Sanger and Solexa (please refer to the update news on the fastx-toolkit homepage). I also checked the manual of fastx_quality_stats and couldn't find any option for this problem.

written 7.5 years ago by toshnam
pmenzel wrote 7.5 years ago:

Yes, fastx toolkit doesn't work with the quality scores of some versions of the Illumina software.


fastx toolkit can handle other quality encodings. It isn't documented, but with e.g. -Q33 one can use Sanger-encoded data.

written 7.5 years ago by Jan Van Haarst

Check out FastQC, which is good and guesses the encoding internally: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
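
For what it's worth, a guess like this can be made from the lowest quality character in the file. The following Python sketch only illustrates the idea and is not FastQC's actual implementation; `guess_offset` and the cutoff of 59 are my own assumptions, based on ";" (ASCII 59) being the lowest character that the older 64-based Solexa/Illumina encodings can produce:

```python
def guess_offset(quality_strings):
    """Guess the Phred offset (33 or 64) from the lowest quality character.

    Heuristic only: assumes at least one non-empty quality string is given.
    """
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    if lowest < 59:
        # Characters below ';' cannot occur in 64-based encodings.
        return 33
    # Probably 64-based, although a Sanger file containing only
    # high-quality bases would also end up here.
    return 64


# First ten characters of the quality line posted in the question.
print(guess_offset(["#1=DDDFFHH"]))  # 33
```

For the quality line in the question the lowest character is "#" (35), so the guess is 33. A result of 64 is less certain and is best confirmed on a larger sample of reads.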

written 7.5 years ago by toni

FastQC is very popular: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

written 7.5 years ago by Daniel Swan

Really? Can you recommend any alternative free program to check sequencing quality of my fastq?

written 7.5 years ago by toshnam

Thanks, toni and Daniel. FastQC is working well with my FASTQ file.

written 7.5 years ago by toshnam

Thanks, Jan. I confirmed that the "-Q33" option works well with my FASTQ file.

written 7.5 years ago by toshnam

+1 for fastqc, love it.

written 7.5 years ago by pmenzel

Thanks Jan, I didn't know that either.

written 7.5 years ago by pmenzel

I also like SolexaQA a lot: http://solexaqa.sourceforge.net/

written 7.5 years ago by Marina Manrique

Note to commenters: Try to avoid using the comments as a place to answer the question. In this case the answer is what Jan Van Haarst mentions: one needs to pass the option -Q33 to the tool. Comments are for asking for clarification.

written 7.5 years ago by Istvan Albert