Question: Filtering Read Quality - Cant remove all reads that have a single base below a threshold
gravatar for tbb21
3 months ago by
tbb2110 wrote:

Hi Biostars, I am new to handling sequencing data so this is my first time posting here!

So I am trying to remove all reads with any of their base pairs below a threshold quality score, here 32.

I have tried using the FASTQ quality_filter using fastq_quality_filter -q 32 -p 100 -o output.fastq.gz but when I plot this with FastQC my plot has boxplots with terrible quality score values. When I lower the -p to something like 95 then the quality scores are very good.

Here is the -p 100 plot:


Here is the -p 95 plot:


Why is this?

I then got tired of the FASTQ package with its lacking documentation and downloaded prinseq (using Bioconda). I tried using this command -fastq stdin -min_qual_score 32 -out_good null -out_bad stdout | gzip > output.fastq.gz Which is meant to filter any read that has at least one base with a quality score below 32. (At first I tried saving only the -out_good but it seems that what is "good" is what passed the filter and is therefore all the reads with a min_qual_score below 32, so now I am taking the -out_bad).

Here is the "Good":


Here is the "Bad":


But this still fails me and creates yet another boxplot that has some bases with whiskers that go below 32.

I have no idea why none of my attempts are working to achieve my goal and would love help to know why.

Moreover, I have no idea what the "gold standard" for quality of sequencing reads is. If my currently plan is really dumb I'd love to know what would be better for me to do.

Thanks again in advance for any help!

ADD COMMENTlink modified 4 weeks ago by Biostar ♦♦ 20 • written 3 months ago by tbb2110

single read below a threshold or it is single base (in title)?

ADD REPLYlink written 3 months ago by Jeffin Rockey1.1k

Sorry its a single base. I have corrected this.

ADD REPLYlink written 3 months ago by tbb2110

Please use these instructions to add images to your original post from your FastQC scans so we can see what your plots look like: How to add images to a Biostars post

There is no gold standard for Q scores. It all depends on what you need to do. In general, if you are aligning to a good reference genome then you can get away without doing any Q-score filtering. Q scores down to Q10-15 will work in this case.

If you are planning on doing de novo assembly work then it would be justifiable to use Q-score filtering at Q25 and above.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax64k

Thanks re the gold standard and I have now saved and uploaded all of the relevant plots!

ADD REPLYlink written 3 months ago by tbb2110

Have you scanned/trimmed this data for presence of Illumina adapter sequences? It is possible that you have a few short reads and the whiskers may represent those bases (which generally will get low Q scores). In general your data has median Q scores of >32 as expected. Is this a new dataset with Sanger encoding or are you using something downloaded from SRA?

I also recommend that you take a look at program from BBMap suite. It is a scan/trim program among other things and has easily understandable options.

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax64k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1039 users visited in the last hour