Hi Biostars, I am new to handling sequencing data so this is my first time posting here!
I am trying to remove every read that has any base below a threshold quality score, here 32.
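To make the goal concrete, here is a minimal Python sketch of the filter I have in mind. It is only an illustration under my own assumptions: qualities are Phred+33 encoded, every record is exactly four lines, and the names min_phred, keep_read and filter_records are hypothetical helpers I made up, not part of any toolkit.

```python
import itertools
from typing import Iterable, Iterator, Tuple

Record = Tuple[str, str, str, str]  # header, sequence, separator, quality


def min_phred(quality_line: str, offset: int = 33) -> int:
    """Lowest Phred score in a quality string (offset 33 is an assumption)."""
    return min(ord(ch) - offset for ch in quality_line)


def keep_read(quality_line: str, threshold: int = 32, offset: int = 33) -> bool:
    """True only if every base in the read has quality >= threshold."""
    if not quality_line:
        return False  # treat empty reads as failing the filter
    return min_phred(quality_line, offset) >= threshold


def filter_records(lines: Iterable[str], threshold: int = 32,
                   offset: int = 33) -> Iterator[Record]:
    """Yield the 4-line FASTQ records whose bases all reach the threshold."""
    stripped = (line.rstrip("\n") for line in lines)
    while True:
        record = tuple(itertools.islice(stripped, 4))
        if len(record) < 4:
            return  # end of input (an incomplete trailing record is dropped)
        if keep_read(record[3], threshold, offset):
            yield record
```

With Phred+33, the character "A" is quality 32 and "@" is quality 31, so a read with quality string "IIAI" should be kept and one with "II@I" should be removed.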
I first tried fastq_quality_filter from the FASTX-Toolkit:
fastq_quality_filter -q 32 -p 100 -o output.fastq.gz
When I run FastQC on the output, the per-base boxplots show terrible quality scores. When I lower -p to something like 95, the quality scores are very good.
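For reference, this is how I understand the -q and -p options from the usage text: a read is kept when at least -p percent of its bases have quality -q or higher. The Python sketch below only restates that reading; it is not the tool's actual code, the Phred+33 offset is my assumption, and percent_at_or_above and passes_q_p are hypothetical names.

```python
def percent_at_or_above(quality_line: str, min_quality: int,
                        offset: int = 33) -> float:
    """Percentage of bases whose Phred score is >= min_quality."""
    if not quality_line:
        raise ValueError("empty quality string")
    scores = [ord(ch) - offset for ch in quality_line]  # offset is assumed
    passing = sum(1 for score in scores if score >= min_quality)
    return 100.0 * passing / len(scores)


def passes_q_p(quality_line: str, min_quality: int = 32,
               min_percent: float = 100.0, offset: int = 33) -> bool:
    """Keep the read if enough of its bases reach min_quality."""
    return percent_at_or_above(quality_line, min_quality, offset) >= min_percent


# A 20-base read with a single base at quality 31 ("@" in Phred+33)
read_quality = "I" * 19 + "@"
print(percent_at_or_above(read_quality, 32))      # 95.0
print(passes_q_p(read_quality, min_percent=95))   # True
print(passes_q_p(read_quality, min_percent=100))  # False
```

Under this reading, -p 100 should be the strictest setting and should leave only reads where every base is at or above 32, which is why the plots confuse me.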
Here is the -p 100 plot:
Here is the -p 95 plot:
Why is this?
I then got tired of the FASTX-Toolkit and its lacking documentation, so I downloaded prinseq (using Bioconda) and tried this command:
prinseq-lite.pl -fastq stdin -min_qual_score 32 -out_good null -out_bad stdout | gzip > output.fastq.gz
This is meant to filter out any read that has at least one base with a quality score below 32. (At first I tried saving only the -out_good output, but it seems that what is "good" is what passed the filter and is therefore all the reads with a min_qual_score below 32, so now I am taking the -out_bad output.)
Here is the "Good":
Here is the "Bad":
But this still fails: it produces yet another boxplot in which some positions have whiskers that go below 32.
I have no idea why none of my attempts achieve my goal, and I would love help understanding why.
Moreover, I have no idea what the "gold standard" for quality of sequencing reads is. If my current plan is really dumb, I'd love to know what would be better for me to do.
Thanks again in advance for any help!