Question: How to extract reads passing a threshold for Mean Sequence Quality
newbinf wrote (14 months ago):

I have some RNA-seq data that I trimmed and then ran through FastQC to check the quality. In the R2 file, a large portion of the reads have poor quality scores; many of them fall in the red zone.

This shows that the overall quality of the R2 reads is relatively bad (most of which average in the red zone).

However, on closer examination of the Mean Sequence Quality of each read, there appear to be two populations: one large set of reads with bad quality throughout (hence the bad mean sequence quality) and one set of reads with good overall quality (hence the mean sequence quality scores > 30).

The mean sequence quality shows there are two populations

I would like to isolate the good quality reads, those with a mean sequence quality > 30. It looks like some mapping programs (like STAR) look at the average quality score at each base position. However, that would not help to isolate the good quality reads, because the average at each position combines the quality scores of both the bad and the good quality reads.

Is there a separate program out there that can remove the reads with low Mean Sequence Quality? Or do some alignment programs already take this into account? Or is my logic of separating reads of differing quality inherently flawed?

Also, I am assuming a good quality read has a Mean Sequence Quality > 30.

rna-seq • 614 views
genomax wrote (14 months ago):

You could use bbduk.sh in quality filter mode. You can find a detailed guide here. A representative command for single-end reads would be (for PE reads, use in1=, in2= and out1=, out2=):

$ bbduk.sh in=reads.fq out=clean.fq maq=10

This will discard reads with average quality below 10. If quality-trimming is enabled, the average quality will be calculated on the trimmed read.
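
To make concrete what a mean-quality filter does, here is a minimal Python sketch. It is not BBDuk's implementation. It assumes Phred+33 encoded, 4-line FASTQ records and uses the arithmetic mean of the per-base Phred scores; tools may compute the average differently, so the reads kept will not necessarily match what maq= keeps. The function names and the cutoff of 30 are only illustrative.

```python
import io


def mean_phred(quality_line, offset=33):
    """Arithmetic mean of the Phred scores in a FASTQ quality string (Phred+33 assumed)."""
    scores = [ord(ch) - offset for ch in quality_line]
    return sum(scores) / len(scores) if scores else 0.0


def filter_fastq(in_handle, out_handle, min_mean_q=30.0):
    """Copy 4-line FASTQ records with mean quality >= min_mean_q; return (kept, total)."""
    kept = total = 0
    while True:
        record = [in_handle.readline() for _ in range(4)]
        if not record[0]:  # end of input
            break
        total += 1
        if mean_phred(record[3].rstrip("\r\n")) >= min_mean_q:
            out_handle.writelines(record)
            kept += 1
    return kept, total


# Tiny in-memory example: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
demo = io.StringIO("@good\nACGT\n+\nIIII\n@bad\nACGT\n+\n####\n")
out = io.StringIO()
kept, total = filter_fastq(demo, out)
print(kept, total)  # prints: 1 2
```

With real files, pass open file handles instead of the StringIO objects. For paired-end data, R1 and R2 need to be filtered together so that the two files stay in sync; filtering one file on its own would break the pairing.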

"Also, I am assuming a good quality read has a Mean Sequence Quality > 30."

That depends. If you have a good reference sequence to align to, then data down to Q10 or Q15 may be acceptable. For any de novo work you would want to be stricter: Q25 or Q30 and above.
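
For reference, a Phred score Q corresponds to an expected per-base error probability of 10^(-Q/10). A short Python sketch of what the thresholds mentioned here mean in those terms (the function name is just illustrative):

```python
def phred_to_error_prob(q):
    """Expected per-base error probability for Phred score q: 10^(-q/10)."""
    return 10 ** (-q / 10)


for q in (10, 15, 25, 30):
    print(f"Q{q}: {phred_to_error_prob(q):.4f}")
# Q10: 0.1000
# Q15: 0.0316
# Q25: 0.0032
# Q30: 0.0010
```

So Q10 is roughly one expected error in 10 bases, while Q30 is one in 1000.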

BTW: This data does look a bit ugly. Make sure you also scan for and trim any adapter sequences.
