Question

How to extract reads passing a threshold for Mean Sequence Quality

0

Entering edit mode

6.8 years ago

newbinf • 0

I have some RNA-seq data that I trimmed and then ran through FastQC to check the quality. For the R2 file for my read, there seems to be a bad quality score for a large portion of the reads. A lot of the reads are in the red zone.

This shows that the overall quality of the R2 reads are relatively bad (most of which average in the red zone)

However, closer examination into the Mean Sequence Quality of each reads, it looks like there are two populations: one large set of reads with bad quality throughout the reads (hence the bad mean sequence quality) and one set of reads with overall good quality (hence the >30 mean sequence quality scores)

The mean sequence quality shows there are two populations

I would like to isolate the good quality reads with the >30 mean sequence quality. It looks like some mapping programs (like STAR) look at the quality score average for each base. However, that would not help to isolate the good quality reads because the calculated average for each base includes the quality scores of both the bad and good quality reads.

Is there a separate program out there that can remove the low Mean Sequence Quality reads. Or do some alignment programs already take this into account? Or, is my logic to separate reads of differing quality inherently flawed?

Also, I am assuming a good quality read has a Mean Sequence Quality > 30.

RNA-Seq • 3.1k views

ADD COMMENT • link updated 6.8 years ago by GenoMax 152k • written 6.8 years ago by newbinf • 0

score 1 · Accepted Answer · 2018-09-27

You could use bbduk.shin quality filter mode. You can find a detailed guide here. A representative command would be (for single end reads, use in1=, in2= and out1=,out2= for PE reads:

$ bbduk.sh in=reads.fq out=clean.fq maq=10

This will discard reads with average quality below 10. If quality-trimming is enabled, the average quality will be calculated on the trimmed read.

Also, I am assuming a good quality read has a Mean Sequence Quality > 30.

That depends. If you have a good reference sequence you are aligning to then data down to Q10 or Q15 may be acceptable. For any de novo work you would want to be strict. Q25 or Q30 and above.

BTW: This data does look a bit ugly. Make sure you also scan and trim for presence of adapter sequences.