Here is a fun riddle I have been working on for the last day or two: can Q10 data be better than Q30 data? I have been trying to figure out the fairest way to filter out "low" quality data. We all agree that the Phred scale is helpful with this question, with Q10 representing a 1-in-10 error rate, i.e. 90% base-call accuracy, and Q20 and Q30 corresponding to 99% and 99.9% accuracy.
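Just to keep the numbers straight, the Phred definition is error probability = 10^(-Q/10); a couple of lines of Python to confirm the accuracies quoted above:

```python
# Phred scale: per-base error probability = 10^(-Q/10)
for q in (10, 20, 30):
    p_err = 10 ** (-q / 10)
    print(f"Q{q}: error prob = {p_err:.4f}, accuracy = {1 - p_err:.4f}")
# Q10: error prob = 0.1000, accuracy = 0.9000
# Q20: error prob = 0.0100, accuracy = 0.9900
# Q30: error prob = 0.0010, accuracy = 0.9990
```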
Using the fastx toolkit to filter out lower quality reads, you are allowed to tweak the minimum Q score and the percent of each read that has to meet that minimum. For example, if I require Q30 across 100% of each read, I know for a fact that every base in my reads is at least 99.9% likely to be correct. Great! But the highest level of quality I can have also leaves me with almost no data to work with (I'm working with Ion Torrent data; my average now is around Q33, but earlier data was around Q28), so we make trade-offs. How much error am I able to tolerate while still having "high" quality data? So I say that I can live with Q20 across 100% of each read and only keep reads that are at least 99% correct at every base.
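If I have the fastx flags right, that is the -q (minimum quality) and -p (minimum percent of bases at that quality) options of fastq_quality_filter. A minimal Python sketch of the keep/discard rule as I understand it (not the toolkit's actual code):

```python
def keep_read(phred_scores, min_q, min_pct):
    """Keep a read if at least min_pct% of its bases have quality >= min_q
    (my understanding of fastq_quality_filter -q min_q -p min_pct)."""
    n_ok = sum(1 for q in phred_scores if q >= min_q)
    return 100.0 * n_ok / len(phred_scores) >= min_pct

# Hypothetical 10-base read with four bases below Q30
read_quals = [35, 34, 33, 30, 28, 25, 33, 34, 12, 9]
print(keep_read(read_quals, min_q=30, min_pct=100))  # False: not every base is >= Q30
print(keep_read(read_quals, min_q=30, min_pct=60))   # True: 6/10 = 60% of bases are >= Q30
```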
However, here is where it gets fun. Let's say I just want reads that have 80% of their bases at Q30; this means I would have roughly a 20% error rate across all the bases, if this is how you figure it: 1 - (0.8 * 0.999), where 0.8 is the 80% of the read held to the minimum and 0.999 is the accuracy at that minimum quality. So that would imply that a 90% Q20 setting would have less error than 80% Q30, and that 100% Q10 data would have the least error of all.
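To make that concrete, here are the three settings run through the same 1 - (fraction * accuracy) formula (this just reproduces the arithmetic above, nothing more):

```python
# Implied error under the formula above: 1 - (fraction_of_read * per_base_accuracy)
settings = {
    "80% of read at Q30":  (0.80, 0.999),
    "90% of read at Q20":  (0.90, 0.990),
    "100% of read at Q10": (1.00, 0.900),
}
for name, (frac, acc) in settings.items():
    print(f"{name}: implied error = {1 - frac * acc:.4f}")
# 80% of read at Q30:  implied error = 0.2008
# 90% of read at Q20:  implied error = 0.1090
# 100% of read at Q10: implied error = 0.1000
```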
So if this is correct, should all data just be filtered to at least Q10 across 100% of every read and then move forward in the pipeline?