Hi everyone. I have a question about how FastQC represents data. I have two cleaned Illumina files, 1.fq and 2.fq. When I analyzed them with FastQC, the first file showed no reads with a Phred score below 20. In contrast, the second file showed a lot of reads with a Phred score below 20. However, when I merged both files into one, the merged file did not show any Phred scores below 20. I have attached the images. Can somebody explain why this happens? Can the result be trusted?
[Image: FastQC plot, all files merged]
Thanks. Do you think it is useful to remove these lower-quality sequences? I did, and the dataset went down from 50M sequences to 35M.
There's no general rule for this; it really depends on the dataset. That said, losing 30% of the reads is usually a bit too much, and it doesn't look like you need such strict quality filtering. Which program did you use for quality control? I would suggest relaxing its options a bit in order to keep more reads. Remember that a Phred quality score of 20 means the estimated probability of a wrong base call is 1%, so the call is still very likely to be correct.
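To make the numbers above concrete, here is a minimal sketch of the standard Phred relation, P = 10^(-Q/10), together with the read-loss arithmetic from this thread. The function names `phred_to_error_prob` and `fraction_lost` are just illustrative, they don't come from any tool.

```python
def phred_to_error_prob(q):
    """Convert a Phred quality score Q to the estimated base-call error probability."""
    return 10 ** (-q / 10)


def fraction_lost(reads_before, reads_after):
    """Fraction of reads removed by a filtering step."""
    return (reads_before - reads_after) / reads_before


# Error probability for a few common Phred thresholds.
for q in (10, 20, 30, 40):
    print(f"Q{q}: error probability {phred_to_error_prob(q):.4%}")

# Read loss reported in this thread: 50M reads before filtering, 35M after.
print(f"reads lost: {fraction_lost(50_000_000, 35_000_000):.0%}")
```

So a base at Q20 is expected to be wrong about 1 time in 100, and a base at Q30 about 1 time in 1000.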
It would also be useful to look at the "Per sequence quality scores" plot, which FastQC also produces, to see whether you have a lot of reads with very bad overall quality.
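If you want a rough number to go with that plot, something like the sketch below counts the reads whose mean quality falls under a threshold. It is only an approximation of what the FastQC plot summarizes, not FastQC's own code, and it assumes Phred+33 encoding and plain 4-line FASTQ records (no wrapped sequences). The function names and the two-read demo data are made up for illustration.

```python
import io


def mean_read_qualities(handle, offset=33):
    """Yield the mean Phred score of each read in a 4-line-per-record FASTQ stream.

    Assumes Phred+33 encoding unless another offset is given.
    """
    for i, line in enumerate(handle):
        if i % 4 == 3:  # the 4th line of each record holds the quality string
            qual = line.rstrip("\n")
            if qual:
                yield sum(ord(c) - offset for c in qual) / len(qual)


def count_low_quality(handle, threshold=20, offset=33):
    """Return (reads with mean quality below threshold, total reads)."""
    total = low = 0
    for mean_q in mean_read_qualities(handle, offset):
        total += 1
        if mean_q < threshold:
            low += 1
    return low, total


# Made-up demo: 'I' encodes Q40 and '#' encodes Q2 in Phred+33.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n####\n")
low, total = count_low_quality(demo)
print(f"{low} of {total} reads have mean quality below 20")
```

For a real file you would pass an open file handle instead, for example `with open("2.fq") as fh: print(count_low_quality(fh))`.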