I have a question regarding the way the Illumina pipeline generates its quality check status in the qseq files (11th column according to information from here: http://jumpgate.caltech.edu/wiki/QSeq):
Please take a look at this (representative) example (I've removed the machine ID):
1st paired-end read:
HWUSI-XXXXXX 11 7 120 19847 19200 0 1 .AATGATATAGAATGGAATTGAATGGAATGTGCGTGAATGGAATG BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB 1

2nd paired-end read:
HWUSI-XXXXXX 11 7 120 19847 19200 0 3 TCTATTCCTTTCTAATCCATTCAATTCCATTTCATTCGATTCCAT hfhghhcghhhghhhfff]ffhgchhhcghfheehdfdfafffff 1
According to my interpretation of the qseq format, the 1st paired-end read has passed Illumina QC ("1" in the last column of the line), even though every base has quality B (PHRED score 2), which suggests the whole read should be disregarded. How is it possible that this read passed QC? It is one mate of a paired-end read, and the matching read from the second file has also passed QC and has a much better overall PHRED score (see above). Could this be the reason? That is, does the Illumina pipeline consider the "overall" quality of both mates when the data are paired-end?
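To make the B(=2) reading concrete, here is a minimal Python sketch of how I am decoding the quality strings. It assumes the qualities are ASCII-encoded with an offset of 64 (PHRED+64), which is my assumption about this pipeline version; the function name phred_scores is just illustrative.

```python
def phred_scores(quality, offset=64):
    """Convert a quality string to a list of integer PHRED scores.

    The offset of 64 is an assumption (PHRED+64 encoding); other
    pipeline versions may use a different offset.
    """
    return [ord(ch) - offset for ch in quality]


# Under this assumption: 'B' -> 2 and 'h' -> 40
first_read = "BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB"
print(set(phred_scores(first_read)))
```

With this decoding, every position of the 1st read comes out at the same score, which is what prompted the question.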
My issue is that nearly 10% of the reads fall into this category (QC passed, yet Bs for all positions). At this stage I am planning to remove these reads prior to alignment, but I would appreciate some comments/answers from people who have seen similar reads in their experiments.
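For reference, this is roughly the filter I have in mind, as a hedged Python sketch. It assumes each qseq line has 11 tab-separated fields, with the quality string in the 10th column and the QC flag in the 11th; the function names (is_all_b, filter_qseq) are my own and not part of any Illumina tool.

```python
def is_all_b(quality):
    """True if the quality string is non-empty and consists only of 'B'."""
    return len(quality) > 0 and set(quality) == {"B"}


def filter_qseq(lines):
    """Drop reads that passed QC but have 'B' at every position.

    Assumes 11 tab-separated columns per line (quality = column 10,
    QC flag = column 11). Returns the kept lines and the number dropped.
    """
    kept = []
    dropped = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 11:
            raise ValueError("expected 11 columns, got %d" % len(fields))
        if fields[10] == "1" and is_all_b(fields[9]):
            dropped += 1
        else:
            kept.append(line)
    return kept, dropped
```

Note that this treats each file independently. For paired-end data the two files would have to be kept in sync (dropping the mate as well), which this sketch does not do.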
Thanks in advance!