I have just received data from an NGS run that I suspect was overclustered.
Read 1 is a 24 bp barcode of the following pattern
Following the 24 bp barcode, the sequence should be the same for every read.
Read 2 is a 19 bp sequence that should be a known mutant of a WT promoter, so most of the read 2 sequences should be very similar to each other.
I suspect the run is overclustered because the quality scores for read 1 are poor for the first 24 bp and then get much better. For read 2 the quality scores are much better throughout.
If the run was overclustered, it would make sense that the first 24 bp of read 1 have low quality, because at each barcode position there is only a 50/50 chance that two templates sharing a cluster carry the same nucleotide. Because the sequence is the same after the 24 bp barcode, it makes sense that the scores suddenly improve: from that point on, every template in a mixed cluster gives the same signal. It also makes sense that read 2 has better scores throughout, because most of the mutations are single point mutations. So, in general, the templates in a mixed cluster would only disagree at 1 or 2 positions.
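To make the observation above checkable, here is a minimal sketch of how the per-position mean quality could be computed from the quality strings of read 1. It assumes Phred+33 encoding (standard for recent Illumina output); the function names and the `barcode_len` parameter are my own placeholders, not from any particular tool.

```python
def mean_quality_per_position(quality_strings, offset=33):
    """Mean Phred score at each cycle, given FASTQ quality strings.

    Assumes Phred+33 encoding; reads may differ in length.
    """
    totals, counts = [], []
    for qual in quality_strings:
        for i, ch in enumerate(qual):
            if i == len(totals):
                totals.append(0)
                counts.append(0)
            totals[i] += ord(ch) - offset
            counts[i] += 1
    return [t / c for t, c in zip(totals, counts)]


def barcode_vs_constant(quality_strings, barcode_len=24):
    """Average the per-position means over the barcode and over the rest of the read."""
    means = mean_quality_per_position(quality_strings)
    barcode, rest = means[:barcode_len], means[barcode_len:]
    avg = lambda xs: sum(xs) / len(xs) if xs else float("nan")
    return avg(barcode), avg(rest)
```

A large gap between the two values returned by `barcode_vs_constant` would correspond to the pattern I am describing.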
If possible, I would like to salvage this run. Is there a commonly recommended Phred score threshold that I can use to filter reads? If so, should that threshold be the same for both the barcode (R1) and the mutant promoter (R2)?
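To make the question concrete, this is a rough sketch of the kind of filter I have in mind: keep a read pair only if every base in the barcode portion of R1 and every base in R2 meets a minimum Phred score. It assumes Phred+33 encoding, and the two thresholds (20 and 20) are placeholders rather than recommendations; the function names are hypothetical. For reference, a Phred score Q corresponds to an error probability of 10^(-Q/10), so Q20 is a 1% chance of a wrong base call and Q30 is 0.1%.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (assumes Phred+33)."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def passes_filter(r1_qual, r2_qual, barcode_len=24,
                  min_q_barcode=20, min_q_r2=20):
    """Keep a pair only if the barcode bases of R1 and all bases of R2 pass.

    The constant region of R1 after the barcode is ignored on purpose.
    Thresholds are placeholders and can differ between R1 and R2.
    """
    barcode_q = phred_scores(r1_qual)[:barcode_len]
    r2_q = phred_scores(r2_qual)
    if not barcode_q or not r2_q:
        return False
    return min(barcode_q) >= min_q_barcode and min(r2_q) >= min_q_r2
```

Requiring the minimum score to pass is the strictest variant; a mean-quality or expected-error cutoff would be a looser alternative.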