Hi all,
I am looking at the Phred scores in my sequencing data and trying to see how the Phred score of each base correlates with the others, so that I can remove low-quality bases.
Is the Phred score affected by the position of the base across all reads (vertical), or by the other bases that belong to the same read (horizontal)?
If the correlation is vertical, does it make sense to keep only part of each read, e.g. positions 20 -> 40, to control the Phred score, since the other positions have low-quality Phred scores?
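To make the "vertical" versus "horizontal" distinction concrete, here is a minimal Python sketch. It assumes the quality strings use the Phred+33 encoding (the usual FASTQ convention, but check your own data), and the function names and toy quality strings are made up for illustration. It only summarises scores per position and per read; it is not a recommendation for any particular trimming strategy.

```python
def decode_phred33(quality_string):
    """Convert a FASTQ quality string to integer scores (Phred+33 assumed)."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def per_position_mean(quality_strings):
    """'Vertical' summary: mean score at each position across all reads."""
    decoded = [decode_phred33(q) for q in quality_strings]
    length = min(len(d) for d in decoded)  # stop at the shortest read
    return [sum(d[i] for d in decoded) / len(decoded) for i in range(length)]


def per_read_mean(quality_strings, start=0, end=None):
    """'Horizontal' summary: mean score per read, optionally over a window.

    start/end are 0-based slice bounds; every read is assumed to be
    longer than `start`, otherwise the window would be empty.
    """
    means = []
    for q in quality_strings:
        scores = decode_phred33(q)[start:end]
        means.append(sum(scores) / len(scores))
    return means


# Toy quality strings (hypothetical), three reads of five bases each.
toy_quals = ["IIII5", "IIH#5", "III+5"]

print(per_position_mean(toy_quals))        # one value per position
print(per_read_mean(toy_quals))            # one value per read
print(per_read_mean(toy_quals, 0, 3))      # same, restricted to a window
```

For a window such as positions 20 -> 40 you would pass `start=19, end=40` if the positions are counted from 1 and both ends are included.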
What do you mean by 'low nucleotide diversity'? If nucleotide diversity is low, shouldn't the signals be consistent, leading to high and trustworthy Phred scores?
Illumina sequencing generally expects the clusters in a sequencing field to have an even distribution of A, C, T and G, so that in any given sequencing cycle not every cluster shows fluorescence. If the sequenced base in a cycle is the same for every cluster (which can happen if you are sequencing amplicons), every cluster/spot fluoresces and the basecalling/spot registration software can get confused. Remember that these clusters are only microns apart from each other. When nucleotide diversity is low, this can lower the Q scores of the basecalls.
More here --> https://support-docs.illumina.com/SHARE/ClusterOptimize/Content/SHARE/ClusterOptimize/NucleotideDiversity.htm and https://emea.support.illumina.com/bulletins/2016/07/what-is-nucleotide-diversity-and-why-is-it-important.html/1000
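If you want to check whether your own reads show this problem, one rough way is to look at the base composition at each cycle. The following is only a sketch: the function names, the toy reads and the 0.8 threshold are all assumptions for illustration, not values taken from Illumina's documentation.

```python
from collections import Counter


def per_cycle_base_fractions(reads):
    """Fraction of A, C, G and T at each cycle (0-based) across all reads."""
    length = min(len(r) for r in reads)  # stop at the shortest read
    fractions = []
    for i in range(length):
        counts = Counter(r[i].upper() for r in reads)
        total = sum(counts.values())  # includes N or other symbols, if any
        fractions.append({b: counts.get(b, 0) / total for b in "ACGT"})
    return fractions


def low_diversity_cycles(reads, max_fraction=0.8):
    """Cycles where a single base exceeds max_fraction (hypothetical cutoff)."""
    return [
        i
        for i, frac in enumerate(per_cycle_base_fractions(reads))
        if max(frac.values()) > max_fraction
    ]


# Toy reads (hypothetical): the first two cycles are identical in every
# read, as can happen at the start of an amplicon.
toy_reads = ["ACGTA", "ACTGC", "ACGAT", "ACCTG"]

print(per_cycle_base_fractions(toy_reads)[0])
print(low_diversity_cycles(toy_reads))
```

Cycles flagged this way are the ones where lower Q scores from low diversity would be expected, so it can be worth comparing them against the per-position quality profile.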
Thank you for the explanation and documentation.