I'm trying to figure out where I could go wrong with the following analysis. I'm relatively new to bioinformatics in general and deep sequencing in particular. I haven't found any papers that do what I'm proposing below and I don't have a senior colleague who has been down these roads before. So thanks in advance for your help!
I'm working with poliovirus samples that have been deep sequenced using Illumina with 36 bp reads (I'm not sure which Illumina variant, but the Phred range is 0 to 93, which narrows it down). Most of the samples contain one or more of the 3 known vaccine strains, so it is possible to determine consensus alignments to the known references. For each reference, the consensus alignment is assigned per-base Phred scores.
It's my understanding that the Phred scores can be used to identify the closest matching reference. In an "easy" sample, one reference alignment will have Phred scores almost universally at 93 (a per-base error probability of about 1 in 2 billion), while the other two reference alignments have Phred scores in non-conserved regions that are typically 0 (most common) and range up to 40. Typing the samples this way, by looking for a single high-quality whole-genome match, agrees with other assays of poliovirus type.
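For reference, this is the conversion I'm assuming between a Phred score Q and the per-base error probability, P = 10^(-Q/10). It is a minimal sketch and the function names are my own, not from any particular tool. With this formula, Q = 93 gives roughly 5e-10, which is where the "1 in 2 billion" figure comes from.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred quality score Q to an error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Inverse conversion: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Print the scores mentioned above: 0 (typical mismatch), 40, and 93 (maximum).
for q in (0, 40, 93):
    print(f"Q={q:2d}  P(error)={phred_to_error_prob(q):.3g}")
```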
Regarding mixtures (co-infection), there are samples with maximum-quality whole-genome matches to 2 reference types. My understanding is that this means the read depth across the whole genome is more than high enough for both references to claim that both strains are present in the sample.
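To make the mixture rule concrete, here is a rough sketch of how I would call it, assuming each reference's consensus scores are available as a plain list of integers. The function names, the thresholds (`min_q=90`, `min_fraction=0.95`), and the toy data are all made up for illustration.

```python
def high_quality_fraction(scores, min_q=90):
    """Fraction of positions whose consensus Phred score is at least min_q."""
    if not scores:
        return 0.0
    return sum(q >= min_q for q in scores) / len(scores)


def call_references(scores_by_ref, min_q=90, min_fraction=0.95):
    """Return the references that are high quality over most of the genome."""
    return sorted(
        ref
        for ref, scores in scores_by_ref.items()
        if high_quality_fraction(scores, min_q) >= min_fraction
    )


# Toy data (invented): two references match almost everywhere, one does not.
toy_scores = {
    "type1": [93] * 100,
    "type2": [93] * 98 + [40, 0],
    "type3": [93] * 30 + [0] * 70,
}
calls = call_references(toy_scores)
print(calls)  # two or more calls would be read as a candidate mixture
```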
Regarding recombinants, there are samples where 1 reference has maximal read depth over some large and structurally meaningful segment of the genome while another reference is covered poorly in that region but maximally covered everywhere else. To me, this suggests that the sample is a fixed recombinant of the two types.
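A rough sketch of the recombinant check I have in mind, again assuming per-base consensus scores as equal-length lists for at least two references. The window size, the `min_gap` threshold, the function names, and the toy data are all hypothetical. Windows where the top two references score about the same (for example conserved regions) are left unlabeled rather than assigned to either type.

```python
def window_means(scores, window):
    """Mean score in consecutive non-overlapping windows."""
    means = []
    for start in range(0, len(scores), window):
        chunk = scores[start:start + window]
        means.append(sum(chunk) / len(chunk))
    return means


def dominant_reference_by_window(scores_by_ref, window=50, min_gap=20):
    """Label each window with the best-scoring reference, or None if the
    top two references are within min_gap of each other."""
    means = {ref: window_means(s, window) for ref, s in scores_by_ref.items()}
    n_windows = len(next(iter(means.values())))
    labels = []
    for w in range(n_windows):
        ranked = sorted(((m[w], ref) for ref, m in means.items()), reverse=True)
        best, runner_up = ranked[0], ranked[1]
        labels.append(best[1] if best[0] - runner_up[0] >= min_gap else None)
    return labels


def candidate_breakpoints(labels, window=50):
    """Approximate positions where the dominant reference switches,
    skipping over unlabeled (ambiguous) windows."""
    breaks = []
    last = None
    for w, label in enumerate(labels):
        if label is None:
            continue
        if last is not None and label != last:
            breaks.append(w * window)
        last = label
    return breaks


# Toy data (invented): type 3 matches the 5' half, type 2 the 3' half,
# with a stretch at positions 150-199 where both match.
toy = {
    "type3": [93] * 200 + [5] * 200,
    "type2": [5] * 150 + [93] * 250,
}
labels = dominant_reference_by_window(toy)
breaks = candidate_breakpoints(labels)
print(labels, breaks)
```

A single switch between long runs of labels is the pattern I would read as a recombinant. Many short alternating runs would look more like noise or uneven coverage.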
My question is: can I use per-base Phred scores to detect recombinants and mixtures in this way? I am asking about failure modes of this idea. Is there a known way to get "disjoint" Phred scores in alignments to different, related sequences by mistake? If anyone knows a reference too, that'd be great!
EDIT: added an example figure
I'm interpreting this as evidence of a polio recombinant with a breakpoint near the cis-acting replication element (CRE): type 3 on the 5' side and type 2 on the 3' side. The CREs are highly conserved, as are the many short segments where the maximum scores overlap.