Hi all,
I'm trying to figure out where I could go wrong with the following analysis. I'm relatively new to bioinformatics in general and deep sequencing in particular. I haven't found any papers that do what I'm proposing below and I don't have a senior colleague who has been down these roads before. So thanks in advance for your help!
I'm working with poliovirus samples that have been deep sequenced using Illumina with 36 bp reads (not sure which Illumna variant, but Phred range is 0 to 93 which narrows it down). Most of the samples contain one or more of the 3 known vaccine strains, and so it is possible to determine consensus alignments to the known references. With respect to each reference, the consensus alignment is assigned Phred scores.
It's my understanding that the phred scores can be used to identify the closest matching reference. In an "easy" sample, one reference alignment will have phred scores almost universally at 93 (1 in 2 billion per base error probability) while the other two reference alignments have phred scores in non-conserved regions of typically 0 (most common) up to 40. Using the scores this way to type the samples with a single high quality whole-genome match agrees with other assays of poliovirus type.
Regarding samples that contain mixtures (co-infection), there are samples that have 2 maximum quality whole-genome matches to reference type. My understanding is that means that the read depth across the whole genome for both references is more than high enough to claim that both strains are present in the sample.
Regarding recombinants, there are samples where 1 reference has maximal read depth over some large and structurally meaningful segment of the genome while another reference is covered poorly in that region but maximally covered everywhere else. To me, this suggests that the sample is a fixed recombinant of the two types.
My question is: can I use per-base phred scores to detect recombinants and mixtures in this way? I am asking about failure modes of this idea. Is there a known way to get "disjoint" phred scores in different alignments to related sequences by mistake? If anyone knows a reference too, that'd be great!
Thanks again.
EDIT: added an example figure
I'm interpreting this as evidence of a polio recombinant near the cis-acting reproduction element (CRE): type 3 on the 5' and type 2 on the 3'. The CREs are highly conserved, as are the many short segments with overlapping maximum scores.
Thanks for the informative reply. Above all else, thanks for pointing out that the what I really need to do is to see how reads map as fusions. The reference to FusionMap was helpful and my colleague and I are looking at it and related tools now.
If you've got a little more time, I want to better understand why phred score comparison doesn't work. I attached an example figure from my data. I understand everything you mentioned about Sanger sequencing, and that the illumina phred scores alone only give local information. But, if I aggregate across segments of the genome, observe common changepoints across different reference sequences, and account for conserved regions in the interpretation, why is that not sufficient to identify a recombinant?
Thanks again.
From the figure that you uploaded, it looks like you're actually using Sanger sequencing. I'm not aware of any Illumina machines that produce scores above ~41. Those also seem to be scores from an alignment or assembly, rather than the more standard (at least that I'm used to) read base phred scores. So maybe it'd be helpful (either for me or for someone else to be able to jump in) to know more about how your reads were made and that graph was constructed.