Question: Identifying Recombination Among Known Viruses With Illumina Phred Scores
gravatar for mikefamulare
7.6 years ago by
mikefamulare0 wrote:

Hi all,

I'm trying to figure out where I could go wrong with the following analysis. I'm relatively new to bioinformatics in general and deep sequencing in particular. I haven't found any papers that do what I'm proposing below and I don't have a senior colleague who has been down these roads before. So thanks in advance for your help!

I'm working with poliovirus samples that have been deep sequenced using Illumina with 36 bp reads (not sure which Illumna variant, but Phred range is 0 to 93 which narrows it down). Most of the samples contain one or more of the 3 known vaccine strains, and so it is possible to determine consensus alignments to the known references. With respect to each reference, the consensus alignment is assigned Phred scores.

It's my understanding that the phred scores can be used to identify the closest matching reference. In an "easy" sample, one reference alignment will have phred scores almost universally at 93 (1 in 2 billion per base error probability) while the other two reference alignments have phred scores in non-conserved regions of typically 0 (most common) up to 40. Using the scores this way to type the samples with a single high quality whole-genome match agrees with other assays of poliovirus type.

Regarding samples that contain mixtures (co-infection), there are samples that have 2 maximum quality whole-genome matches to reference type. My understanding is that means that the read depth across the whole genome for both references is more than high enough to claim that both strains are present in the sample.

Regarding recombinants, there are samples where 1 reference has maximal read depth over some large and structurally meaningful segment of the genome while another reference is covered poorly in that region but maximally covered everywhere else. To me, this suggests that the sample is a fixed recombinant of the two types.

My question is: can I use per-base phred scores to detect recombinants and mixtures in this way? I am asking about failure modes of this idea. Is there a known way to get "disjoint" phred scores in different alignments to related sequences by mistake? If anyone knows a reference too, that'd be great!

Thanks again.

EDIT: added an example figure

I'm interpreting this as evidence of a polio recombinant near the cis-acting reproduction element (CRE): type 3 on the 5' and type 2 on the 3'. The CREs are highly conserved, as are the many short segments with overlapping maximum scores.

ADD COMMENTlink modified 3.3 years ago by Biostar ♦♦ 20 • written 7.6 years ago by mikefamulare0
gravatar for Devon Ryan
7.6 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

Short answer: No, that won't work. Have a look at FusionMap or similar software.

Long answer: Back in the days of Sanger sequencing, we used to use tricks like that to find copy number variations on other rearrangements associated with disease (for example, keeping an eye on the chromatogram heights led to this paper). Remember that Sanger sequencing is functionally quite different from the high throughput stuff that you're using. In Sanger sequencing, you have a pool of possibly heterogenous (due to underlying sample heterogeneity, PCR error, non-specific PCR, etc.) that are being sequenced in a pool in the same well. There, if one doesn't have any other issues, large step changes in Phred scores can often mean underlying functional sequence changes (e.g., if the peaks suddenly and consistently go from high quality to what appear to be overlapping good quality sequences, you might have recombination or a copy number variant). With high-throughput sequencing, the individual reads are created from pools that are amplified on the flow cell. The big difference here is that the amplification is from individual fragments that bind to the flow cells. That is to say that the phred scores associated with each base only tell you about the particular fragment in that spot that's being sequenced. To find recombinations, you'd need to see instead if the read maps best as a fusion. I did a quick google and you might find FusionMap to be of use.

ADD COMMENTlink written 7.6 years ago by Devon Ryan97k

Thanks for the informative reply. Above all else, thanks for pointing out that the what I really need to do is to see how reads map as fusions. The reference to FusionMap was helpful and my colleague and I are looking at it and related tools now.

If you've got a little more time, I want to better understand why phred score comparison doesn't work. I attached an example figure from my data. I understand everything you mentioned about Sanger sequencing, and that the illumina phred scores alone only give local information. But, if I aggregate across segments of the genome, observe common changepoints across different reference sequences, and account for conserved regions in the interpretation, why is that not sufficient to identify a recombinant?

Thanks again.

ADD REPLYlink written 7.6 years ago by mikefamulare0

From the figure that you uploaded, it looks like you're actually using Sanger sequencing. I'm not aware of any Illumina machines that produce scores above ~41. Those also seem to be scores from an alignment or assembly, rather than the more standard (at least that I'm used to) read base phred scores. So maybe it'd be helpful (either for me or for someone else to be able to jump in) to know more about how your reads were made and that graph was constructed.

ADD REPLYlink written 7.6 years ago by Devon Ryan97k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1285 users visited in the last hour