Question

Identifying Recombination Among Known Viruses With Illumina Phred Scores

11.1 years ago

mikefamulare • 0

Hi all,

I'm trying to figure out where I could go wrong with the following analysis. I'm relatively new to bioinformatics in general and deep sequencing in particular. I haven't found any papers that do what I'm proposing below and I don't have a senior colleague who has been down these roads before. So thanks in advance for your help!

I'm working with poliovirus samples that have been deep sequenced on Illumina with 36 bp reads (I'm not sure which Illumina variant, but the Phred range is 0 to 93, which narrows it down). Most of the samples contain one or more of the 3 known vaccine strains, so it is possible to determine consensus alignments to the known references. With respect to each reference, the consensus alignment is assigned Phred scores.

It's my understanding that the Phred scores can be used to identify the closest matching reference. In an "easy" sample, one reference alignment will have Phred scores almost universally at 93 (1 in 2 billion per-base error probability), while the other two reference alignments have Phred scores in non-conserved regions of typically 0 (most common) up to 40. For samples with a single high-quality whole-genome match, typing the samples by these scores agrees with other assays of poliovirus type.
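For reference, the "1 in 2 billion" figure follows from the standard Phred definition, Q = -10 log10(P). A minimal sketch of the conversion (the function names are my own, not from any particular library):

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q to an error probability, P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P to a Phred score, Q = -10 * log10(P)."""
    return -10 * math.log10(p)


if __name__ == "__main__":
    # Scores mentioned in the question: 0, 40, and the maximum of 93.
    for q in (0, 40, 93):
        p = phred_to_error_prob(q)
        print(f"Q={q}: P(error)={p:.3g} (about 1 in {1 / p:.3g})")
```

A score of 93 gives roughly 5e-10, i.e. about 1 error in 2 billion, while a score of 0 corresponds to an error probability of 1.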

Regarding samples that contain mixtures (co-infection), there are samples that have 2 maximum-quality whole-genome matches to reference types. My understanding is that this means the read depth across the whole genome for both references is more than high enough to claim that both strains are present in the sample.

Regarding recombinants, there are samples where 1 reference has maximal read depth over some large and structurally meaningful segment of the genome while another reference is covered poorly in that region but maximally covered everywhere else. To me, this suggests that the sample is a fixed recombinant of the two types.
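To make the idea concrete, here is a rough sketch of the bookkeeping I have in mind. It only summarizes the score tracks; it does not show that the interpretation is valid. It assumes the per-base consensus Phred scores for each reference have already been placed on one shared coordinate system (that step is not shown), and the function name, `window`, and `margin` are arbitrary choices for illustration:

```python
def best_reference_segments(scores, window=50, margin=20.0):
    """Label each window with the reference whose mean score is clearly highest.

    scores: dict mapping reference name -> list of per-base Phred scores,
            all on the same (assumed) coordinate system.
    Windows where the top two references differ by less than `margin`
    (e.g. conserved regions) are labelled "ambiguous".
    Returns a list of (start, end, label) with adjacent equal labels merged.
    """
    names = sorted(scores)
    length = len(scores[names[0]])
    if any(len(scores[n]) != length for n in names):
        raise ValueError("all score tracks must have the same length")

    segments = []
    for start in range(0, length, window):
        end = min(start + window, length)
        # Mean score per reference in this window, highest first.
        means = sorted(
            ((sum(scores[n][start:end]) / (end - start), n) for n in names),
            reverse=True,
        )
        if len(means) > 1 and means[0][0] - means[1][0] < margin:
            label = "ambiguous"
        else:
            label = means[0][1]
        # Merge with the previous segment when the label is unchanged.
        if segments and segments[-1][2] == label:
            segments[-1] = (segments[-1][0], end, label)
        else:
            segments.append((start, end, label))
    return segments


if __name__ == "__main__":
    # Toy tracks: type 3 scores high on the left, type 2 on the right,
    # with a shared high-scoring (conserved) stretch in the middle.
    toy = {
        "type2": [0] * 100 + [93] * 50 + [93] * 100,
        "type3": [93] * 100 + [93] * 50 + [0] * 100,
    }
    for seg in best_reference_segments(toy):
        print(seg)
```

A single switch of label along the genome is the pattern I am reading as a candidate recombinant; two references that are both high everywhere would instead come out "ambiguous" throughout, which is the mixture case.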

My question is: can I use per-base phred scores to detect recombinants and mixtures in this way? I am asking about failure modes of this idea. Is there a known way to get "disjoint" phred scores in different alignments to related sequences by mistake? If anyone knows a reference too, that'd be great!

Thanks again.

EDIT: added an example figure

I'm interpreting this as evidence of a polio recombinant near the cis-acting replication element (CRE): type 3 on the 5' side and type 2 on the 3' side. The CREs are highly conserved, as are the many short segments with overlapping maximum scores.

illumina recombination next-gen • 3.1k views

updated 6.8 years ago by Biostar 20 • written 11.1 years ago by mikefamulare • 0

score 1 · Answer 1 · 2013-03-22

Short answer: No, that won't work. Have a look at FusionMap or similar software.

Long answer: Back in the days of Sanger sequencing, we used to use tricks like that to find copy number variations or other rearrangements associated with disease (for example, keeping an eye on the chromatogram heights led to this paper). Remember that Sanger sequencing is functionally quite different from the high-throughput stuff that you're using. In Sanger sequencing, you have a possibly heterogeneous pool of templates (due to underlying sample heterogeneity, PCR error, non-specific PCR, etc.) that are all sequenced together in the same well. There, if one doesn't have any other issues, large step changes in Phred scores can often mean underlying functional sequence changes (e.g., if the peaks suddenly and consistently go from high quality to what appear to be overlapping good-quality sequences, you might have recombination or a copy number variant).

With high-throughput sequencing, the individual reads are created from clusters that are amplified on the flow cell. The big difference here is that each cluster is amplified from an individual fragment that binds to the flow cell. That is to say, the Phred score associated with each base only tells you about the particular fragment in that spot that's being sequenced. To find recombination, you'd instead need to see whether individual reads map best as a fusion. I did a quick Google search and you might find FusionMap to be of use.
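To illustrate what "maps best as a fusion" means, here is a toy sketch. It is not FusionMap's algorithm, just the basic idea: for a read that spans a candidate breakpoint, compare the mismatches against each reference alone with the best prefix-from-one, suffix-from-the-other split. It assumes the read and both reference windows are already aligned to the same coordinates and have the same length; the function names and the `min_gain` threshold are made up for this example.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def best_split(read, ref_left, ref_right):
    """Best split point k: read[:k] matched to ref_left, read[k:] to ref_right.

    Returns (k, mismatches) for the split with the fewest mismatches.
    """
    n = len(read)
    if not (len(ref_left) == len(ref_right) == n):
        raise ValueError("read and reference windows must have equal length")
    best = None
    for k in range(1, n):
        cost = hamming(read[:k], ref_left[:k]) + hamming(read[k:], ref_right[k:])
        if best is None or cost < best[1]:
            best = (k, cost)
    return best


def classify_read(read, ref_a, ref_b, min_gain=3):
    """Call a read 'chimeric' if a split alignment beats both whole-read
    alignments by at least `min_gain` mismatches (arbitrary threshold)."""
    whole = min(hamming(read, ref_a), hamming(read, ref_b))
    _, cost_ab = best_split(read, ref_a, ref_b)
    _, cost_ba = best_split(read, ref_b, ref_a)
    return "chimeric" if whole - min(cost_ab, cost_ba) >= min_gain else "single"


if __name__ == "__main__":
    ref_a = "ACGTACGTACGT"
    ref_b = "ACCTAGGTTCGA"
    read = ref_a[:6] + ref_b[6:]  # synthetic junction read
    print(best_split(read, ref_a, ref_b), classify_read(read, ref_a, ref_b))
```

With 36 bp reads, only reads that actually cross the junction carry this signal, so you would want many independent junction-spanning reads before believing a breakpoint.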