Question: How to tell whether mismatches come from sequencing errors or a different haplotype?
Asked 3.5 years ago by chjiao3456 (40), Michigan State University, USA:

Hi everyone, I have a next-generation sequencing data set containing two haplotypes. One accounts for 95% of the sequencing data and the other for only 5%. I used assembly methods to get the dominant haplotype, and then mapped the raw reads to the assembled contigs. Since the sequence similarity between the two haplotypes is high, I am wondering how I can tell whether the mismatches between the reads and the reference come from sequencing errors or from the minor haplotype.

Of course, the sequencing error rate and the sequence divergence between the two haplotypes are different. In addition, sequencing errors tend to be random. However, I need a probability or statistical model to describe this problem and arrive at a theoretically sound solution. For example, if multiple mapped reads have the same mismatch at the same position of the reference, this mismatch most likely comes from the minor haplotype. Would a hypothesis-testing method work?
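One way to make the hypothesis-testing idea concrete is a one-sided binomial test per site. Under the null hypothesis that every non-reference base is a sequencing error, the number of reads showing one specific alternative base at a site with depth n follows Binomial(n, e/3), where e is the per-base error rate. The sketch below assumes independent reads, a uniform error rate split evenly over the three other bases, and illustrative numbers (depth 1000, e = 1%); the function names are hypothetical and not taken from any variant caller.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    log_comb = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_comb + k * log(p) + (n - k) * log(1 - p)


def binomial_tail_pvalue(depth, alt_count, error_rate):
    """P(X >= alt_count) if every alt read were a sequencing error.

    error_rate is the assumed chance that a read shows this specific
    alternative base by error (e.g. per-base error rate divided by 3).
    """
    if alt_count <= 0:
        return 1.0
    # Sum the upper tail directly to avoid cancellation for tiny p-values.
    tail = sum(
        exp(log_binom_pmf(k, depth, error_rate))
        for k in range(alt_count, depth + 1)
    )
    return min(1.0, tail)


if __name__ == "__main__":
    per_base_error = 0.01           # assumed overall error rate
    specific_error = per_base_error / 3
    # 50 of 1000 reads share the same mismatch (about 5%, like the minor haplotype).
    print(binomial_tail_pvalue(1000, 50, specific_error))
    # 3 of 1000 reads share the same mismatch (close to the expected error count).
    print(binomial_tail_pvalue(1000, 3, specific_error))
```

Because every reference position is tested, the significance threshold should be corrected for multiple testing, for example with a Bonferroni correction that divides the threshold by the number of positions tested.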

modified 3.5 years ago • written 3.5 years ago by chjiao3456 (40)

The base quality scores at the location in question are also useful, as are other factors such as whether the reads indicating the minor allele are properly paired and whether they map in both orientations. I suggest you use a variant-caller that models some of these things and get its opinion. If your data has different barcodes for the different haplotypes this should be straightforward, but it's not clear to me what you mean by data with 95% of one haplotype and 5% from the other.
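The two signals mentioned above, base quality and read orientation, can be summarized per site as in the following sketch. It relies on the Phred definition (error probability = 10^(-Q/10)). The input layout (one `(base, quality, is_reverse)` tuple per read covering the site), the even split of errors over the three other bases, and the function names are assumptions for illustration, not how any particular variant caller works.

```python
def phred_to_error_prob(quality):
    """Convert a Phred base quality Q to an error probability 10^(-Q/10)."""
    return 10 ** (-quality / 10)


def summarize_alt_support(observations, alt_base):
    """Summarize evidence for alt_base at one reference position.

    observations: list of (base, phred_quality, is_reverse) tuples,
    one per read covering the site (hypothetical input layout).
    """
    depth = len(observations)
    # Expected number of reads showing alt_base purely by error, assuming
    # an error is equally likely to produce any of the three other bases.
    expected_errors = sum(
        phred_to_error_prob(q) / 3 for _, q, _ in observations
    )
    forward = sum(
        1 for base, _, rev in observations if base == alt_base and not rev
    )
    reverse = sum(
        1 for base, _, rev in observations if base == alt_base and rev
    )
    return {
        "depth": depth,
        "alt_count": forward + reverse,
        "expected_errors": expected_errors,
        "alt_forward": forward,
        "alt_reverse": reverse,
        "both_strands": forward > 0 and reverse > 0,
    }


if __name__ == "__main__":
    # Made-up pileup: 95 reference reads and 5 alt reads, all at Q30.
    site = (
        [("A", 30, False)] * 95
        + [("G", 30, False)] * 3
        + [("G", 30, True)] * 2
    )
    print(summarize_alt_support(site, "G"))
```

An alternative-base count far above `expected_errors`, with support on both strands, is more consistent with a real minor haplotype than with random sequencing error; a dedicated variant caller combines such signals in a more principled way.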

written 3.5 years ago by Brian Bushnell (17k)

Thanks for your suggestions. The 95% and 5% are the percentages of reads belonging to the dominant and minor haplotypes, respectively. Unfortunately, there are no barcodes for the two haplotypes.

written 3.5 years ago by chjiao3456 (40)

I guess what I don't understand is how you have this ratio of haplotypes. Is this a combination of two strains of bacteria in a culture, for example?

written 3.5 years ago by Brian Bushnell (17k)

Yes, something like that. Two strains of HIV-1 were combined for sequencing.

written 3.5 years ago by chjiao3456 (40)

Ah, I see. Yes, a variant-caller capable of handling low-frequency variants or arbitrary ploidy should help in this situation.

written 3.5 years ago by Brian Bushnell (17k)