Question

How to tell whether mismatches come from sequencing errors or a different haplotype?

6.9 years ago

chjiao3456 ▴ 40

Hi everyone, I have a next-generation sequencing data set containing two haplotypes. One accounts for 95% of the sequencing data and the other for only 5%. I used an assembly method to reconstruct the dominant haplotype, and then mapped the raw reads back to the assembled contigs. Since the two haplotypes are highly similar in sequence, I am wondering how I can tell whether a mismatch between a read and the reference comes from a sequencing error or from the minor haplotype.

Of course, the sequencing error rate and the sequence divergence between the two haplotypes are different. In addition, sequencing errors tend to be random. However, I need a probabilistic or statistical model for this problem so that I can work out a theoretically sound solution. For example, if multiple mapped reads show the same mismatch at the same position of the reference, that mismatch very likely comes from the minor haplotype. Would a hypothesis test be appropriate?
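To make the idea concrete, here is a minimal sketch of one such test, not a validated method. It assumes that under the null hypothesis every read covering a position independently shows one particular wrong base with probability `error_rate / 3`, so the number of reads carrying the same mismatch is binomial. The function names and the 1% default error rate are illustrative assumptions, not values from my data.

```python
from math import exp, lgamma, log


def log_binom_pmf(k, n, p):
    """Log of P(X = k) for X ~ Binomial(n, p), with 0 < p < 1."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))


def binomial_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed in log space for stability."""
    if k <= 0:
        return 1.0
    return min(1.0, sum(exp(log_binom_pmf(i, n, p)) for i in range(k, n + 1)))


def mismatch_p_value(alt_reads, depth, error_rate=0.01):
    """P-value for seeing >= alt_reads copies of one specific mismatch.

    Assumption: errors are independent and equally likely to produce each
    of the three wrong bases, hence error_rate / 3 per read.
    """
    return binomial_tail(alt_reads, depth, error_rate / 3)


if __name__ == "__main__":
    # Hypothetical pileup: 1000 reads, 50 of them carry the same mismatch.
    print(mismatch_p_value(50, 1000))
    # Hypothetical pileup: 1000 reads, only 3 carry the same mismatch.
    print(mismatch_p_value(3, 1000))
```

A small p-value argues against pure sequencing error at that position. Because every position of the reference is tested, the threshold would need a multiple-testing correction, for example dividing the significance level by the number of positions.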

sequencing haplotype mismatches • 1.7k views

ADD COMMENT • link 6.9 years ago by chjiao3456 ▴ 40

The base quality scores at the position in question are also useful, as are other factors such as whether the reads supporting the minor allele are properly paired and whether they map in both orientations. I suggest you run a variant caller that models some of these things and see what it reports. If your data has different barcodes for the different haplotypes this should be straightforward, but it's not clear to me what you mean by data with 95% of one haplotype and 5% of the other.
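As a rough illustration of how base qualities and read orientation can enter the picture, here is a minimal sketch. It is not what any particular variant caller does. It assumes each read's chance of showing one specific wrong base is its Phred error probability divided by three, and that reads are independent; all function names are hypothetical.

```python
def phred_to_error(q):
    """Convert a Phred quality score to an error probability."""
    return 10 ** (-q / 10)


def poisson_binomial_tail(probs, k):
    """P(at least k successes) for independent trials with unequal probabilities."""
    # dist[j] = P(exactly j successes among the trials processed so far)
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, mass in enumerate(dist):
            nxt[j] += mass * (1 - p)
            nxt[j + 1] += mass * p
        dist = nxt
    return min(1.0, sum(dist[k:]))


def quality_aware_p_value(base_quals, alt_count):
    """P-value for >= alt_count reads showing one specific mismatch by error alone.

    base_quals: Phred scores of all reads covering the position.
    Assumption: each wrong base is equally likely, hence the division by 3.
    """
    probs = [phred_to_error(q) / 3 for q in base_quals]
    return poisson_binomial_tail(probs, alt_count)


def seen_on_both_strands(alt_strands):
    """True if the mismatch is supported by forward ('+') and reverse ('-') reads."""
    return {"+", "-"} <= set(alt_strands)


if __name__ == "__main__":
    # Hypothetical pileup: 200 reads at Q30, 10 of them carry the mismatch.
    print(quality_aware_p_value([30] * 200, 10))
    print(seen_on_both_strands(["+", "-", "+", "+"]))
```

A mismatch with a small p-value that also appears on both strands is a better candidate for a real minor-haplotype variant than one seen only in low-quality bases or in a single orientation.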

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k

Thanks for your suggestions. The 95% and 5% are the percentages of reads that come from the dominant and minor haplotypes, respectively. Unfortunately, there are no barcodes for the two haplotypes.

ADD REPLY • link 6.9 years ago by chjiao3456 ▴ 40

I guess what I don't understand is how you have this ratio of haplotypes. Is this a combination of two strains of bacteria in a culture, for example?

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k

Yes, something like that. Two strains of HIV-1 are combined for sequencing.

ADD REPLY • link 6.9 years ago by chjiao3456 ▴ 40

Ah, I see. Yes, a variant-caller capable of handling low-frequency variants or arbitrary ploidy should help in this situation.

ADD REPLY • link 6.9 years ago by Brian Bushnell 20k