Tool to separate two genotypes from mapped Illumina reads?
9.3 years ago
Adam R ▴ 20

Dear BioStars,

I have a few Illumina datasets where I've seen a puzzling pattern in my mapped reads, and would appreciate some tips as to how to figure out what's going on. These datasets represent bacterial cultures. I have mapped the reads to genomes that are expected to be very closely related -- less than one SNP per kilobase. However, in some regions of the genomes, there are multiple reads that suggest the existence of multiple SNPs, while other reads perfectly match the reference sequence.

These appear to be true SNPs and not sequencing errors for a few reasons:

  1. There exists a distinct set of positions that are bi-allelic. In other words, the same putative SNP shows up in many reads. I have deep coverage (>100x) and these putative SNPs are seen in 20-80% of the reads.
  2. The SNPs appear to be linked -- the non-reference SNPs are found on the same DNA fragment (both within reads and in paired reads)
  3. The reads are not duplicates of each other (different starts and stops, both strands)

My working hypothesis is that my sequencing library actually contained two different genotypes. In one of my studies, this would be interesting, and I would like to get a better sense of what's going on in the population of bacteria. In the other study, I fear that there was cross-contamination between my DNA preps from different cultures, and I would like to understand this better for QC reasons.

So does anyone know of an analytical tool that would help me disentangle these sequences? If I were to make it myself, I would include the following features:

  1. A report of sequence diversity along the genome to identify the locations of these putative SNPs.
  2. A report of the linkage between the SNPs
  3. The sequence of the putative contaminant/invader so that I can figure out where it came from.
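As a rough illustration of feature 1 in the list above, here is a minimal sketch in plain Python. It assumes the reads have already been reduced to ungapped `(start, sequence)` alignments with 0-based coordinates; real data would come from a BAM file and would need indel and base-quality handling. The function name `allele_frequencies` and the thresholds `min_depth` and `min_minor_frac` are hypothetical and chosen only for illustration.

```python
from collections import Counter, defaultdict


def allele_frequencies(reference, reads, min_depth=20, min_minor_frac=0.2):
    """Report positions where a non-reference base reaches an intermediate frequency.

    reference: reference sequence as a string
    reads: iterable of (start, sequence) tuples; 0-based, ungapped (assumption)
    Returns a list of (position, ref_base, alt_base, depth, alt_fraction).
    """
    # Count every base observed at every reference position.
    counts = defaultdict(Counter)
    for start, seq in reads:
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                counts[pos][base] += 1

    report = []
    for pos in sorted(counts):
        depth = sum(counts[pos].values())
        if depth < min_depth:
            continue
        ref_base = reference[pos]
        # Keep only non-reference, non-ambiguous bases.
        alts = [(b, n) for b, n in counts[pos].items() if b not in (ref_base, "N")]
        if not alts:
            continue
        alt_base, alt_count = max(alts, key=lambda item: item[1])
        frac = alt_count / depth
        if frac >= min_minor_frac:
            report.append((pos, ref_base, alt_base, depth, round(frac, 3)))
    return report
```

Sorting the returned positions along the genome gives a simple diversity track; clusters of reported sites would correspond to the divergent regions described above.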

Any thoughts would be appreciated. I would also like to hear whether others have encountered these patterns and if there are other possible explanations.

Thanks
Adam

mapping genome bacteria • 2.8k views

Have you checked the read depth? The strain could have an extra copy not present in the reference genome. Either way, I would try a de novo assembler that does not aggressively collapse bubbles. Things get easier when you have longer contigs (e.g. to identify long divergent haplotypes). Also, it is not uncommon for two closely related strains to have some regions with high divergence.
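One simple way to run the read-depth check suggested here is to compare the median depth inside the suspect region with the genome-wide median. The sketch below assumes a per-position depth list is already available (for example, exported from a depth report); `depth_ratio` is a hypothetical helper name, and what ratio counts as evidence for an extra copy is left to the reader.

```python
from statistics import median


def depth_ratio(depths, region_start, region_end):
    """Ratio of median depth in [region_start, region_end) to the genome-wide median.

    depths: list of per-position read depths (assumed input)
    A ratio near 2 would be consistent with a collapsed duplication;
    a ratio near 1 argues against it.
    """
    region = depths[region_start:region_end]
    if not region:
        raise ValueError("region is empty")
    genome_median = median(depths)
    if genome_median == 0:
        raise ValueError("genome-wide median depth is zero")
    return median(region) / genome_median
```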


Thanks for the ideas. Based on read depth, it does not look like a duplication, but I'll probably incorporate that possibility into any systematic analysis that I do. And yes, a less aggressive assembler could allow me to recover both genotypes for the diverse regions.

9.3 years ago

I have a tool (BBSplit) that can split input reads by assigning them to the assembly to which they best map, or to both/neither if they map equally well to both. But to use it you would need to obtain and assemble pure samples of each strain. At a 1/1000 SNP rate, I don't see any easy way for you to separate the strains from a mixed sample with Illumina-length inserts; it would require PacBio data.

It sounds like you have a mixed culture, though 99.9% identity gets pretty close to being the same organism, considering the margins of error in sequencing and assembly. Is the difference kind of distributed throughout the genome, or concentrated in a small area?


Thanks. BBSplit would work for one of my studies (where one strain may be invading a region with another strain).

The polymorphisms are clustered together, and I was only hoping to link them within each region (on the order of 1 kb in size).
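Linking sites within a region of about 1 kb can be sketched by tallying which alleles co-occur on the same DNA fragment. The example below is a minimal illustration, not an existing tool: it assumes ungapped, 0-based `(start, sequence)` reads grouped per fragment (one read for single-end, two for paired-end) and a known set of bi-allelic sites. The names `site_linkage`, `fragments`, and `sites` are hypothetical.

```python
from collections import Counter
from itertools import combinations


def site_linkage(fragments, sites):
    """Count allele combinations seen on the same fragment for each pair of sites.

    fragments: list of fragments, each a list of (start, sequence) reads (assumption)
    sites: dict mapping position -> (ref_base, alt_base)
    Returns dict (pos_a, pos_b) -> Counter with keys such as 'ref/ref' or 'alt/alt'.
    """
    table = {}
    for fragment in fragments:
        calls = {}
        for start, seq in fragment:
            for pos, (ref_base, alt_base) in sites.items():
                offset = pos - start
                if 0 <= offset < len(seq):
                    base = seq[offset]
                    # If overlapping mates disagree, the later read wins (simplification).
                    if base == ref_base:
                        calls[pos] = "ref"
                    elif base == alt_base:
                        calls[pos] = "alt"
        # Every pair of sites covered by this fragment contributes one observation.
        for a, b in combinations(sorted(calls), 2):
            table.setdefault((a, b), Counter())[calls[a] + "/" + calls[b]] += 1
    return table
```

If two genotypes are present, most fragments should fall into `ref/ref` or `alt/alt`; many `ref/alt` or `alt/ref` fragments would point toward sequencing error or a more complex mixture.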

9.1 years ago

Adam,

I've recently started using breseq to track mutations in a culture of mine (I have periodic sequencing data from the culture dating back to its establishment two years ago). In general, it's very impressive. Relevant to your problem, it has a "polymorphism" mode that identifies and calls polymorphic mutations in a mixed population. It uses some sort of statistical hypothesis testing (I haven't explored it for more than a week at this point) to determine whether differences at a particular base are mutations or sequencing errors, and whether they are polymorphic or not. The nice thing about it is that it displays the read alignment evidence for each mutation. You might be able to see whether the polymorphisms are linked by looking at the nicely annotated read pileups against the reference sequence and spotting patterns in the ratios or, if you get lucky, seeing two mutations that are linked on a single read.

An example of the output (pictures of read alignment at the bottom of the page):

http://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/output.html#

On the polymorphism prediction feature:

http://barricklab.org/twiki/pub/Lab/ToolsBacterialGenomeResequencing/documentation/methods.html#polymorphism-prediction

Nathan

P.S. - I read a little on their polymorphism prediction (from the link above), and they use a likelihood-ratio test.
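To give a feel for what a likelihood-ratio test looks like at a single site, here is a generic binomial sketch. It is not breseq's actual model, which is described in the methods page linked above. It compares "all non-reference bases are sequencing errors at a fixed rate" against "the non-reference fraction is free to take any value". The function name `polymorphism_lrt` and the default `error_rate` of 0.01 are assumptions made for illustration.

```python
import math


def _binom_loglik(k, n, p):
    # Log-likelihood without the binomial coefficient, which cancels in the ratio.
    p = min(max(p, 1e-12), 1 - 1e-12)
    return k * math.log(p) + (n - k) * math.log(1 - p)


def polymorphism_lrt(alt_count, depth, error_rate=0.01):
    """Generic likelihood-ratio test for a polymorphic site (illustrative only).

    H0: alt bases arise from sequencing error at `error_rate` (assumed value).
    H1: alt fraction is the maximum-likelihood estimate alt_count / depth.
    Returns (test statistic, approximate p-value from chi-square with 1 df).
    """
    if depth <= 0:
        raise ValueError("depth must be positive")
    p_hat = alt_count / depth
    if p_hat <= error_rate:
        # No more alt bases than the error model already explains.
        return 0.0, 1.0
    stat = 2.0 * (_binom_loglik(alt_count, depth, p_hat)
                  - _binom_loglik(alt_count, depth, error_rate))
    # Survival function of chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

With deep coverage, a site where 40 of 100 reads carry the alternative base gives a very small p-value under this model, while 2 of 100 does not reach significance at a 1% error rate.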
