I am using MiSeq to sequence parasite samples and identify multiple genetic strains within each sample. I wanted to sequence the full length of a particular gene, which is 1000 bp, so I designed primers to split it into 4 overlapping fragments. The problem is that I don’t know which fragments belong together to make up the full-length sequence that represents a single genetic strain.
I have tried BLASTing the fragments, as there are reference strains in GenBank, but some fragments are more conserved than others and have multiple 100% matches, whereas other fragments are more variable and have no 100% matches (perhaps representing new strains).
I have aligned the sequences from all four fragments and looked at the regions of overlap, but a 100% match in the overlap does not necessarily mean the fragments come from the same strain, since strains that are identical in the overlap can still differ in the non-overlapping regions.
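To illustrate the ambiguity, here is a minimal Python sketch with made-up sequences. The function names, the toy fragments, and the start coordinates are all hypothetical, and it assumes the fragments are already placed in gene coordinates with no indels. It lists every pair of adjacent fragment variants that agree in their overlap:

```python
from itertools import product

def overlap_agrees(seq_a, start_a, seq_b, start_b):
    """True if two fragments are identical over their shared gene coordinates."""
    lo = max(start_a, start_b)
    hi = min(start_a + len(seq_a), start_b + len(seq_b))
    if hi <= lo:
        return False  # the fragments do not overlap at all
    return seq_a[lo - start_a:hi - start_a] == seq_b[lo - start_b:hi - start_b]

def compatible_pairs(frags_a, start_a, frags_b, start_b):
    """All (name_a, name_b) pairs whose overlapping bases match exactly."""
    return [
        (a, b)
        for a, b in product(frags_a, frags_b)
        if overlap_agrees(frags_a[a], start_a, frags_b[b], start_b)
    ]

# Toy data (invented): fragment 1 starts at position 0, fragment 2 at position 4.
frag1 = {"F1_a": "ACGTACGT", "F1_b": "ACCTACGT"}  # differ outside the overlap
frag2 = {"F2_a": "ACGTTTGA", "F2_b": "ACGTTAGA", "F2_c": "ACTTTTGA"}

pairs = compatible_pairs(frag1, 0, frag2, 4)
print(pairs)
```

In this toy case F2_c is ruled out, but both fragment 1 variants remain compatible with both F2_a and F2_b, so the overlap alone cannot tell me which combination is a real strain.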
Treating the genetic strains as haplotypes and inferring their frequencies from pooled samples seems the most promising approach; however, it relies on SNP frequencies. My issues are: a) there can be more than one alternative allele at any given position, and b) obtaining SNP frequencies across the full length of the sequence is problematic, because SNPs in the overlapping regions must not be counted twice.
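One way I could imagine handling both points is to count every base at every gene position and divide by the depth at that position, so that the overlaps simply contribute extra coverage instead of extra SNPs. This is only a rough Python sketch: the function name and the toy reads are invented, and it assumes the reads are already aligned to gene coordinates without indels and that all fragments amplify the strains in the same proportions.

```python
from collections import Counter, defaultdict

def allele_frequencies(aligned_reads):
    """aligned_reads: iterable of (start, sequence) pairs in gene coordinates.

    Returns {position: {base: frequency}}, where each frequency is the share
    of reads covering that position that carry the base.
    """
    counts = defaultdict(Counter)
    for start, seq in aligned_reads:
        for offset, base in enumerate(seq):
            if base in "ACGT":  # skip N and gap characters
                counts[start + offset][base] += 1
    freqs = {}
    for pos, base_counts in counts.items():
        depth = sum(base_counts.values())  # per-position depth, so overlaps are normalised
        freqs[pos] = {base: n / depth for base, n in base_counts.items()}
    return freqs

# Toy reads (invented): two from fragment 1 (start 0), two from fragment 2 (start 2).
reads = [(0, "ACGT"), (0, "ACTT"), (2, "GTAA"), (2, "CTAA")]
freqs = allele_frequencies(reads)
print(freqs[2])  # position 2 lies in the overlap and carries three different bases
```

Each position appears once in the result regardless of how many fragments cover it, and a site with more than one alternative allele just gets more than two entries in its dictionary.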
Any thoughts on the above would be hugely appreciated!