I have discovered a problem with underestimating heterozygous sites in my samples.
I have small (~280 bp) amplicon sequences that were run on a MiSeq platform in a 250x250 PE run and demultiplexed by barcode. The amplicons were isolated with PCR, and the barcodes and adaptors were attached in subsequent PCR cycles. I've filtered the reads using the FastQ Toolkit and created aligned BAM files using BWA.
(Note: I have marked duplicates, but since my amplicons were sequenced in full and my aligned PE reads cover the whole amplicon, I don't think I want to exclude duplicates, because that could over-represent PCR errors relative to actual SNPs. I have read suggestions both for including and for excluding duplicates, and if anyone thinks that excluding them would solve the following problem, any explanation would be greatly appreciated.)
I am using the BAM files from BWA to run FreeBayes on the main public Galaxy server and call variants into VCF files. As suggested in a number of other sources, heterozygotes are called when an alternate allele appears in 20%-80% of the reads. I apply this threshold on top of filtering at Q20 and additional read-mapping filters. For the most part, the resulting VCF files look great.
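For concreteness, the 20%-80% rule described above can be sketched as a small Python function. This is only an illustration of the threshold logic, not how FreeBayes genotypes internally; the function name `classify_site`, its parameters, and the choice to treat the 20% and 80% boundaries as inclusive are assumptions made for the example.

```python
def classify_site(alt_reads, total_reads, low=0.20, high=0.80):
    """Label a site from its alternate-allele read fraction.

    Hypothetical helper: fractions in [low, high] are called heterozygous,
    anything below is homozygous reference, anything above homozygous alternate.
    """
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    fraction = alt_reads / total_reads
    if fraction < low:
        return "hom_ref"
    if fraction > high:
        return "hom_alt"
    return "het"
```

Under this rule, a site with an even split of reference and alternate reads is labelled `het`, while a site where only a few percent of reads carry the alternate allele is labelled `hom_ref`.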
I ran a couple of samples in replicate, and in at least one instance one replicate called 4 heterozygous sites where the other replicate didn't identify any. When I look at the BAM file in IGV, I can see that the alternate alleles present in one replicate are also present in the second, but only in 4% of the reads, which does not meet the 20% threshold required to call an alternate allele. This is consistent across all four sites in question.
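As a rough sanity check on whether a 4% alternate fraction could be ordinary sampling noise at a true heterozygous site, one can compute a binomial tail probability. The sketch below assumes a depth of 100 reads (a made-up number, not taken from my data) and assumes reads are independent draws with a 50% chance of carrying either allele, which PCR amplicons do not strictly satisfy.

```python
from math import comb


def binom_cdf(k, n, p=0.5):
    """Return P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


depth = 100     # hypothetical read depth at the site (assumption)
alt_reads = 4   # 4% of the reads carry the alternate allele

# Chance of seeing this few alternate reads if both alleles were sampled equally
prob = binom_cdf(alt_reads, depth)
print(f"P(alt reads <= {alt_reads} of {depth}) = {prob:.3e}")
```

Under these assumptions the probability is vanishingly small, so random sampling of reads alone would not explain a 4% fraction at a genuinely heterozygous site; the skew would have to arise upstream of read sampling.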
I ran negative controls at all of the PCR stages, so I don't think I have any contamination, and the only explanations I can conjure up are: 1) that there was uneven amplification of one of my haplotypes during the initial PCR, or 2) that there was uneven binding of haplotypes to the flow cell.
Has anyone encountered this problem before, and does anyone have suggestions as to how I can avoid underestimating the level of heterozygosity in my samples?