Background: The CFTR gene has a Poly TG/Poly T region in intron 8 that has varying clinical consequences. I'm using targeted NGS for this region (150bp, paried end) and I'm trying to call the PolyTG/PolyT genotypes from fastq files. My first attempt was bbmap run with default parameters and a reference including (TG)9-12(T)5,7,9:
bbmap.sh ref=$ref in1=$R1 in2=$R2 scafstats=$out
39/40 samples could be correctly called based on the bbmap stats (header shown below). Heterozygous calls were made from ref_name's with the highest unambiguous reads. Homozygotes could be figured based on the ratio of the top two.
#sample ref_name per_unambiguousReads unambiguousMB per_ambiguousReads ambiguousMB unambiguousReads ambiguousReads assignedReads assignedBases
I expanded the reference to catch rare genotypes (e.g. T6 and TG13). The expanded reference has (TG)8-13(T)2-10. Running bbmap with this reference results in many incorrect calls.
I'm wondering if there are parameters I should be tweaking in bbmap or if there is a better solution to calling the Poly TG/Poly T genotypes that I'm not thinking of.