I've aligned pacbio ccs reads to a small reference sequence using minimap2 and viewed the result in IGV. The alignments look good, with generally very few discrepancies with the reference. There is one region, however where more variations are observed and that is at a string of 13 Cs. This is not unexpected but when viewing it in IGV I'm actually impressed by how minimal it appears to be.
Note that the reads aligned in this snapshot are typical of the rest of the alignment as well and that the center line marks the beginning of the polyC string
Subsequently I attempted to call variants on this bam file using a variety of variant caller, callvariants (BBMAP), freebayes, and HaplotypeCaller (gatk). All call a 1bp deletion of a C at the poly C string.
callvariants #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MIR604-F2-M2-1A.ccs.10pct.fastq-vs-MIR604.5901-10938.fa.minimap2.sort MIR604.5901-10938 1401 . Tc T 23.72 PASS SN=0;STA=1401;STO=1402;TYP=DEL;R1P=340;R1M=281;R2P=0;R2M=0;AD=621;DP=1000;MCOV=-1;PPC=0;AF=0.6210;RAF=0.6210;LS=3013780;MQS=37260;MQM=60;BQS=21508;BQM=41;EDS=960160;EDM=4471;IDS=619759;IDM=999;NVC=0;FLG=0;CED=601;HMP=6;SB=0.9997 GT:DP:AD:AF:RAF:NVC:FLG:SB:SC:PF 1:1000:621:0.6210:0.6210:0:0:0.9997:23.72:PASS freebayes #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT unknown MIR604.5901-10938 1401 . TC T 20562.3 . AB=0.578755;ABP=61.8389;AC=2;AF=0.666667;AN=3;AO=632;CIGAR=1M1D;DP=1092;DPRA=0;EPP=10.9266;EPPR=20.4059;GTI=0;HWE=-0;LEN=1;MEANALT=20;MQM=60;MQMR=60;NS=1;NUMALT=1;ODDS=371.527;PAIRED=0;PAIREDR=0;RO=364;RPP=626.539;RPPR=393.971;RUN=1;SAP=16.2178;SRP=20.4059;TYPE=del;XAI=0.00135563;XAM=0.00142991;XAS=7.42841e-05;XRI=0.00134239;XRM=0.00140067;XRS=5.82842e-05;BVAR GT:DP:RO:QR:AO:QA 0/1/1:1092:364:32027:632:23740 HaplotypeCaller #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT MIR604-F2-M2-1A MIR604.5901-10938 1401 . TC T 2899.01 . AC=1;AF=1.00;AN=1;BaseQRankSum=0.922;DP=342;FS=5.456;MLEAC=1;MLEAF=1.00;MQ=60.00;MQRankSum=0.000;QD=8.48;ReadPosRankSum=-0.229;SOR=0.463 GT:AD:DP:GQ:PL 1:119,223:342:99:2909,0
All three variant callers call a 1 bp deletion at the homopolymer. What is curious though is the apparent frequency of the alternate allele each of these tools perceive compared to what is visible in IGV. IGV indicates that variants at this position are relatively rare (1-2%) where as each of these callers is reporting the variant is observed in the majority of reads (~60%). Any ideas why?