Hi everyone!
I'm new here, so don't hesitate to let me know if I need to improve my post in anyway.
Although I know that bcftools is not meant for polyploid scenarios, I have never managed to make somatic SNP callers like GATK's mutect2 work properly in the past, so I'm trying to tweak the conventional bcftools pipeline (mpileup, call, filter) so that I can at least get lower frequency alternative alleles. In general, I think I managed to figure it out how to do so by running "call" with a p-value threshold of 1 and then applying a filter over the allelic depth ratios (plus some minimum coverage threshold to reduce the risk of spurious calls). Since my scenario is polyploid-like (working on the rDNA, so a huge number of potentially distinct paralogous copies exist), I wanted to leverage the multiallelic caller mode, but I find myself a bit confused about some of its decisions, and I would really appreciate if anyone could help me understand. In particular, I don't know why the first of the following positions is considered a 0/1 genotype and the second a 1/2 (here using -A option to keep all alleles with associated reads):
KY962518.1 213 . T C,G 72 . DP=1226;VDB=0.396811;SGB=-0.693147;RPB=0.999284;MQB=0.971926;MQSB=0.0138092;BQB=0.457824;MQ0F=0;ICB=1;HOB=0.5;AC=1,0;AN=2;DP4=602,175,342,72;MQ=41 GT:PL:AD 0/1:106,0,255,141,98,255:777,221,193
KY962518.1 4508 . T C,G 7.04572 . DP=579;VDB=0.14821;SGB=-0.693147;RPB=0.451336;MQB=0.0796627;MQSB=0.971006;BQB=0.0137722;MQ0F=0.00172712;AC=1,1;AN=2;DP4=146,231,51,76;MQ=41 GT:PL:AD 1/2:68,56,248,69,0,251:377,67,60
Since it might be difficult to read such long and packed lines, the relevant bit is that position 213 is considered 0/1 (heterozygous reference and alternative) with 777 reference reads, 221 first alternative reads and 193 second alternative reads, while position 4508 is considered 1/2 (heterozygous for the two alternatives) with 377 reference reads, 67 first alternative reads and 60 second alternative reads.
This means that if I remove the -A option to ignore actually spurious read counts, such as the single G read in the following:
KY962518.1 347 . C T,G 228 PASS DP=1571;VDB=0.315109;SGB=-0.693147;RPB=0.990032;MQB=0.962798;MQSB=0.657726;BQB=0.312532;MQ0F=0;AC=2,0;AN=2;DP4=116,61,732,578;MQ=41 GT:PL:AD 1/1:255,187,0,255,255,255:177,1309,1
I also lose the calls for G in 213, but I retain both alternatives in 4508 (which originally made me think that 4508 was the only multiallelic position in this sample):
KY962518.1 213 . T C 72 . DP=1226;VDB=0.396811;SGB=-0.693147;RPB=0.999284;MQB=0.971926;MQSB=0.0138092;BQB=0.457824;MQ0F=0;ICB=1;HOB=0.5;AC=1;AN=2;DP4=602,175,342,72;MQ=41 GT:PL:AD 0/1:106,0,255:777,221
KY962518.1 4508 . T C,G 7.04572 . DP=579;VDB=0.14821;SGB=-0.693147;RPB=0.451336;MQB=0.0796627;MQSB=0.971006;BQB=0.0137722;MQ0F=0.00172712;AC=1,1;AN=2;DP4=146,231,51,76;MQ=41 GT:PL:AD 1/2:68,56,248,69,0,251:377,67,60
even if the genotype is supposed to be 1/2 instead of 0/1/2, as I think it should be in both cases, I can still retrieve and process appropriately if I retain all "real" alleles and their counts. Tweaking the p-value threshold doesn't seem to help at all in this case.
Does anyone know why the genotypes are called the way they are and if it could be possible to "encourage" them to become 1/2 in this kind of situations? Otherwise, can anyone think of a way of post-filtering particular alleles instead of entire positions if some requirements aren't met? I am also open to suggestions for other pipelines that could help me address this issue.
Thank you so much!