I am calling variants from ~100 samples jointly with the consensus caller in bcftools. I request the DP4 field (deprecated) which reports the number of reads which support [reference_forward, reference_reverse, variant_forward, variant_reverse]. The genotype provided in the GT fields often disagrees with what is contained in the DP4 field. Example sample 3 and sample 5 have a heterozygous GT despite having respectively only one reference supporting read or no read at all:
CHROM: chrom15 POS: 11169 REF G ALT A FORMAT : GT:PL:DP:SP:DP4:ADF:ADR:AD sample1 0/0:0,9,89:3:0:1,2,0,0:1,0:2,0:3,0 sample2 0/0:0,9,80:3:0:2,1,0,0:2,0:1,0:3,0 sample3 0/1:0,3,25:1:0:1,0,0,0:1,0:0,0:1,0 sample4 0/0:0,6,58:2:0:1,1,0,0:1,0:1,0:2,0 sample5 0/1:0,0,0:0:0:0,0,0,0:0,0:0,0:0,0 sample6 0/1:0,3,26:1:0:0,1,0,0:0,0:1,0:1,0 sample7 1/1:67,6,0:2:0:0,0,0,2:0,0:0,2:0,2 sample8 1/1:54,6,0:2:0:0,0,1,1:0,1:0,1:0,2 sample9 0/1:0,0,0:0:0:0,0,0,0:0,0:0,0:0,0
Below is the genuine line in the the bcf:
chrom15 11169 . G A 999 . DP=268;VDB=0.049104;SGB=10.9066;RPB=0.295564;MQB=0.913243;MQSB=0.000116646;BQB=0.145125;MQ0F=0;AF1=0.524266;G3=0.370806,0.212478,0.416716;HWE=0.000166307;AC1=182;DP4=57,69,59,79;MQ=37;FQ=999;PV4=0.710785,0.0623765,1,1 GT:PL:DP:SP:DP4:ADF:ADR:AD 0/0:0,9,89:3:0:1,2,0,0:1,0:2,0:3,0 0/0:0,9,80:3:0:2,1,0,0:2,0:1,0:3,0 0/1:0,3,25:1:0:1,0,0,0:1,0:0,0:1,0 0/0:0,6,58:2:0:1,1,0,0:1,0:1,0:2,0 0/1:0,0,0:0:0:0,0,0,0:0,0:0,0:0,0 0/1:0,3,26:1:0:0,1,0,0:0,0:1,0:1,0 1/1:67,6,0:2:0:0,0,0,2:0,0:0,2:0,2 1/1:54,6,0:2:0:0,0,1,1:0,1:0,1:0,2 0/1:0,0,0:0:0:0,0,0,0:0,0:0,0:0,0 0/1:0,3,37:1:0:0,1,0,0:0,0:1,0:1,0 etc
I checked in a viewer if the DP4 fields are indeed correct and they are (a single read supporting reference in sample 3, and nothing in sample 5, as well as a correct report for all the other samples). Some samples do support a G->A SNP and those have a GT=1/1 field.
My question is, why bcftools call sets the samples 3 and 5 as heterozygous calls? Is there anything beyond read information that is taken into account that could influence the call this much?
The whole region is not heterozygous and there is no other SNP in the 200 positions before and after this one that could motivate a default call to heterozygous.
The solution is to use the bcftools multi-caller option -m . This seems to solve a large portion of the discrepancy between dp4 tags and GT field. However some problems remain like a heterozygous call with a 0,0,1,0 dp4 field. Maybe the environment of the position has an influence.