I have resequenced a target region in depth and called with samtools with default parameters (except depth). I find the calls to be pretty inaccurate.
For example, two of these samples are from the 1000 genomes pilot 2 YRI family. I have used the 1000 genomes calls to check the genotypes at rs73479087 and we have:
- NA19238 is a heterozgote
- NA19239 is a homozygous variant
Now consider samtools output:
#CHROM POS ID REF ALT QUAL FILTER FORMAT NA19238 NA19239 chr11 59855960 rs73479087 A G 110 . GT:PL:DP:SP:GQ 0/1:46,0,81:142:0:24 0/1:62,4,0:8:0:23
Both are called as hets with very similar genotype qualities (24 vs 23). But lets look closer at the PL column, we have:
- NA19238 has 46, 0, 81 - pretty strong evidence that this is a het. It is actually sequenced at depth 142 with an allele balance of 55%
- NA19239 has 62, 4, 0 - which looks like a borderline homozygous variant call. It is actually sequenced at depth 8 with a single reference read.
This seems to show that samtools generates pretty accurate PLs. However, the genotype column is wrong here: why choose a het when the PL for a variant is lower?. Also the genotype quality score bears no relation to the depth or difference between PLs.
So am I doing it wrong, or is samtools/bcftools not producing accurate calls and scores? Is there something between the PLs and the actual calls?
Program: samtools (Tools for alignments in the SAM format) Version: 0.1.18 (r982:295)
Program: bcftools (Tools for data in the VCF/BCF formats) Version: 0.1.17-dev (r973:277)
samtools mpileup -d 1000000 -SD -l $i -uf /data/reference/ucsc/hg19/ucsc.hg19.fasta data.bam | bcftools view -Ibvcg - > bcf.$i.bcf