Question: Why Does Samtools/Bcftools Give Incorrect Genotypes And Innacurate Quality Scores?
gravatar for Casbon
9.3 years ago by
Casbon3.2k wrote:

I have resequenced a target region in depth and called with samtools with default parameters (except depth). I find the calls to be pretty inaccurate.

For example, two of these samples are from the 1000 genomes pilot 2 YRI family. I have used the 1000 genomes calls to check the genotypes at rs73479087 and we have:

  • NA19238 is a heterozgote
  • NA19239 is a homozygous variant

Now consider samtools output:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    FORMAT    NA19238    NA19239
chr11    59855960    rs73479087    A    G    110    .    GT:PL:DP:SP:GQ    0/1:46,0,81:142:0:24    0/1:62,4,0:8:0:23

Both are called as hets with very similar genotype qualities (24 vs 23). But lets look closer at the PL column, we have:

  • NA19238 has 46, 0, 81 - pretty strong evidence that this is a het. It is actually sequenced at depth 142 with an allele balance of 55%
  • NA19239 has 62, 4, 0 - which looks like a borderline homozygous variant call. It is actually sequenced at depth 8 with a single reference read.

This seems to show that samtools generates pretty accurate PLs. However, the genotype column is wrong here: why choose a het when the PL for a variant is lower?. Also the genotype quality score bears no relation to the depth or difference between PLs.

So am I doing it wrong, or is samtools/bcftools not producing accurate calls and scores? Is there something between the PLs and the actual calls?


Program: samtools (Tools for alignments in the SAM format) Version: 0.1.18 (r982:295)

Program: bcftools (Tools for data in the VCF/BCF formats) Version: 0.1.17-dev (r973:277)

Command line:

 samtools mpileup -d 1000000 -SD -l $i -uf /data/reference/ucsc/hg19/ucsc.hg19.fasta data.bam | bcftools view -Ibvcg - > bcf.$i.bcf
vcf samtools bcftools genotyping • 8.0k views
ADD COMMENTlink modified 9.3 years ago by lh332k • written 9.3 years ago by Casbon3.2k

Might be worth adding the version of samtools and the command line switches used to get that output.

ADD REPLYlink written 9.3 years ago by Peter5.9k

related post? Why does mpileup skip my mutation ?

ADD REPLYlink modified 8.0 years ago by Istvan Albert ♦♦ 86k • written 9.3 years ago by Pierre Lindenbaum133k
gravatar for lh3
9.3 years ago by
United States
lh332k wrote:

Yes, this is a flaw of bcftools and that is exactly why bcftools does not output genotypes by default.

On the other hand, the whole idea of bcftools is to do inference without knowing genotypes. Bcftools never looks at the genotypes itself infers, but can pretty accurately estimate site allele frequency, compute the LD r^2, discovery very rare mutations and test association and HWE. Actually if you ascertain genotypes without imputation and then do things above, you frequently get much worse results.

So, my suggestion is not to look at genotypes. Bcftools has already computed many statistics you are interested in, only in the right way.

ADD COMMENTlink written 9.3 years ago by lh332k

Well, as I understood it, the point is to avoid genotype calling in low coverage data. Since this it high coverage, this makes bcftools inappropriate?

ADD REPLYlink written 9.3 years ago by Casbon3.2k

I am using SAMTools Version: 1.3 (using htslib 1.3) , How to get GQ score for my variants ? I have used this command. samtools mpileup -DguS -f PMC01-01.bam | bcftools call -vmO z -o PMC01-01.vcf.gz

ADD REPLYlink written 4.7 years ago by always_learning1.1k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2596 users visited in the last hour