Question: QUAL Scores in 1000 Genomes VCF File
8.7 years ago by
Simon40 wrote:


In the 1000 genomes VCF files, QUAL represents "a phred-scaled quality score for the assertion made in ALT". Does anybody know how they actually calculated this and what factors they consider?

Thanks for any help!

genome • 7.0k views
8.7 years ago by
Cambridge UK
Laura1.7k wrote:

The majority of our SNP and indel sites are assessed using the Variant Quality Score Recalibrator (VQSR) from the Broad's GATK.

You should find the papers on both GATK and VQSR useful for explaining these things.

6.4 years ago by
Jonathan Crowther200 wrote:

The VCF QUAL score is simply a Phred-scaled quality score.

  • Phred quality score (Q)
  • Probability that the call is incorrect (P); for VCF QUAL, this is the probability that the assertion made in ALT is wrong

The formulas you require are:

Q = -10 * log10(P)
P = 10**(-Q/10)   (** means "to the power of")

So if you take the following as an example:

A Phred quality of 30 indicates a 1/1000 probability that the base has been called incorrectly,

so Q = 30 and P = 1/1000:

30 = -10 * log10(1/1000)
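As a quick check of the two formulas, here is a minimal Python sketch. The function names and the example record are made up for illustration; the only thing taken from the VCF format is that QUAL is the sixth tab-separated column and that a missing value is written as `.`.

```python
import math


def phred_to_prob(q: float) -> float:
    """Convert a Phred-scaled quality Q into an error probability P."""
    return 10 ** (-q / 10)


def prob_to_phred(p: float) -> float:
    """Convert an error probability P (0 < P <= 1) into a Phred-scaled quality Q."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -10 * math.log10(p)


def qual_from_vcf_line(line: str):
    """Return QUAL (6th tab-separated column) as a float, or None if it is '.'."""
    fields = line.rstrip("\n").split("\t")
    qual = fields[5]
    return None if qual == "." else float(qual)


# Hypothetical data line, invented for this example (not from a real 1000 Genomes file)
record = "20\t14370\t.\tG\tA\t30\tPASS\tDP=14"

q = qual_from_vcf_line(record)
print(q, phred_to_prob(q))
```

So a site with QUAL 30 corresponds to roughly a 1 in 1000 chance that the ALT assertion is wrong, and QUAL 20 to roughly 1 in 100.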



I hope this helps in some way.

8.7 years ago by
María20 wrote:


I found this in the GATK paper.

"In brief, our example genotyper computes the posterior probability of each genotype, given the pileup of sequencer reads that cover the current locus, and expected heterozygosity of the sample. This computation is used to derive the prior probability each of the possible 10 diploid genotypes, using the Bayesian formulation (Shoemaker et al. 1999)

[Formula here]

where D represents our data (the read base pileup at this reference base) and G represents the given genotype. The term p(G) is the prior probability of seeing this genotype, which is influenced by its identity as a homozygous reference, heterozygous, or homozygous nonreference genotype. The value p(D) is constant over all genotypes, and can be ignored, and

[another formula here]

where b represents each base covering the target locus. The probability of each base given the genotype is defined as [formula here], when the genotype G = {A1,A2} is decomposed into its two alleles. The probability of seeing a base given an allele is

[formula here]

and the epsilon term e is the reversed phred scaled quality score at the base. Finally, the assigned genotype at each site is the genotype with the greatest posterior probability, which is emitted to disk if its log-odds score exceeds a set threshold."

So, as I understand it, the estimate takes read depth and base quality into account.
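To make the quoted description concrete, here is a rough Python sketch of a Bayesian genotyper of that general shape. It is not GATK's code. The formulas are elided in the quote, so the pieces below are assumptions: the per-base error e is taken as 10**(-Q/10), a miscalled base is assumed equally likely to be any of the three other bases, each allele of a genotype contributes with weight 1/2, and the priors are a simplified, unnormalised scheme based on an illustrative heterozygosity of 0.001.

```python
from itertools import combinations_with_replacement

BASES = "ACGT"
# The 10 unordered diploid genotypes: AA, AC, AG, AT, CC, CG, CT, GG, GT, TT
GENOTYPES = list(combinations_with_replacement(BASES, 2))


def base_given_allele(base, allele, err):
    # Assumed error model: a wrong call is spread evenly over the 3 other bases
    return 1 - err if base == allele else err / 3


def genotype_posteriors(pileup, ref, het=0.001):
    """pileup: list of (base, phred_quality) pairs; ref: reference base.

    het is an assumed expected heterozygosity, used only to build
    illustrative priors for hom-ref, het and hom-nonref genotypes.
    """
    unnormalised = {}
    for a1, a2 in GENOTYPES:
        if a1 == a2 == ref:
            prior = 1 - 1.5 * het      # homozygous reference
        elif a1 != a2:
            prior = het                # heterozygous
        else:
            prior = het / 2            # homozygous non-reference
        likelihood = 1.0
        for base, qual in pileup:
            err = 10 ** (-qual / 10)   # Phred quality -> error probability
            likelihood *= (0.5 * base_given_allele(base, a1, err)
                           + 0.5 * base_given_allele(base, a2, err))
        unnormalised[a1 + a2] = prior * likelihood
    # Dividing by the sum plays the role of p(D), which is the same for every genotype
    total = sum(unnormalised.values())
    return {g: p / total for g, p in unnormalised.items()}


# Toy pileup: five A reads and five G reads, all at Q30, reference base A
pileup = [("A", 30)] * 5 + [("G", 30)] * 5
posteriors = genotype_posteriors(pileup, ref="A")
best = max(posteriors, key=posteriors.get)
print(best, posteriors[best])
```

More reads (depth) multiply in more likelihood terms, and higher base qualities make each term more decisive, which is how both factors feed into the result. Phred-scaling the posterior of the homozygous-reference genotype would give a QUAL-like number, but that is only an illustration of the idea, not the calculation used for the 1000 Genomes calls.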


8.7 years ago by
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

Since the 1000 Genomes calls are GATK-based, aside from the reading that Laura suggests, I would highly recommend digging into GATK's site for more detail:

  • The Unified Genotyper and its quality score calculation are described on the variant calling algorithm page, which should directly answer your question about the score and its formula.
  • It's also useful to know that GATK can use a set of known variant sites to perform base quality score recalibration, which ultimately helps the algorithm described above.
  • Finally, there are useful recommendations for variant detection, which include marking/removing duplicate reads, realigning around indels, and the recalibration mentioned above.