Question: Understanding VCF File Format
8.5 years ago by
Tryingtogetthere180 wrote:

Let me start by saying I've spent hours at http://vcftools.sourceforge.net/specs.html, but I still don't understand some of the VCF fields.

For example, in the following record I don't understand some of the GT:AD:DP:GQ:PL information:

chr1 897723 rs6696911 C T 453.42 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-0.479;DB;DP=36;Dels=0.00;FS=1.480;HRun=2;HaplotypeScore=0.0000;MQ=41.37;MQ0=0;MQRankSum=-0.578;QD=12.60;ReadPosRankSum=0.842

GT = 1/1: I'm pretty sure this means both alleles are T, whereas 1/0 would mean heterozygous for the reference and the SNP?

AD = 19,17: I can't find an explanation of what AD means.

DP = 36: easy to understand.

GQ = 87.16: why are there two values in this field?

PL = 483,0,532: I'm a bit baffled by this field.

thanks for your help, Trying to get there

vcf format • 26k views
modified 8.2 years ago by Jorge Amigo11k • written 8.5 years ago by Tryingtogetthere180

Thanks for the link, but I have been there too. Thanks for the AD definition. I understand that PL is the Phred-scaled genotype likelihood, but why are there three values? Thanks

Also, why are there two genotype quality values? Thanks

There is only one GQ value. It is 87.16 (87 and 16/100)

8.5 years ago by
Swbarnes21.5k
Swbarnes21.5k wrote:

For a biallelic site, the PL has three numbers. The first is the Phred-scaled likelihood that the sample is homozygous reference, the second that it is heterozygous, and the third that it is homozygous for the alternate allele. Because the values are Phred-scaled and normalized so that the most likely genotype gets 0, the higher the number, the less likely it is that your sample has that genotype. So if your PL is 483,0,532, the software is quite sure that your sample is not homozygous reference or homozygous alternate; it's heterozygous. And the GT shows that, by being 0/1. If the first and last numbers had been lower, then the quality of the SNP would be poorer, and the genotype call would be less confident.
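To make the PL explanation concrete, here is a minimal Python sketch that turns the PL triple from the question into likelihoods relative to the best genotype, using the usual Phred relation L = 10^(-PL/10). The function names are made up for illustration, and the sketch assumes a diploid sample at a site with one REF and one ALT allele.

```python
# Sketch: interpret a biallelic PL triple such as 483,0,532.
# Assumes a diploid sample and exactly one ALT allele.
GENOTYPES = ["0/0", "0/1", "1/1"]  # hom-ref, het, hom-alt


def pl_to_relative_likelihoods(pl):
    """Convert Phred-scaled values to likelihoods relative to the best genotype."""
    return [10 ** (-value / 10) for value in pl]


def most_likely_genotype(pl):
    """Return the genotype label with the smallest PL (0 after normalization)."""
    best_index = min(range(len(pl)), key=lambda i: pl[i])
    return GENOTYPES[best_index]


pl = [483, 0, 532]
rel = pl_to_relative_likelihoods(pl)
for genotype, phred, likelihood in zip(GENOTYPES, pl, rel):
    print(f"{genotype}: PL={phred} relative likelihood={likelihood:.3g}")
print("most likely genotype:", most_likely_genotype(pl))
```

A PL of 483 corresponds to a relative likelihood of about 10^-48, which is why the two homozygous genotypes are effectively ruled out here.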


I think that means you have 19 reads showing the reference allele, and 17 reads showing the alternate allele. Those do add up to 36, which is your total depth.

Why does the AD have two values?

8.2 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

From a VCF file generated by GATK's UnifiedGenotyper:

```
##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
```
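As a rough illustration of how header lines like these can be read programmatically, the Python sketch below parses a `##FORMAT` line with a regular expression and lists the diploid genotypes in the order the VCF specification defines for `Number=G` fields (the genotype j/k with j <= k sits at index k*(k+1)/2 + j). The helper names `parse_format_line` and `genotype_order` are hypothetical, and the regex only covers the simple layout shown in this header.

```python
import re

# Matches only the simple ID/Number/Type/Description layout shown above.
FORMAT_LINE = re.compile(
    r'##FORMAT=<ID=([^,]+),Number=([^,]+),Type=([^,]+),Description="(.*)">'
)


def parse_format_line(line):
    """Return the pieces of a ##FORMAT header line as a dict, or None if no match."""
    match = FORMAT_LINE.match(line.strip())
    if match is None:
        return None
    field_id, number, value_type, description = match.groups()
    return {"ID": field_id, "Number": number, "Type": value_type,
            "Description": description}


def genotype_order(n_alleles):
    """List diploid genotypes in VCF order for n_alleles alleles (REF included)."""
    return [f"{j}/{k}" for k in range(n_alleles) for j in range(k + 1)]


pl_header = ('##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, '
             'Phred-scaled likelihoods for genotypes as defined in the VCF specification">')
print(parse_format_line(pl_header)["Number"])
print(genotype_order(2))  # one REF + one ALT allele
print(genotype_order(3))  # one REF + two ALT alleles
```

With one ALT allele there are three genotypes, which is why PL carries three values at a biallelic site; with two ALT alleles there would be six.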

The AD field is not described in the VCF v4.1 format specs, although they do mention "Additional Genotype fields can be defined in the meta-information. However, software support for such fields is not guaranteed."

8.2 years ago by
Nick H40
Nick H40 wrote:

AD = Allelic Depth, which is the number of reads that have the reference vs. the non-reference base. In this case, 19 ref and 17 alternate.

These two values will usually, but not always, sum to the DP value. Reads that are not used for calling are not counted in the DP measure, but are included in AD.
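The relationship between AD and DP can be checked with a few lines of Python. This is only a sketch: the sample string below is assembled from the values quoted in this thread (with GT assumed to be 0/1, as in the answer above), not copied from the original file, and `parse_sample` is a hypothetical helper name.

```python
def parse_sample(format_column, sample_column):
    """Pair each key of the FORMAT column with the matching sample value."""
    keys = format_column.split(":")
    values = sample_column.split(":")
    return dict(zip(keys, values))


fmt = "GT:AD:DP:GQ:PL"
# Assumption: reconstructed from the values quoted in the thread, GT taken as 0/1.
sample = "0/1:19,17:36:87.16:483,0,532"

fields = parse_sample(fmt, sample)
ad = [int(x) for x in fields["AD"].split(",")]  # [ref reads, alt reads]
dp = int(fields["DP"])
ref_reads, alt_reads = ad
alt_fraction = alt_reads / sum(ad)

print(f"ref={ref_reads} alt={alt_reads} sum(AD)={sum(ad)} DP={dp}")
print(f"alt allele fraction = {alt_fraction:.3f}")
if sum(ad) != dp:
    print("AD and DP differ; some reads were counted in one but not the other")
```

An alternate-allele fraction close to 0.5, as here, is what one would expect for a heterozygous call.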

8.5 years ago by
Rlong340
US
Rlong340 wrote:

For another reference, try the 1000 Genomes site. AD stands for allele depth, GQ is genotype quality (a single float value), and PL is the Phred-scaled genotype likelihood.