Understanding VCF File Format
10.8 years ago

Let me start by saying I've spent hours at http://vcftools.sourceforge.net/specs.html, but I still don't understand some of the VCF fields.

For example, in the following record I don't understand some of the GT:AD:DP:GQ:PL information:

chr1 897723 rs6696911 C T 453.42 PASS AC=1;AF=0.50;AN=2;BaseQRankSum=-0.479;DB;DP=36;Dels=0.00;FS=1.480;HRun=2;HaplotypeScore=0.0000;MQ=41.37;MQ0=0;MQRankSum=-0.578;QD=12.60;ReadPosRankSum=0.842

GT = 1/1 - I'm pretty sure this means both alleles are T, whereas 1/0 would mean heterozygous for the ref and the SNP?
AD = 19,17 - I can't find an explanation of what AD means.
DP = 36 - easy to understand.
GQ = 87.16 - why are there two values in this field?
PL = 483,0,532 - I'm a bit baffled by this field.

Thanks for your help. Trying to get there.


Thanks for the link, but I have been there too. Thanks for the AD definition. I understand that PL is the Phred-scaled genotype likelihood, but why are there three values? Thanks.


Also, why are there two genotype quality values? Thanks.


There is only one GQ value. It is 87.16 (87 and 16/100)

10.8 years ago
Swbarnes2 ★ 1.5k

For a biallelic site, the PL has three numbers. The first is the Phred-scaled likelihood that the sample is homozygous reference, the second that it is heterozygous, and the third that it is homozygous for the alternate allele. The values are normalized so that the most likely genotype gets 0, and the higher the number, the less likely it is that your sample has that genotype. So if your PL is 483,0,532, the software is quite sure that your sample is not homozygous reference or homozygous alternate; it's heterozygous. And the GT shows that, by being 0/1. If the first and last numbers had been lower, the quality of the SNP would be poorer and the genotype call would be less confident.
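As a rough sketch of the arithmetic (not taken from GATK or any VCF library), a PL value can be turned back into a likelihood relative to the best genotype with 10^(-PL/10). The function name `pl_to_relative_likelihoods` is made up for illustration.

```python
def pl_to_relative_likelihoods(pl):
    """Convert normalized Phred-scaled likelihoods into likelihoods
    relative to the most likely genotype (which has PL = 0)."""
    return [10 ** (-p / 10) for p in pl]


pl = [483, 0, 532]                  # PL from the record in the question
genotypes = ["0/0", "0/1", "1/1"]   # order for a diploid, biallelic site

rel = pl_to_relative_likelihoods(pl)
best = genotypes[pl.index(min(pl))]  # smallest PL = most likely genotype

for gt, p, r in zip(genotypes, pl, rel):
    print(f"{gt}: PL={p} relative likelihood={r:.3g}")
print("most likely genotype:", best)
```

Each 10 units of PL is a factor of 10 in likelihood, so a PL of 483 means that genotype is many orders of magnitude less likely than the heterozygous call.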


I think that means you have 19 reads showing the reference allele and 17 reads showing the alternate allele. Those do add up to 36, which is your total depth.
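A minimal sketch of that check in Python, assuming the sample column looks like the string below. It is assembled from the values quoted in the question, with GT set to 0/1 as stated in the answer, so treat it as illustrative rather than the actual line from the file.

```python
# FORMAT column and a matching sample column (assumed, see note above).
format_keys = "GT:AD:DP:GQ:PL".split(":")
sample_values = "0/1:19,17:36:87.16:483,0,532".split(":")

# Pair each FORMAT key with its value for this sample.
sample = dict(zip(format_keys, sample_values))

ad = [int(x) for x in sample["AD"].split(",")]  # [ref reads, alt reads]
dp = int(sample["DP"])

ref_reads, alt_reads = ad
alt_fraction = alt_reads / sum(ad)

print("AD sums to DP:", sum(ad) == dp)
print("alt allele fraction:", round(alt_fraction, 3))
```

An alternate allele fraction near 0.5 is what you would expect for a heterozygous call.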


Why does the AD have two values?

10.5 years ago

From a VCF file generated by GATK's UnifiedGenotyper:

##fileformat=VCFv4.1
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">


This is not described in the VCF v4.1 format specs, although they do mention: "Additional Genotype fields can be defined in the meta-information. However, software support for such fields is not guaranteed."
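For what it's worth, Number=G in the PL header line means one value per possible genotype, which is why a diploid sample at a biallelic site has three PL values. Below is a small sketch of the ordering the VCF specification defines for diploid genotypes, where genotype j/k (with j <= k) sits at index k*(k+1)/2 + j. The function name `diploid_genotype_order` is hypothetical.

```python
def diploid_genotype_order(n_alleles):
    """List diploid genotypes in VCF Number=G order.

    Alleles are numbered 0 (ref), 1, 2, ... and genotype j/k (j <= k)
    ends up at index k*(k+1)//2 + j.
    """
    order = []
    for k in range(n_alleles):
        for j in range(k + 1):
            order.append(f"{j}/{k}")
    return order


print(diploid_genotype_order(2))  # ref + one alt allele: three genotypes
print(diploid_genotype_order(3))  # ref + two alt alleles: six genotypes
```

With one alternate allele this gives 0/0, 0/1, 1/1, matching the three PL values in the question.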

10.5 years ago
Nick H ▴ 40

AD = Allelic Depth, which is the number of reads that have the reference vs. non-reference base. In this case, 19 ref and 17 alternate.

These two values will usually, but not always, sum to the DP value. Reads that are not used for calling are not counted in the DP measure, but are included in AD.

10.8 years ago
Rlong ▴ 340

For another reference, try the 1000 Genomes site. AD stands for allele depth, GQ is genotype quality (a single float value), and PL is the Phred-scaled genotype likelihood.