Question: How do genotype likelihoods (PL) work in gVCF files?
gravatar for popantrop
12 months ago by
popantrop40 wrote:

I ran GATK4 on a single WES BAM file from The Cancer Genome Atlas (TCGA) program.

This is how an example SNP call looks like:

chr22   10685693    .   A   G,<NON_REF> 307.60  .   BaseQRankSum=1.267;DP=52;ExcessHet=3.0103;MLEAC=1,0;MLEAF=0.500,0.00;MQRankSum=-4.582;RAW_MQandDP=91525,52;ReadPosRankSum=0.279 GT:AD:DP:GQ:PL:SB   0/1:34,18,0:52:99:315,0,697,414,751,1165:18,16,1,17

As you can see the PL field features six comma-separated integers: 315,0,697,414,751,1165. My understanding derived from the answer here is that this should correspond to the following (phred-adjusted) genotype likelihoods (N corresponds to NON_REF):

AA: 315
AG: 0
GG: 697
AN: 414
GN: 751
NN: 1165

It appears we have overwhelming evidence of a heterozygous SNP A/G: 34 calls for A, 18 calls for G and 0 calls for anything else. This data makes sense if you follow the formula for computing PL as shown here.

Briefly, probability of genotype given data is going to be very high (almost 1) for AG, and $-10* log(1) = 0$. Conversely it will be very low for CC, CT or TT (NON_REF), in the order of $10^-100$, and $-10 * log(10^-100) ~ 1165$. Given allelic imbalance of A vs G, it also makes sense that AA is more likely than GG.


As pointed out in the comments, gVCF file shouldn't be used in downstream analysis, one can combine (multiple) gVCF files(s) into finished genotypes using the GenotypeGVCFs command of gatk. This gets rid of NON_REF SNPs, with the result, which looks like this:

chr22   10685693        .       A       G       307.60  .       AC=1;AF=0.500;AN=2;BaseQRankSum=1.27;DP=52;ExcessHet=3.0103;FS=31.510;MLEAC=1;MLEAF=0.500;MQ=41.95;MQRankSum=-4.582e+00;QD=5.92;ReadPosRankSum=0.279;SOR=4.04  GT:AD:DP:GQ:PL  0/1:34,18:52:99:315,0,697

With the PL field corresponding to:

AA: 315
AG: 0
GG: 697
snp genotype vcf • 398 views
ADD COMMENTlink modified 12 months ago • written 12 months ago by popantrop40

you're currently looking at a *.g.vcf file, you'd better call GenotypeGVcf before looking at those values...

ADD REPLYlink written 12 months ago by Pierre Lindenbaum129k

Thanks, that sounds like good advise!

ADD REPLYlink written 12 months ago by popantrop40
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1690 users visited in the last hour