How to interpret the VCF FORMAT fields more accurately?
7.9 years ago
mangfu100 ▴ 800

Hi all.

I am working with VCF files to understand the format's characteristics so that I can apply it more widely.

While working with my VCF files I ran into some questions. Below is one example row from my VCF file.

1       898921  .       C       G,<X>   0       .       DP=211;I16=112,51,1,0,7057,346989,16,256,8061,399183,50,2500,3220,73296,25,625;QS=0.997431,0.00256946,0;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.792466;BQB=1;MQ0F=0    PL:DP:DV:DPR    0,255,255,255,255,255:164:1:163,1,0


As you can see, the row uses four FORMAT fields, PL, DP, DV and DPR, which are described in the header as follows:

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of high-quality non-reference bases">
##FORMAT=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">


First, the PL field. The PL values in my example are 0,255,255,255,255,255. There are six comma-separated values, and I think the first three are for the reference allele and the others are for the alternate allele. Concretely, a value of 255 corresponds to 10^(-25.5), which is very close to zero, and the remaining values are the same. How should I interpret this? I found that almost all of the values are 255.

Secondly, the DPR field has three comma-separated values; in my example it is 163,1,0. I understand that the first two are the numbers of reads supporting the ref and alt alleles, respectively. But what is the third value, which is zero? I don't get it. Help me!

next-gen genome alignment • 4.3k views

Just curious, but what software generated this VCF file?

7.9 years ago

In your example, there are two alternate alleles (G and <X>), so three alleles in total. Hence, DPR gives the number of high-quality bases observed for the reference allele, the first alternate allele, and the second alternate allele, in that order. With three alleles (call them A, B and C), there are six possible diploid genotypes: AA, AB, BB, AC, BC, and CC. The PL field gives the Phred-scaled likelihood of each of these genotypes, in that order. In your example, the most likely genotype is therefore AA.
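To make the ordering concrete, here is a minimal Python sketch that labels the PL and DPR values from the example row. The helper name `genotype_order` is made up for illustration; it assumes a diploid sample and the standard VCF genotype ordering, in which the genotype with allele indices (j, k), j <= k, sits at position k*(k+1)/2 + j.

```python
def genotype_order(n_alleles):
    """Return diploid genotypes as (j, k) allele-index pairs in VCF order.

    Hypothetical helper for illustration; assumes a diploid sample.
    """
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


# Values copied from the example row in the question.
alleles = ["C", "G", "<X>"]        # REF followed by the ALT alleles
pl = [0, 255, 255, 255, 255, 255]  # one value per genotype (Number=G)
dpr = [163, 1, 0]                  # one value per allele (Number=R)

genotypes = genotype_order(len(alleles))

# Pair each genotype with its PL value, and each allele with its depth.
labelled_pl = {
    f"{alleles[j]}/{alleles[k]}": value
    for (j, k), value in zip(genotypes, pl)
}
labelled_dpr = dict(zip(alleles, dpr))

for genotype, value in labelled_pl.items():
    print(genotype, value)
print(labelled_dpr)
```

Read this way, the third DPR value is simply the depth for the second alternate allele, and each of the six PL values belongs to one genotype rather than to one allele.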


One more question!

How did you know that the genotype is AA in my example?

Could you explain in a little more detail? (I thought it might be because its PL score is zero... right?)


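AA means "homozygous reference" in this case. I assume that to be true because that genotype has the highest likelihood (a PL of 0; a lower PL means a more likely genotype) and all other genotypes have very low likelihoods (a PL of 255, i.e. a relative likelihood of about 10^(-25.5)).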
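As a rough illustration of that reasoning, the sketch below converts the PL values from the example into relative likelihoods using 10^(-PL/10) and picks the genotype with the smallest PL. The function name `pl_to_probabilities` is hypothetical, and normalising the likelihoods so they sum to 1 assumes a flat prior over genotypes, which is a simplification rather than what any particular caller does.

```python
def pl_to_probabilities(pl):
    """Convert Phred-scaled likelihoods to normalised probabilities.

    Hypothetical helper; the normalisation assumes a flat genotype prior.
    """
    relative = [10 ** (-value / 10) for value in pl]
    total = sum(relative)
    return [r / total for r in relative]


# PL values from the example row, in VCF genotype order:
# AA, AB, BB, AC, BC, CC
pl = [0, 255, 255, 255, 255, 255]
labels = ["AA", "AB", "BB", "AC", "BC", "CC"]

probabilities = pl_to_probabilities(pl)

# The most likely genotype is the one with the smallest PL.
best = min(range(len(pl)), key=pl.__getitem__)

print("most likely genotype:", labels[best])
for label, value, prob in zip(labels, pl, probabilities):
    print(label, value, prob)
```

Under these assumptions AA takes essentially all of the probability mass, which matches the interpretation that a PL of 0 marks the best-supported genotype.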