How to interpret VCF FORMAT fields more accurately?
9.0 years ago
mangfu100 ▴ 800

Hi all.

I am working with VCF files to understand the format better so that I can apply it more widely.

While going through my VCF files, some questions came up. Below is one example row from my VCF.

1       898921  .       C       G,<X>   0       .       DP=211;I16=112,51,1,0,7057,346989,16,256,8061,399183,50,2500,3220,73296,25,625;QS=0.997431,0.00256946,0;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.792466;BQB=1;MQ0F=0    PL:DP:DV:DPR    0,255,255,255,255,255:164:1:163,1,0

As you can see, it uses four FORMAT fields, PL, DP, DV, and DPR, which are defined in the header as follows:

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of high-quality non-reference bases">
##FORMAT=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">

First, the PL field. My example's PL values are 0,255,255,255,255,255. These are six comma-separated values, and I think the first three are for the reference allele and the others are for the alternate allele. Concretely, a value of 255 corresponds to a likelihood of 10^(-25.5), which is very close to zero, and the remaining values are the same. How should I interpret this? I found that almost all of the values are 255.

Secondly, the DPR field has three comma-separated values. In my example it is 163,1,0. I understand that the first two are the numbers of reads supporting the ref and alt alleles, respectively. But what does the third value, which is zero, mean? I don't get it. Help me!

next-gen genome alignment • 4.8k views

Just curious, but what software generated this VCF file?

9.0 years ago

In your example, there are two alternate alleles (G and <X>), so three alleles in total. Hence, DPR gives the number of high-quality bases observed for the reference allele, the first alternate allele, and the second alternate allele, in that order. The third value is 0 because no bases supporting <X> were observed. With three alleles, there are six possible diploid genotypes: AA, AB, BB, AC, BC, and CC, where A is the reference allele and B and C are the two alternates. PL holds the Phred-scaled likelihood of each of these genotypes, in that order. In your example, the most likely genotype is therefore AA.
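To make the ordering above concrete, here is a minimal Python sketch that pairs each PL value with its genotype and each DPR value with its allele for the record in the question. The helper name `genotype_order` is made up for illustration. It assumes a diploid, unphased sample and the VCF convention that genotype j/k (with j <= k) sits at index k*(k+1)/2 + j.

```python
def genotype_order(alleles):
    """Return unphased diploid genotypes in the order used by Number=G fields.

    Assumption: genotype j/k (j <= k) is stored at index k*(k+1)/2 + j,
    which yields AA, AB, BB, AC, BC, CC for three alleles.
    """
    n = len(alleles)
    pairs = [(j, k) for k in range(n) for j in range(k + 1)]
    return [f"{alleles[j]}/{alleles[k]}" for j, k in pairs]


# REF followed by the ALT alleles, copied from the record in the question.
alleles = ["C", "G", "<X>"]

# FORMAT column and the single sample column from the same record.
fmt = "PL:DP:DV:DPR".split(":")
sample = "0,255,255,255,255,255:164:1:163,1,0".split(":")
fields = dict(zip(fmt, sample))

pl = [int(x) for x in fields["PL"].split(",")]
dpr = [int(x) for x in fields["DPR"].split(",")]

genotypes = genotype_order(alleles)
pl_by_genotype = dict(zip(genotypes, pl))   # one PL per genotype (Number=G)
depth_by_allele = dict(zip(alleles, dpr))   # one depth per allele (Number=R)

for gt, value in pl_by_genotype.items():
    print(gt, value)
for allele, depth in depth_by_allele.items():
    print(allele, depth)
```

Here C/C is the only genotype with PL 0, and the depths map to C=163, G=1, <X>=0.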


Thank you for your response.

One more question!

How did you know that the genotype is AA in my example?

Could you explain in a little more detail? (I thought maybe it is because its PL score is zero... right?)


AA means "homozygous reference" in this case. I assume that to be true because that genotype has the highest likelihood (its PL is 0, and a lower PL means a higher likelihood), while all other genotypes have very low likelihoods (PL of 255).
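As a rough illustration of that reasoning, the sketch below converts the PL values from the question back to likelihoods using likelihood = 10^(-PL/10), then rescales them to sum to 1. The function name `pl_to_relative_likelihoods` and the A/B/C genotype labels are only for illustration, and the rescaled numbers are relative likelihoods, not posterior probabilities, since no prior is applied.

```python
def pl_to_relative_likelihoods(pl):
    """Convert Phred-scaled likelihoods to likelihoods that sum to 1.

    Assumption: PL = -10 * log10(likelihood), so likelihood = 10 ** (-PL / 10).
    """
    raw = [10 ** (-p / 10) for p in pl]
    total = sum(raw)
    return [r / total for r in raw]


pl = [0, 255, 255, 255, 255, 255]             # PL values from the question
labels = ["AA", "AB", "BB", "AC", "BC", "CC"]  # genotype order for 3 alleles

relative = pl_to_relative_likelihoods(pl)

# The most likely genotype is the one with the smallest PL.
best = labels[pl.index(min(pl))]

for label, p, r in zip(labels, pl, relative):
    print(f"{label}\tPL={p}\trelative likelihood={r:.3g}")
print("most likely genotype:", best)
```

With these values, AA carries essentially all of the relative likelihood, and each of the other five genotypes is on the order of 10^(-25.5).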
