Question: How to interpret the VCF FORMAT fields more accurately?
3.9 years ago by
Korea, Republic Of
mangfu100680 wrote:

Hi all.

I am working with VCF files to understand the format's characteristics and to apply it more widely.

While working with my VCF files I ran into some questions. Below is one example row from my VCF.

1       898921  .       C       G,<X>   0       .       DP=211;I16=112,51,1,0,7057,346989,16,256,8061,399183,50,2500,3220,73296,25,625;QS=0.997431,0.00256946,0;SGB=-0.379885;RPB=1;MQB=1;MQSB=0.792466;BQB=1;MQ0F=0    PL:DP:DV:DPR    0,255,255,255,255,255:164:1:163,1,0

As you can see, the row uses four FORMAT fields (PL, DP, DV and DPR), which are defined in the header as follows:

##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Number of high-quality bases">
##FORMAT=<ID=DV,Number=1,Type=Integer,Description="Number of high-quality non-reference bases">
##FORMAT=<ID=DPR,Number=R,Type=Integer,Description="Number of high-quality bases observed for each allele">

First, the PL field. In my example the PL values are 0,255,255,255,255,255, i.e. six comma-separated values. I think the first three are for the reference allele and the other three are for the alternate allele. Concretely, a value of 255 corresponds to a likelihood of 10^(-25.5), which is very close to zero, and the remaining values are the same. How should I interpret this? I found that almost all of the values are 255.


Secondly, the DPR field has three comma-separated values; in my example it is 163,1,0. I understand that the first two are the numbers of reads supporting the ref and alt alleles, respectively. But what does the third value, which is zero, represent? I don't get it. Please help!





alignment next-gen genome • 2.7k views
modified 3.9 years ago by Sean Davis • written 3.9 years ago by mangfu100

Just curious, but what software generated this VCF file?

written 3.9 years ago by Sean Davis
3.9 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

In your example there are two alternate alleles (G and <X>), so three alleles in total. Hence DPR gives the number of high-quality bases observed for the reference allele, the first alternate allele, and the second alternate allele, in that order. With three alleles there are six possible diploid genotypes: AA, AB, BB, AC, BC, and CC. The PL field gives the Phred-scaled likelihood of each of these genotypes, in that order. In your example, the most likely genotype is therefore AA, the only one with a PL of 0.
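The ordering described above can be sketched in a few lines of Python. This is only an illustration built from the values in the question's example row; the variable and function names (`genotype_order`, `pl_by_genotype`, `depth_by_allele`) are made up for this sketch and do not come from any VCF library. It assumes a diploid sample, with A = C (REF), B = G and C = <X>.

```python
# Values copied by hand from the example row in the question.
ref = "C"
alts = ["G", "<X>"]
pl = [0, 255, 255, 255, 255, 255]   # FORMAT/PL, one value per genotype
dpr = [163, 1, 0]                   # FORMAT/DPR, one value per allele (REF first)

alleles = [ref] + alts              # allele index 0 is REF, 1 and up are ALT


def genotype_order(n_alleles):
    """Diploid genotypes (j, k) with j <= k, in PL order.

    The genotype j/k sits at position k*(k+1)//2 + j, which for three
    alleles gives 0/0, 0/1, 1/1, 0/2, 1/2, 2/2 (AA, AB, BB, AC, BC, CC).
    """
    return [(j, k) for k in range(n_alleles) for j in range(k + 1)]


genotypes = genotype_order(len(alleles))

# Pair every genotype with its PL value, and every allele with its depth.
pl_by_genotype = {
    f"{alleles[j]}/{alleles[k]}": value
    for (j, k), value in zip(genotypes, pl)
}
depth_by_allele = dict(zip(alleles, dpr))

# The most likely genotype is the one with the smallest PL.
best = min(pl_by_genotype, key=pl_by_genotype.get)

print(pl_by_genotype)
print(depth_by_allele)
print("most likely genotype:", best)
```

So the third DPR value (0) is simply the depth for the second alternate allele, <X>, and the six PL values cover every pairing of the three alleles rather than "three for ref, three for alt".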

written 3.9 years ago by Sean Davis

Thank you for your response.

One more question!

How did you know that the genotype is AA in my example?

Could you explain in a little more detail? (I thought maybe it is because its PL score is zero... right?)

modified 3.9 years ago • written 3.9 years ago by mangfu100

AA means "homozygous reference" in this case. I assume that to be true because that genotype has the highest likelihood (its PL is 0, and a lower PL means a higher likelihood) and all other genotypes have very low likelihoods.
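To put numbers on "highest likelihood", here is a small Python sketch that converts the PL values from the example back into relative likelihoods using likelihood = 10^(-PL/10). The function name `pl_to_relative_likelihoods` is invented for this illustration. The final normalization step assumes every genotype is equally likely a priori, which is purely an assumption of this sketch and not something the variant caller necessarily does.

```python
# PL values copied from the example row in the question.
pl = [0, 255, 255, 255, 255, 255]


def pl_to_relative_likelihoods(pl_values):
    """Undo the Phred scaling: PL = -10 * log10(L), so L = 10 ** (-PL / 10).

    The result is relative to the best genotype, which has PL 0 and
    therefore a relative likelihood of exactly 1.0.
    """
    return [10 ** (-value / 10) for value in pl_values]


relative = pl_to_relative_likelihoods(pl)

# Illustration only: normalize so the values sum to 1, assuming a flat prior.
total = sum(relative)
normalized = [value / total for value in relative]

best_index = pl.index(min(pl))  # position 0 is the first genotype, AA (0/0)

for index, (phred, rel) in enumerate(zip(pl, relative)):
    print(f"genotype #{index}: PL={phred:3d}  relative likelihood={rel:.3g}")
print("most likely genotype index:", best_index)
```

A PL of 255 corresponds to 10^(-25.5), so each of the five other genotypes is about 25 orders of magnitude less likely than AA, which is why AA is the call.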

written 3.9 years ago by Sean Davis