Confusing Definition In Vcfv4.1: Info=<Id=Dp....>
1
3
Entering edit mode
12.2 years ago
Lds ▴ 450

Take one VCF file for example,

##fileformat=VCFv4.1
##samtoolsVersion=0.1.16 (r963:234)
##INFO=<ID=DP,Number=1,Type=Integer,Description="Raw read depth">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
....
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SLVi33.16       SLVi33.25       SLVi33.26
chr1    85691   .       C       T       6.26    .       DP=2;AF1=1;CI95=0.3333,1;DP4=0,0,2,0;MQ=40;FQ=-28.9     GT:PL:GQ        0/1:0,0,0:3     1/1:20,3,0:4    1/1:18,3,0:4
chr1    88338   .       G       A       6.95    .       DP=3;AF1=1;CI95=0.1667,1;DP4=0,0,2,1;MQ=30;FQ=-28.1     GT:PL:GQ        0/1:0,0,0:3     0/1:0,0,0:3     1/1:38,9,0:8

The DP=2 at chr1:85691, however, there are three individuals in the file, why does every individual has genotype? As AF1 means allele frequency of ALT allele, it supposed to be AF1=5/6 at chr1:85691, why it's truely AF1=1? I'm so confused, anybody can help me? Thanks in advance.

vcf • 4.2k views
ADD COMMENT
4
Entering edit mode
12.2 years ago

DP is just the raw read depth at that position. DP of 2 just means 2 reads out of your 3 individual samples mapped to that position and passes the quality filter. It doesn't say whether it is 2 reads from one of your individual or 2 read from 2 individuals..etc.

AF1 of 1 pretty much just means the 2 reads (2/2) both have a T instead of an C at that position.

Why are you expecting it to be 5/6? Are you trying to compare this to some other data?

edit **

Samtools is really not great at outputting and calculating the genotype field. Here is another biostar questions about it.

The PL fields showing 0,0,0 pretty much means there are no reads covering it. It looks like samtools falsely calculates 0,0,0 as heterozygous because the probabilities are similar (all zeroes in this case).

ADD COMMENT
1
Entering edit mode

Thanks so much. You've got the point. When I add flag -D in samtools to call the variants, it keep read depth for each individual. What I get for chr1:85691 is,

chr1    85691   .   C   T   6.26    . DP=2;AF1=1;CI95=0.3333,1;DP4=0,0,2,0;MQ=40;FQ=-28.9   GT:PL:DP:GQ 0/1:0,0,0:0:3 1/1:20,3,0:1:4    1/1:18,3,0:1:4

That's exactly what you said

ADD REPLY
1
Entering edit mode

So if I want to extract the genotype from VCF generated by samtools, I must be very careful about the DP, I should mark the genotype as missing data for samples with DP=0.

ADD REPLY
0
Entering edit mode

If DP of 2 means 2 reads out of 3 individuals, for example, if the 2 reads come from SLVi33.16 and SLVi33.25 instead of SLVi33.26, the genotype symbol for SLVi33.26 should be "./.". But in the file, the genotype symbols are 0/1, 1/1 and 1/1, which means the genotype for these 3 individuals is CT, TT and TT(T5+C1), respectively (and that why I said before AF1=5/6). In other word,we know the genotypes for 3 individuals, the DP should be greater than 6, but it's DP=2 here. How to explain this discordance? Thanks

ADD REPLY
0
Entering edit mode

Yeah I see what you mean. I think this maybe an error on samtools part. Samtools is pretty bad at outputting the genotype information. The fact that the PL fields show 0,0,0 pretty much means there is no read covering it.

ADD REPLY
0
Entering edit mode

Thanks so much. You've got the point. When I add flag -D in samtools to call the variants, it keep read depth for each individual. What I get for chr1:85691 is,

chr1    85691   .   C   T   6.26    .   DP=2;AF1=1;CI95=0.3333,1;DP4=0,0,2,0;MQ=40;FQ=-28.9 GT:PL:DP:GQ 0/1:0,0,0:0:3   1/1:20,3,0:1:4  1/1:18,3,0:1:4

That's exactly what you said

ADD REPLY

Login before adding your answer.

Traffic: 1678 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6