Question: Confusing Definition In Vcfv4.1: Info=<Id=Dp....>
3
8.5 years ago by
Lds420
Lds420 wrote:

Take one VCF file for example,

``````##fileformat=VCFv4.1
##samtoolsVersion=0.1.16 (r963:234)
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the site allele frequency of the first ALT allele">
....
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SLVi33.16       SLVi33.25       SLVi33.26
chr1    85691   .       C       T       6.26    .       DP=2;AF1=1;CI95=0.3333,1;DP4=0,0,2,0;MQ=40;FQ=-28.9     GT:PL:GQ        0/1:0,0,0:3     1/1:20,3,0:4    1/1:18,3,0:4
chr1    88338   .       G       A       6.95    .       DP=3;AF1=1;CI95=0.1667,1;DP4=0,0,2,1;MQ=30;FQ=-28.1     GT:PL:GQ        0/1:0,0,0:3     0/1:0,0,0:3     1/1:38,9,0:8
``````

The DP=2 at chr1:85691, however, there are three individuals in the file, why does every individual has genotype? As AF1 means allele frequency of ALT allele, it supposed to be AF1=5/6 at chr1:85691, why it's truely AF1=1? I'm so confused, anybody can help me? Thanks in advance.

vcf • 2.9k views
modified 24 months ago by Biostar ♦♦ 20 • written 8.5 years ago by Lds420
4
8.5 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

DP is just the raw read depth at that position. DP of 2 just means 2 reads out of your 3 individual samples mapped to that position and passes the quality filter. It doesn't say whether it is 2 reads from one of your individual or 2 read from 2 individuals..etc.

AF1 of 1 pretty much just means the 2 reads (2/2) both have a T instead of an C at that position.

Why are you expecting it to be 5/6? Are you trying to compare this to some other data?

edit **

Samtools is really not great at outputting and calculating the genotype field. Here is another biostar questions about it.

The PL fields showing 0,0,0 pretty much means there are no reads covering it. It looks like samtools falsely calculates 0,0,0 as heterozygous because the probabilities are similar (all zeroes in this case).

1

Thanks so much. You've got the point. When I add flag `-D` in samtools to call the variants, it keep read depth for each individual. What I get for chr1:85691 is,

``````chr1    85691   .   C   T   6.26    . DP=2;AF1=1;CI95=0.3333,1;DP4=0,0,2,0;MQ=40;FQ=-28.9   GT:PL:DP:GQ 0/1:0,0,0:0:3 1/1:20,3,0:1:4    1/1:18,3,0:1:4
``````

That's exactly what you said

1

So if I want to extract the genotype from VCF generated by samtools, I must be very careful about the DP, I should mark the genotype as missing data for samples with DP=0.

If DP of 2 means 2 reads out of 3 individuals, for example, if the 2 reads come from SLVi33.16 and SLVi33.25 instead of SLVi33.26, the genotype symbol for SLVi33.26 should be "./.". But in the file, the genotype symbols are 0/1, 1/1 and 1/1, which means the genotype for these 3 individuals is CT, TT and TT(T5+C1), respectively (and that why I said before AF1=5/6). In other word,we know the genotypes for 3 individuals, the DP should be greater than 6, but it's DP=2 here. How to explain this discordance? Thanks

Yeah I see what you mean. I think this maybe an error on samtools part. Samtools is pretty bad at outputting the genotype information. The fact that the PL fields show 0,0,0 pretty much means there is no read covering it.

Thanks so much. You've got the point. When I add flag `-D` in samtools to call the variants, it keep read depth for each individual. What I get for `chr1:85691` is,

``````chr1    85691   .   C   T   6.26    .   DP=2;AF1=1;CI95=0.3333,1;DP4=0,0,2,0;MQ=40;FQ=-28.9 GT:PL:DP:GQ 0/1:0,0,0:0:3   1/1:20,3,0:1:4  1/1:18,3,0:1:4
``````

That's exactly what you said