Question

How To Distinguish Heterozygotes And Homozygotes From Variants In Vcf Format?

9

Entering edit mode

13.8 years ago

Shinvkuo ▴ 130

Hello, Recently, I used samtools to call variants and got the results in VCF format. I need to distinguish heterozygotes and homozygotes for each sample, then I seeked answers in forums and groups, and found several rules: 1) FQ; 2)AF1; 3)PL; 4)GT.

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1.fastq.gz.baf.sorted.bam
chr10   126278  .       TAA     TCTGAAA 5.8     .       INDEL;DP=3;VDB=0.0404;AF1=1;AC1=2;DP4=0,0,1,0;MQ=42;FQ=-37.5    GT:PL:GQ        1/1:42,3,0:3
chr10   126402  .       T       C       3.98    .       DP=2;VDB=0.0200;AF1=1;AC1=2;DP4=0,0,0,2;MQ=32;FQ=-33    GT:PL:GQ        1/1:33,6,0:5

1) FQ: Notably, given one sample, FQ is positive at hets and negative at homs. From: http://samtools.sourceforge.net/mpileup.shtml <FQ=->homs

2) AF1: <AF1=1>homs

3) PL: For example, suppose REF=C and ALT=A,G, PL=7,0,37,13,40,49 means for the sample we are looking at, P(D|CC)=10^{-0.7}, P(D|CA)=1, P(D|AA)=10^{-3.7}, P(D|CG)=10^{-1.3}, P(D|AG)=1e-4 and P(D|GG)=10^{-4.9}. From: http://samtools.sourceforge.net/mpileup.shtml <,0:>homs

4) GT: <1/1>homs

According these rules, I did a rough statistics of homs and hets using the following commands:

ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"FQ=-\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"AF1=1\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \",0:\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"1/1\"|wc -l && echo "$1; }'|sh

And I got different numbers of homs and hets for each sample, e.g.

20344    sample1.vcf.gz    # "FQ=-"; homs
16954    sample1.vcf.gz    # "AF1=1"; homs
16565    sample1.vcf.gz    # ",0:"; homs
15543    sample1.vcf.gz    # "1/1"; homs

7600    sample1.vcf.gz    # not "FQ=-"; hets
10990    sample1.vcf.gz    # not "AF1=1"; hets
11379    sample1.vcf.gz    # not ",0:"; hets
12401    sample1.vcf.gz    # not "1/1"; hets

How could this happen? Which is the most accurate rule for distinguishing heterozygotes and homozygotes, and any more rules to do this job?

Any suggestions and discussions could help me. Thanks!

vcf samtools • 20k views

ADD COMMENT • link updated 6.6 years ago by Biostar 20 • written 13.8 years ago by Shinvkuo ▴ 130

0

Entering edit mode

I have used snpEff and SeattleSeq Annotation tools to do variants' annotation, and found that they seems using the GT(Genotype) rule to distinguish homs and hets.

ADD REPLY • link 13.8 years ago by Shinvkuo ▴ 130

score 8 · Answer 1 · 2011-10-11

This is all explained in the VCF 4.0 Spec here

From the website:

As with the INFO field, there are several common, reserved keywords that are standards across the community:

GT genotype, encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:

    / : genotype unphased
    | : genotype phased

score 1 · Answer 2 · 2011-10-25

1

Entering edit mode

13.7 years ago

Swbarnes2 ★ 1.6k

How did you screen for quality? Probably lots of the discrepancies are low quality SNPs. If you counted only high quality variants, you'd probably see far more agreement between the 4 measures.

ADD COMMENT • link 13.7 years ago by Swbarnes2 ★ 1.6k