How To Distinguish Heterozygotes And Homozygotes From Variants In Vcf Format?
2
9
Entering edit mode
12.5 years ago
Shinvkuo ▴ 130

Hello, Recently, I used samtools to call variants and got the results in VCF format. I need to distinguish heterozygotes and homozygotes for each sample, then I seeked answers in forums and groups, and found several rules: 1) FQ; 2)AF1; 3)PL; 4)GT.

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1.fastq.gz.baf.sorted.bam
chr10   126278  .       TAA     TCTGAAA 5.8     .       INDEL;DP=3;VDB=0.0404;AF1=1;AC1=2;DP4=0,0,1,0;MQ=42;FQ=-37.5    GT:PL:GQ        1/1:42,3,0:3
chr10   126402  .       T       C       3.98    .       DP=2;VDB=0.0200;AF1=1;AC1=2;DP4=0,0,0,2;MQ=32;FQ=-33    GT:PL:GQ        1/1:33,6,0:5

1) FQ: Notably, given one sample, FQ is positive at hets and negative at homs. From: http://samtools.sourceforge.net/mpileup.shtml <FQ=->homs

2) AF1: <AF1=1>homs

3) PL: For example, suppose REF=C and ALT=A,G, PL=7,0,37,13,40,49 means for the sample we are looking at, P(D|CC)=10^{-0.7}, P(D|CA)=1, P(D|AA)=10^{-3.7}, P(D|CG)=10^{-1.3}, P(D|AG)=1e-4 and P(D|GG)=10^{-4.9}. From: http://samtools.sourceforge.net/mpileup.shtml <,0:>homs

4) GT: <1/1>homs

According these rules, I did a rough statistics of homs and hets using the following commands:

ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"FQ=-\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"AF1=1\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \",0:\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"1/1\"|wc -l && echo "$1; }'|sh

And I got different numbers of homs and hets for each sample, e.g.

20344    sample1.vcf.gz    # "FQ=-"; homs
16954    sample1.vcf.gz    # "AF1=1"; homs
16565    sample1.vcf.gz    # ",0:"; homs
15543    sample1.vcf.gz    # "1/1"; homs

7600    sample1.vcf.gz    # not "FQ=-"; hets
10990    sample1.vcf.gz    # not "AF1=1"; hets
11379    sample1.vcf.gz    # not ",0:"; hets
12401    sample1.vcf.gz    # not "1/1"; hets

How could this happen? Which is the most accurate rule for distinguishing heterozygotes and homozygotes, and any more rules to do this job?

Any suggestions and discussions could help me. Thanks!

vcf samtools • 20k views
ADD COMMENT
0
Entering edit mode

I have used snpEff and SeattleSeq Annotation tools to do variants' annotation, and found that they seems using the GT(Genotype) rule to distinguish homs and hets.

ADD REPLY
8
Entering edit mode
12.5 years ago
User 59 13k

This is all explained in the VCF 4.0 Spec here

From the website:

As with the INFO field, there are several common, reserved keywords that are standards across the community:

GT genotype, encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:

    / : genotype unphased
    | : genotype phased
ADD COMMENT
1
Entering edit mode

That answers that the GT field is probably the best one to use, but it doesn't answer why the different "tests" give five different values. There's a 5k difference between tests, that's a lot!

ADD REPLY
1
Entering edit mode
12.5 years ago
Swbarnes2 ★ 1.6k

How did you screen for quality? Probably lots of the discrepancies are low quality SNPs. If you counted only high quality variants, you'd probably see far more agreement between the 4 measures.

ADD COMMENT

Login before adding your answer.

Traffic: 2749 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6