Question: How To Distinguish Heterozygotes And Homozygotes From Variants In Vcf Format?
9
gravatar for Shinvkuo
7.6 years ago by
Shinvkuo130
Shinvkuo130 wrote:

Hello, Recently, I used samtools to call variants and got the results in VCF format. I need to distinguish heterozygotes and homozygotes for each sample, then I seeked answers in forums and groups, and found several rules: 1) FQ; 2)AF1; 3)PL; 4)GT.

##fileformat=VCFv4.1
##samtoolsVersion=0.1.18 (r982:295)
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability of all samples being the same">
##INFO=<ID=AF1,Number=1,Type=Float,Description="Max-likelihood estimate of the first ALT allele frequency (assuming HWE)">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="List of Phred-scaled genotype likelihoods">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Sample1.fastq.gz.baf.sorted.bam
chr10   126278  .       TAA     TCTGAAA 5.8     .       INDEL;DP=3;VDB=0.0404;AF1=1;AC1=2;DP4=0,0,1,0;MQ=42;FQ=-37.5    GT:PL:GQ        1/1:42,3,0:3
chr10   126402  .       T       C       3.98    .       DP=2;VDB=0.0200;AF1=1;AC1=2;DP4=0,0,0,2;MQ=32;FQ=-33    GT:PL:GQ        1/1:33,6,0:5

1) FQ: Notably, given one sample, FQ is positive at hets and negative at homs. From: http://samtools.sourceforge.net/mpileup.shtml <FQ=->homs

2) AF1: <AF1=1>homs

3) PL: For example, suppose REF=C and ALT=A,G, PL=7,0,37,13,40,49 means for the sample we are looking at, P(D|CC)=10^{-0.7}, P(D|CA)=1, P(D|AA)=10^{-3.7}, P(D|CG)=10^{-1.3}, P(D|AG)=1e-4 and P(D|GG)=10^{-4.9}. From: http://samtools.sourceforge.net/mpileup.shtml <,0:>homs

4) GT: <1/1>homs

According these rules, I did a rough statistics of homs and hets using the following commands:

ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"FQ=-\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"AF1=1\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \",0:\"|wc -l && echo "$1; }'|sh
ls *.vcf.gz|awk '{ print "less -S "$1"|grep -v \"^#\"|grep -v INDEL|grep \"1/1\"|wc -l && echo "$1; }'|sh

And I got different numbers of homs and hets for each sample, e.g.

20344    sample1.vcf.gz    # "FQ=-"; homs
16954    sample1.vcf.gz    # "AF1=1"; homs
16565    sample1.vcf.gz    # ",0:"; homs
15543    sample1.vcf.gz    # "1/1"; homs

7600    sample1.vcf.gz    # not "FQ=-"; hets
10990    sample1.vcf.gz    # not "AF1=1"; hets
11379    sample1.vcf.gz    # not ",0:"; hets
12401    sample1.vcf.gz    # not "1/1"; hets

How could this happen? Which is the most accurate rule for distinguishing heterozygotes and homozygotes, and any more rules to do this job?

Any suggestions and discussions could help me. Thanks!

vcf samtools • 15k views
ADD COMMENTlink modified 5 months ago by Biostar ♦♦ 20 • written 7.6 years ago by Shinvkuo130

I have used snpEff and SeattleSeq Annotation tools to do variants' annotation, and found that they seems using the GT(Genotype) rule to distinguish homs and hets.

ADD REPLYlink written 7.6 years ago by Shinvkuo130
8
gravatar for Daniel Swan
7.6 years ago by
Daniel Swan13k
Aberdeen, UK
Daniel Swan13k wrote:

This is all explained in the VCF 4.0 Spec here

From the website:

As with the INFO field, there are several common, reserved keywords that are standards across the community:

GT genotype, encoded as alleles values separated by either of ”/” or “|”, e.g. The allele values are 0 for the reference allele (what is in the reference sequence), 1 for the first allele listed in ALT, 2 for the second allele list in ALT and so on. For diploid calls examples could be 0/1 or 1|0 etc. For haploid calls, e.g. on Y, male X, mitochondrion, only one allele value should be given. All samples must have GT call information; if a call cannot be made for a sample at a given locus, ”.” must be specified for each missing allele in the GT field (for example ./. for a diploid). The meanings of the separators are:

    / : genotype unphased
    | : genotype phased
ADD COMMENTlink written 7.6 years ago by Daniel Swan13k
1

That answers that the GT field is probably the best one to use, but it doesn't answer why the different "tests" give five different values. There's a 5k difference between tests, that's a lot!

ADD REPLYlink written 7.6 years ago by Louis Letourneau790
1
gravatar for Swbarnes2
7.6 years ago by
Swbarnes21.5k
Swbarnes21.5k wrote:

How did you screen for quality? Probably lots of the discrepancies are low quality SNPs. If you counted only high quality variants, you'd probably see far more agreement between the 4 measures.

ADD COMMENTlink written 7.6 years ago by Swbarnes21.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1881 users visited in the last hour