I have a vcf file where data has following structure.
##fileformat=VCFv4.1 ##fileDate=20140609 ##source=GAP VCF Generation V1.0; imputation by IMPUTE2 using 1000 genome Phase 3 as reference ##reference=Homo_sapiens_assembly19_v1 ##INFO=(ID=EXP_FREQ_A1,Number=.,Type=Float,Description="Expected frequency of allele coded 1") ##INFO=(ID=IMPINFO,Number=.,Type=Float,Description="Measure of the observed statistical information associated with the allele frequency estimate") ##INFO=(ID=CERTAINTY,Number=.,Type=Float,Description="Average certainty of best-guess genotypes") ##INFO=(ID=TYPE,Number=1,Type=Integer,Description="0-imputed SNP;2-genotyped SNP") ##INFO=(ID=MISS,Number=.,Type=Float,Description="Missingness of SNP with posterior probability threshold of 0.9") ##INFO=(ID=HW,Number=.,Type=Float,Description="Hardy Weinberg Equilibrium P value") ##FORMAT=(ID=GT:,Number=1,Type=String,Description="Best Guessed Genotype with posterior probability threshold of 0.9") ##FORMAT=(ID=GL,Number=3,Type=Float,Description="Posterior probability of 0/0, 0/1, and 1/1") ##FORMAT=(ID=DS,Number=1,Type=Float,Description="Dosage") ##FILTER=(ID=HWEPVALLT1E6,Description="Hardy-Weinberg p-Value <= 1E-6") ##FILTER=(ID=maf01,Description="allele frequency less than 1% or larger than 99%") ##FILTER=(ID=maf05,Description="allele frequency between 1-5% or 95-99%") ##FILTER=(ID=info4,Description="info from imputation less than 0.4") #CHROM POS ID REF ALT QUAL FILTER INFO FORMAT GTEX-P4PP-0004-SM-2H2WW GTEX-PWO3-0004-SM-2IJGP GTEX-SNOS-0004-SM-325AE GTEX-PW2O-0004-SM-2IJGJ 1 30923 1_30923_G_T_b37 G T . PASS EXP_FREQ_A1=0.732;IMPINFO=0.446;CERTAINTY=0.730;TYPE=0;MISS=0.7867;HW=0.31695 GT:GL:DS ./.:0.006,0.483,0.511:1.505 ./.:0.210,0.498,0.292:1.082 0/1:0.012,0.962,0.026:1.014 ./.:0.050,0.38,0.57:1.520 1 51479 1_51479_T_A_b37 T A . PASS EXP_FREQ_A1=0.182;IMPINFO=0.405;CERTAINTY=0.779;TYPE=0;MISS=0.6600;HW=0.87518 GT:GL:DS ./.:0.514,0.481,0.005:0.491 ./.:0.658,0.311,0.031:0.373 0/0:0.991,0.009,0:0.009 ./.:0.243,0.55,0.207:0.964
Now for every SNP in each sample I want to get the genotype(GT). When an entry has 0/0 or 0/1 or 1/1, I already have that information. But for many entry, GT is uncertain or not assigned indicated by ./. In such case I want to assign a genotype to that entry. I have used genomic likelihood(GL) to calculate that based on which likelihood have highest value. But I am now concerned whether I am doing it correctly. If a snp/sample has likelihood 0.48 to be 0/1 and 0.51 likelihood of being 0/0, my method would assign 0/0. But these two values are very close.
- What is the standard cutoff on GL to assign a genotype to a sample for a SNP?
- Since I want to relax the threshold a bit what criteria I should apply?