I have a vcf file of a plant species obtained with GATK4 and I would like to impute few missing genotypes, including genotype probabilities or likelihoods in the output, so that I can then convert to BIMBAM format to run GEMMA for GWAS.
I tried BEAGLE (both versions 4.1 and 5.1) for imputation, adding the gp=true option to include GP field in the output vcf file, but it does not do what I want. The output still only contains GT filed and no GP field.
Here is my command:
java -Xss5m -Xmx20g -jar beagle.jar gt=input.vcf gp=true out=imputed_example
And a small example of my input.vcf file:
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TA_AM_01_01_F3_CC0_M1_1 TA_DE_01_02_F1_HC0_M1_1 TA_DE_01_04_F1_HC0_M1_1
Scaffold_1  2393    .   C   A   90.60   PASS    AC=4;AF=0.010;AN=386;BaseQRankSum=0.00;DP=981;ExcessHet=0.0280;FS=2.077;InbreedingCoeff=0.2413;MQ=22.83;MQRankSum=0.319;QD=6.97;ReadPosRankSum=-6.740e-01;SOR=1.793 GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   0/0:9,0:9:27:0,27,326   0/0:7,0:7:21:0,21,266
Scaffold_1  2413    .   G   T   210.92  PASS    AC=7;AF=0.018;AN=398;BaseQRankSum=0.00;DP=1383;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.3551;MQ=22.52;MQRankSum=-1.383e+00;QD=16.22;ReadPosRankSum=-1.383e+00;SOR=0.666  GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   0/0:10,0:10:30:0,30,391 0/0:7,0:7:21:0,21,292
Scaffold_1  2520    .   A   T   33956.70    PASS    AC=170;AF=0.417;AN=408;BaseQRankSum=0.00;DP=2159;ExcessHet=0.0000;FS=2.541;InbreedingCoeff=0.8530;MQ=24.94;MQRankSum=-1.000e+00;QD=27.82;ReadPosRankSum=0.00;SOR=1.255  GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:0,0:0:.:.:.:0,0,0   1|1:0,10:10:30:1|1:8017321_A_T:408,30,0:8017321 0/0:8,0:8:24:.:.:0,24,308
Scaffold_1  2530    .   C   T   31680.20    PASS    AC=167;AF=0.413;AN=404;BaseQRankSum=0.00;DP=1979;ExcessHet=-0.0000;FS=2.549;InbreedingCoeff=0.8469;MQ=25.10;MQRankSum=-1.280e+00;QD=31.17;ReadPosRankSum=0.319;SOR=1.291    GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:0,0:0:.:.:.:0,0,0   1|1:0,8:8:24:1|1:8017321_A_T:342,24,0:8017321   0/0:8,0:8:24:.:.:0,24,308
Does anyone know how to include GP field in BEAGLE output? Or any other software that can use phasing and LD to produce genotype probabilities or likelihoods for missing genotype calls?
Thanks very much
Can you provide some lines of the output of the vcf which you thought should have contained the GP values?
Sure, the output looks like this, with GT, but no GP values:
All of those snps were in the original VCF, so they won't have gps, since only imputed markers would have GPs. can you show some imputed SNPs in the output file?
Well, all those SNPs are missing in the first individual (TA_AM_01_01_F3_CC0_M1_1) in the input (GT --> "./.") and they are imputed and genotyped in the output (0|0). Or do you mean that GPs are given only for SNP absent in all individuals and present only in the reference panel (which I am not using because I don't have it for my species)?
I think the latter might be true, but I'll have a test run today on my PC to see if it is.