Impute genotype probabilities with BEAGLE
1
0
Entering edit mode
6 weeks ago

I have a vcf file of a plant species obtained with GATK4 and I would like to impute few missing genotypes, including genotype probabilities or likelihoods in the output, so that I can then convert to BIMBAM format to run GEMMA for GWAS. I tried BEAGLE (both versions 4.1 and 5.1) for imputation, adding the gp=true option to include GP field in the output vcf file, but it does not do what I want. The output still only contains GT filed and no GP field.

Here is my command:

java -Xss5m -Xmx20g -jar beagle.jar gt=input.vcf gp=true out=imputed_example


And a small example of my input.vcf file:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TA_AM_01_01_F3_CC0_M1_1 TA_DE_01_02_F1_HC0_M1_1 TA_DE_01_04_F1_HC0_M1_1
Scaffold_1  2393    .   C   A   90.60   PASS    AC=4;AF=0.010;AN=386;BaseQRankSum=0.00;DP=981;ExcessHet=0.0280;FS=2.077;InbreedingCoeff=0.2413;MQ=22.83;MQRankSum=0.319;QD=6.97;ReadPosRankSum=-6.740e-01;SOR=1.793 GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   0/0:9,0:9:27:0,27,326   0/0:7,0:7:21:0,21,266
Scaffold_1  2413    .   G   T   210.92  PASS    AC=7;AF=0.018;AN=398;BaseQRankSum=0.00;DP=1383;ExcessHet=0.0000;FS=0.000;InbreedingCoeff=0.3551;MQ=22.52;MQRankSum=-1.383e+00;QD=16.22;ReadPosRankSum=-1.383e+00;SOR=0.666  GT:AD:DP:GQ:PL  ./.:0,0:0:.:0,0,0   0/0:10,0:10:30:0,30,391 0/0:7,0:7:21:0,21,292
Scaffold_1  2520    .   A   T   33956.70    PASS    AC=170;AF=0.417;AN=408;BaseQRankSum=0.00;DP=2159;ExcessHet=0.0000;FS=2.541;InbreedingCoeff=0.8530;MQ=24.94;MQRankSum=-1.000e+00;QD=27.82;ReadPosRankSum=0.00;SOR=1.255  GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:0,0:0:.:.:.:0,0,0   1|1:0,10:10:30:1|1:8017321_A_T:408,30,0:8017321 0/0:8,0:8:24:.:.:0,24,308
Scaffold_1  2530    .   C   T   31680.20    PASS    AC=167;AF=0.413;AN=404;BaseQRankSum=0.00;DP=1979;ExcessHet=-0.0000;FS=2.549;InbreedingCoeff=0.8469;MQ=25.10;MQRankSum=-1.280e+00;QD=31.17;ReadPosRankSum=0.319;SOR=1.291    GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:0,0:0:.:.:.:0,0,0   1|1:0,8:8:24:1|1:8017321_A_T:342,24,0:8017321   0/0:8,0:8:24:.:.:0,24,308


Does anyone know how to include GP field in BEAGLE output? Or any other software that can use phasing and LD to produce genotype probabilities or likelihoods for missing genotype calls?

Thanks very much

0
Entering edit mode

Can you provide some lines of the output of the vcf which you thought should have contained the GP values?

0
Entering edit mode

Sure, the output looks like this, with GT, but no GP values:

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TA_AM_01_01_F3_CC0_M1_1 TA_DE_01_02_F1_HC0_M1_1 TA_DE_01_04_F1_HC0_M1_1
Scaffold_1  2393    .   C   A   .   PASS    .   GT  0|0 0|0 0|0
Scaffold_1  2413    .   G   T   .   PASS    .   GT  0|0 0|0 0|0
Scaffold_1  2520    .   A   T   .   PASS    .   GT  0|0 1|1 0|0
Scaffold_1  2530    .   C   T   .   PASS    .   GT  0|0 1|1 0|0

0
Entering edit mode

All of those snps were in the original VCF, so they won't have gps, since only imputed markers would have GPs. can you show some imputed SNPs in the output file?

0
Entering edit mode

Well, all those SNPs are missing in the first individual (TA_AM_01_01_F3_CC0_M1_1) in the input (GT --> "./.") and they are imputed and genotyped in the output (0|0). Or do you mean that GPs are given only for SNP absent in all individuals and present only in the reference panel (which I am not using because I don't have it for my species)?

0
Entering edit mode

I think the latter might be true, but I'll have a test run today on my PC to see if it is.

1
Entering edit mode
5 weeks ago

I might have found a partial solution. Somehow in BEAGLE 4.0 the gprobs=true option (same as gp=true in BEAGLE 5) seems to work (also without reference panel) and the output vcf does contain genotype probabilities (GP) and dosage (DS). Here is an example for the same 4 SNPs in the input example.

#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  TA_AM_01_01_F3_CC0_M1_1 TA_DE_01_02_F1_HC0_M1_1 TA_DE_01_04_F1_HC0_M1_1
Scaffold_1  2393    .   C   A   .   PASS    AR2=0.965;DR2=0.966;AF=0.01 GT:DS:GP    0|0:0.012:0.988,0.012,0 0|0:0:1,0,0 0|0:0:1,0,0
Scaffold_1  2413    .   G   T   .   PASS    AR2=0.998;DR2=0.998;AF=0.022    GT:DS:GP    0|0:0.003:0.997,0.003,0 0|0:0:1,0,0 0|0:0:1,0,0
Scaffold_1  2520    .   A   T   .   PASS    AR2=0.994;DR2=0.996;AF=0.411    GT:DS:GP    1|1:1.438:0.134,0.295,0.572 1|1:2:0,0,1 0|0:0:1,0,0
Scaffold_1  2530    .   C   T   .   PASS    AR2=0.995;DR2=0.996;AF=0.408    GT:DS:GP    1|1:1.436:0.135,0.295,0.57  1|1:2:0,0,1 0|0:0:1,0,0


It is not exactly as I was hoping, because all covered genotypes have GPs rounded to the most likely genotype (eg. 0|0:0:1,0,0), so the PL field of the input files ("Normalized" Phred-scaled likelihoods of the possible genotypes) is not taken into account in the calculations. But at least PLINK 4.0 works at it should. I still don't get why more recent software doesn't do what older software does though.