Question: Why invariant blocks in GATK consistently have very low quality scores (but not variant sites)
0
gravatar for yixianshuiesuan
5 weeks ago by
yixianshuiesuan0 wrote:

I am using the latest GATK 4.1.2.0 to do variant calling on insect samples with a reference genome of a closely related species. The heterozygosity is approximately 0.02. I followed the standard pipeline of "HaplotypeCaller --> GenomicDBImport --> GenotypeGVCFs" to get my unfiltered VCFs, however, although my variant sites have very nice quality support, all the other invariant regions have been marked as LowQual almost throughout the entire VCF. Here is an example:

HiC_scaffold_10 186     .       A       .       19.18   LowQual DP=25   GT:AD:DP:RGQ    0/0:3,0:3:6     0/0:4,0:4:12    0/0:2,0:2:6
HiC_scaffold_10 187     .       A       .       16.03   LowQual DP=23   GT:AD:DP:RGQ    0/0:3,0:3:0     0/0:4,0:4:12    0/0:2,0:2:6
HiC_scaffold_10 188     .       A       T       70.57   .       AC=1;AF=0.056;AN=18;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=23;Exces
HiC_scaffold_10 189     .       A       AG      60.63   .       AC=1;AF=0.056;AN=18;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=23;Exces
HiC_scaffold_10 190     .       A       .       16.43   LowQual DP=24   GT:AD:DP:RGQ    0/0:3,0:3:0     0/0:4,0:4:12    0/0:2,0:2:6
HiC_scaffold_10 191     .       C       .       16.47   LowQual DP=24   GT:AD:DP:RGQ    0/0:3,0:3:0     0/0:4,0:4:12    0/0:2,0:2:6
HiC_scaffold_10 192     .       C       .       16.18   LowQual DP=24   GT:AD:DP:RGQ    0/0:3,0:3:0     0/0:4,0:4:12    0/0:2,0:2:6
HiC_scaffold_10 193     .       G       A       276.17  .       AC=6;AF=0.429;AN=14;BaseQRankSum=1.38;ClippingRankSum=0.00;DP=19;Exces
HiC_scaffold_10 194     .       C       .       15.80   LowQual DP=26   GT:AD:DP:RGQ    ./.:4,0:4       0/0:4,0:4:12    0/0:2,0:2:3
HiC_scaffold_10 195     .       A       .       15.80   LowQual DP=26   GT:AD:DP:RGQ    ./.:4,0:4       0/0:4,0:4:12    0/0:2,0:2:3
HiC_scaffold_10 196     .       T       .       18.10   LowQual DP=29   GT:AD:DP:RGQ    ./.:4,0:4       0/0:5,0:5:15    0/0:2,0:2:3
HiC_scaffold_10 197     .       C       .       18.10   LowQual DP=29   GT:AD:DP:RGQ    ./.:4,0:4       0/0:5,0:5:15    0/0:2,0:2:3
HiC_scaffold_10 198     .       A       .       18.72   LowQual DP=31   GT:AD:DP:RGQ    ./.:4,0:4       0/0:5,0:5:9     0/0:2,0:2:3
HiC_scaffold_10 199     .       A       T       67.64   .       AC=1;AF=0.063;AN=16;BaseQRankSum=0.00;ClippingRankSum=0.00;DP=28;Exces
HiC_scaffold_10 200     .       A       .       15.95   LowQual DP=29   GT:AD:DP:RGQ    ./.:4,0:4       0/0:5,0:5:9     ./.:0,0:0
HiC_scaffold_10 201     .       A       .       19.76   LowQual DP=30   GT:AD:DP:RGQ    0/0:5,0:5:9     0/0:5,0:5:9     ./.:0,0:0
HiC_scaffold_10 202     .       T       A       66.78   .       AC=1;AF=0.056;AN=18;BaseQRankSum=-5.240e-01;ClippingRankSum=0.00;DP=30
HiC_scaffold_10 203     .       C       .       15.95   LowQual DP=30   GT:AD:DP:RGQ    ./.:6,0:6       0/0:5,0:5:9     ./.:0,0:0
HiC_scaffold_10 204     .       C       T       37.72   .       AC=2;AF=0.125;AN=16;DP=30;ExcessHet=0.1703;FS=0.000;MLEAC=2;MLEAF=0.12
HiC_scaffold_10 205     .       G       .       15.95   LowQual DP=30   GT:AD:DP:RGQ    ./.:6,0:6       0/0:5,0:5:9     ./.:0,0:0
HiC_scaffold_10 206     .       T       C       389.90  .       AC=7;AF=0.500;AN=14;BaseQRankSum=1.64;ClippingRankSum=0.00;DP=26;Exces
HiC_scaffold_10 207     .       T       .       15.95   LowQual DP=27   GT:AD:DP:RGQ    ./.:6,0:6       0/0:3,0:3:9     ./.:0,0:0
HiC_scaffold_10 208     .       G       T       64.64   .       AC=1;AF=0.063;AN=16;BaseQRankSum=-5.240e-01;ClippingRankSum=0.00;DP=28

As you see, even invariant sites embedded between two high-quality variants with large coverage have been assigned a low quality score. The corresponding raw GVCF files also showed the same pattern, i.e. deep coverage but small genotype quality scores (GQ) in the invariant regions.

Then I adjust my HaplotyeCaller option from

-ERC GVCF

to

-ERC BP_RESOLUTION

After that, I got my raw GVCF back, which on the contrary has high genotype quality scores where the coverage is high. For comparison, I extracted position 271 to 284 on scaffold 10 from both types of GVCFs in one sample. Here is the one associated with the "-ERC GVCF" option:

HiC_scaffold_10 271     .       T       <NON_REF>       .       .       END=284 GT:DP:GQ:MIN_DP:PL      0/0:21:0:20:0,0,0

Here is the one associated with the "-ERC BP_RESOLUTION" option:

HiC_scaffold_10 271     .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:15,0:15:45:0,45,490
HiC_scaffold_10 272     .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:15,0:15:45:0,45,490
HiC_scaffold_10 273     .       C       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:15,0:15:45:0,45,478
HiC_scaffold_10 274     .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:15,0:15:45:0,45,490
HiC_scaffold_10 275     .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:16,0:16:48:0,48,486
HiC_scaffold_10 276     .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:16,0:16:48:0,48,498
HiC_scaffold_10 277     .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:16,0:16:48:0,48,472
HiC_scaffold_10 278     .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:17,0:17:51:0,51,496
HiC_scaffold_10 279     .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:17,0:17:51:0,51,506
HiC_scaffold_10 280     .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:17,0:17:51:0,51,494
HiC_scaffold_10 281     .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:17,0:17:51:0,51,506
HiC_scaffold_10 282     .       G       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:18,0:18:54:0,54,530
HiC_scaffold_10 283     .       A       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:18,0:18:54:0,54,530
HiC_scaffold_10 284     .       T       <NON_REF>       .       .       .       GT:AD:DP:GQ:PL  0/0:21,0:21:63:0,63,617

I don't understand why their genotype quality score (GQ) looks so different from each other? Isn't GVCF supposed to be a reduced representation of the basepair-wise file?

snp gatk variant calling • 84 views
ADD COMMENTlink written 5 weeks ago by yixianshuiesuan0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 552 users visited in the last hour