GATK GenotypeGVCFs -all-sites
2.1 years ago
Simo ▴ 50

Hi, I'm working with GATK/4.1.2.0 on human whole-genome data.

I'm currently following the procedure to go from a gVCF to a VCF (the gVCF was obtained with HaplotypeCaller using -ERC GVCF).

The order of the tools I'm following is: GenotypeGVCFs -> VariantFiltration -> MakeSitesOnlyVcf -> VariantRecalibrator -> ApplyVQSR

Since I also need to include all the loci found to be non-variant after genotyping, I'm using the "-all-sites true" option in GenotypeGVCFs.

In the VCF I obtain from GenotypeGVCFs, the majority of the 0/0 sites have only DP in the INFO field and lack all the other annotations that VariantRecalibrator will need in a later step (e.g., QD, FS, SOR, MQ, MQRankSum, ReadPosRankSum, and InbreedingCoeff).

Is there any way to get that information for all the sites?

And if not, will DP alone be enough for VariantRecalibrator to work on them?

For example, if I have these two sites in the VCF after GenotypeGVCFs:

chr1    10436   .       C       .       87.81   .       DP=55   GT:AD:DP:RGQ    0/0:55,0:55:51

chr1    13868    .    A    G    122.60    .    AC=1;AF=0.500;AN=2;BaseQRankSum=-2.950e-01;ClippingRankSum=-7.660e-01;DP=15;ExcessHet=3.0103;FS=15.564;MLEAC=1;MLEAF=0.500;MQ=32.73;MQRankSum=-2.534e+00;QD=8.17;RAW_MQ=16069.00;ReadPosRankSum=0.412;SOR=3.898    GT:AD:DP:GQ:PL    0/1:9,6:15:99:130,0,248

Does VariantRecalibrator need them to have the same INFO annotations, or will it work properly in any case, even if the first site has only DP and the second one has many other annotations?

I need the final VCF to include all the sites (0/0, 0/1, and 1/1). So far, everything I've tried has ended up removing all the 0/0 sites.

Could someone please help me with this?

Thank you


Is there any way to get that information for all the sites?

No, those tags require the presence of an ALT allele. For example, MQRankSum:

Rank Sum Test for mapping qualities of REF versus ALT reads
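To see which of the annotations named in the question are actually present on each record, here is a minimal Python sketch. It only parses the INFO column of the two example records from the question. The function names `parse_info` and `missing_annotations` are made up for illustration, and the whitespace split is an assumption that fits the records as pasted here (a real VCF is tab-delimited).

```python
# Annotations listed in the question as inputs for VariantRecalibrator.
REQUIRED = ["QD", "FS", "SOR", "MQ", "MQRankSum", "ReadPosRankSum", "InbreedingCoeff"]


def parse_info(info_field):
    """Turn a VCF INFO string into a dict; flag-style keys map to True."""
    info = {}
    if info_field == ".":
        return info
    for item in info_field.split(";"):
        key, sep, value = item.partition("=")
        info[key] = value if sep else True
    return info


def missing_annotations(vcf_line, required=REQUIRED):
    """Return the required INFO keys that are absent from one VCF record."""
    # Assumption: fields contain no internal spaces, so a whitespace split
    # works for the records pasted in this thread.
    fields = vcf_line.split()
    info = parse_info(fields[7])  # INFO is the 8th column
    return [key for key in required if key not in info]


# The two example records from the question.
hom_ref = "chr1 10436 . C . 87.81 . DP=55 GT:AD:DP:RGQ 0/0:55,0:55:51"
het = (
    "chr1 13868 . A G 122.60 . "
    "AC=1;AF=0.500;AN=2;BaseQRankSum=-2.950e-01;ClippingRankSum=-7.660e-01;DP=15;"
    "ExcessHet=3.0103;FS=15.564;MLEAC=1;MLEAF=0.500;MQ=32.73;MQRankSum=-2.534e+00;"
    "QD=8.17;RAW_MQ=16069.00;ReadPosRankSum=0.412;SOR=3.898 "
    "GT:AD:DP:GQ:PL 0/1:9,6:15:99:130,0,248"
)

print(missing_annotations(hom_ref))
print(missing_annotations(het))
```

On these two records, the 0/0 site lacks every listed annotation, while the 0/1 site lacks only InbreedingCoeff.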


Thank you Pierre, it makes total sense.

So, should I expect VariantRecalibrator to have problems with 0/0 sites, or will they be retained in the final VCF?


I don't know. Just test it. If the 0/0 sites are removed, you can always merge a non-variants.vcf with the recalibrated.vcf.
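The split-and-merge idea can be sketched in plain Python as below. This is only an illustration under stated assumptions: records are tab-delimited, non-variant records have "." in the ALT column (as in the examples in this thread), both inputs are already position-sorted, and the contig order is supplied by the caller. The helper names `split_records` and `merge_records` are hypothetical. For real data, a dedicated VCF tool that also handles headers, compression, and indexing would be the safer choice.

```python
import heapq


def is_non_variant(vcf_line):
    """A record whose ALT column is '.' carries no alternate allele."""
    return vcf_line.split("\t")[4] == "."


def split_records(lines):
    """Separate header lines, variant records, and non-variant records."""
    header, variants, non_variants = [], [], []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
        elif is_non_variant(line):
            non_variants.append(line)
        else:
            variants.append(line)
    return header, variants, non_variants


def merge_records(recalibrated, non_variants, contigs):
    """Merge two position-sorted record lists into one sorted list.

    `contigs` gives the contig order (e.g. taken from the reference).
    """
    contig_order = {name: i for i, name in enumerate(contigs)}

    def sort_key(vcf_line):
        chrom, pos = vcf_line.split("\t")[:2]
        return (contig_order[chrom], int(pos))

    return list(heapq.merge(recalibrated, non_variants, key=sort_key))


# Toy input: three records shortened from the examples in this thread.
lines = [
    "##fileformat=VCFv4.2",
    "chr1\t10436\t.\tC\t.\t87.81\t.\tDP=55",
    "chr1\t13868\t.\tA\tG\t122.60\t.\tDP=15",
    "chr1\t16682\t.\tG\t.\t69.81\t.\tDP=30",
]
header, variants, non_variants = split_records(lines)
# `variants` stands in here for the records after recalibration.
merged = merge_records(variants, non_variants, contigs=["chr1"])
```

Here `variants` is what would go through VariantRecalibrator and ApplyVQSR, and `non_variants` is set aside and merged back afterwards, so the 0/0 sites never depend on how the recalibration step treats them.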


Thank you very much, I'm running a test right now. According to the pipeline I'm following, the non-variants.vcf should be the one produced by the VariantFiltration step; after that step there is a high chance of losing the 0/0 sites. My only concern is the quality of those sites, though.


Hi, just an update regarding what you said:

No, those tags require the presence of an ALT allele. For example, MQRankSum:

Rank Sum Test for mapping qualities of REF versus ALT reads

I have cases like this one, where no ALT allele is called but MQRankSum is reported anyway, along with other statistics:

chr1    16682   .   G   .   69.81   .   BaseQRankSum=0.559;ClippingRankSum=-1.162e+00;DP=30;ExcessHet=3.01;MQ=151.93;MQRankSum=-8.180e-01;RAW_MQ=23082.00;ReadPosRankSum=0.215  GT:DP:RGQ   0/0:29:33

I think that even if a site is eventually called 0/0, that doesn't mean no ALT reads are present, which is why MQRankSum can still be calculated.

So possibly, when statistics like this one are not reported, only REF reads are present at that site, and that's why only DP appears in the INFO column. On the other hand, when some ALT reads are present, the other INFO statistics can also be calculated.

Of course, this is the explanation I find most logical, but it may be too simplistic.
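A quick way to see how common each kind of 0/0 record is would be to tally the non-variant records by whether any rank-sum annotation is present. The Python sketch below does only that counting. It does not confirm the explanation above, since that would require looking at the reads themselves. The category labels and the function names `classify` and `tally` are made up for illustration, and the whitespace split is an assumption that fits the records as pasted in this thread.

```python
from collections import Counter

# Rank-sum annotations that appear in the example records in this thread.
RANK_SUM_KEYS = {"BaseQRankSum", "ClippingRankSum", "MQRankSum", "ReadPosRankSum"}


def info_keys(info_field):
    """Return the set of INFO keys in a VCF INFO string."""
    if info_field == ".":
        return set()
    return {item.split("=", 1)[0] for item in info_field.split(";")}


def classify(vcf_line):
    """Label a record as variant, or as 0/0 with or without rank sums."""
    fields = vcf_line.split()  # assumption: no spaces inside any field
    if fields[4] != ".":  # ALT column holds an alternate allele
        return "variant"
    if info_keys(fields[7]) & RANK_SUM_KEYS:
        return "non_variant_with_rank_sums"
    return "non_variant_without_rank_sums"


def tally(lines):
    """Count records per category, skipping header lines."""
    return Counter(classify(line) for line in lines if not line.startswith("#"))


records = [
    "chr1 10436 . C . 87.81 . DP=55 GT:AD:DP:RGQ 0/0:55,0:55:51",
    "chr1 13868 . A G 122.60 . DP=15;MQRankSum=-2.534e+00 GT:AD:DP:GQ:PL 0/1:9,6:15:99:130,0,248",
    "chr1 16682 . G . 69.81 . BaseQRankSum=0.559;ClippingRankSum=-1.162e+00;DP=30;"
    "ExcessHet=3.01;MQ=151.93;MQRankSum=-8.180e-01;RAW_MQ=23082.00;"
    "ReadPosRankSum=0.215 GT:DP:RGQ 0/0:29:33",
]
counts = tally(records)
```

With the three records from this thread (the 0/1 record shortened), each category is counted once.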
