GATK SelectVariants --remove-unused-alternates dropping real INDELs?
Entering edit mode
8 months ago

I'm using a VCF that is generated by GenotypeGVCFs (so doing calibration based on a larger cohort of samples) and my goal is to only extract variants of interest to one specific sample.

The VCF in the subset tends to include some variants that were present in the original joint VCF (i.e. variants that are present in a different sample than the one I want to isolate).

I did this with bcftools initially but it seems to not recalculate the genotype AF (essentially it retains the mean AF calculated based the genotypes across samples):

Subset using bcftools:

bcftools view -c1 -s sampleA -o bcftools_test.vcf input.vcf
chr1       6029086               rs1243967206;rs34702999;rs869154835;rs869287558;rs59913966            CAA        CA,C,CAAA                3633.93                .               AC=1,0,0;AF=0.458,0.083,0.083;AN=2;AS_BaseQRankSum=-0.4,-0.35,-0.5;AS_FS=9.841,4.441,0;AS_InbreedingCoeff=-0.8462,-0.0914,-0.0909;AS_MQ=69.6,68.94,70;AS_MQRankSum=0,0,0;AS_QD=10.63,0.82,1.73;AS_ReadPosRankSum=-0.2,-0.7,0.3;AS_SOR=1.572,1.465,0.605;BaseQRankSum=-0.378;DB;DP=391;ExcessHet=10.6475;FS=9.123;InbreedingCoeff=-0.6;MLEAC=11,2,2;MLEAF=0.458,0.083,0.083;MQ=69.7;MQRankSum=0;QD=12.66;ReadPosRankSum=-0.192;SOR=1.505              GT:AD:DP:GQ:PL               0/1:7,14,3,0:24:82:351,0,82,283,82,561,372,152,511,572

I can also do this with GATK using Select Variants:

gatk SelectVariants --output gatk_test.vcf --sample-name sampleA --variant input.vcf --reference genome.fasta
chr1       6029086               rs1243967206;rs34702999;rs869154835;rs869287558;rs59913966            CAA        CA,C,CAAA                3633.93                .               AC=1,0,0;AF=0.500,0.00,0.00;AN=2;AS_BaseQRankSum=-0.400,-0.350,-0.500;AS_FS=9.841,4.441,0.000;AS_InbreedingCoeff=-0.8462,-0.0914,-0.0909;AS_MQ=69.60,68.94,70.00;AS_MQRankSum=0.000,0.000,0.000;AS_QD=10.63,0.82,1.73;AS_ReadPosRankSum=-0.200,-0.700,0.300;AS_SOR=1.572,1.465,0.605;BaseQRankSum=-3.780e-01;DB;DP=24;ExcessHet=10.6475;FS=9.123;InbreedingCoeff=-0.6000;MQ=69.70;MQRankSum=0.00;QD=12.66;ReadPosRankSum=-1.920e-01;SOR=1.505            GT:AD:DP:GQ:PL                0/1:7,14,3,0:24:82:351,0,82,283,82,561,372,152,511,57

To be safe and after searching the documentation I wanted to make sure that any variants that weren't specific to this sample were also removed and attempted to use the --remove-unused-alternates param.

gatk SelectVariants --output gatk_test2.vcf --sample-name sampleA --remove-unused-alternates true --variant input.vcf --reference genome.fasta

However the variant above is now gone from the output. I compared gatk_test.vcf and gatk_test2.vcf and a number of variants were removed. From what I can see all of them are INDELs and include multi-allelic variants where the ALT SNV at the location has a recalculated 0.00 VAF.

Am I missing something about the functionality of --remove-unused-alternates my expectation would be that it would solely remove the ALT alleles at 0.00 but retain those at >=0.5. So is this expected behavior? Or maybe it's looking at some other quality metrics to make this decision?

For example, for the VCF entry above I would assume it would drop C,CAAA and keep CA as an ALT.

I'm using GATK v4.3 in case this is just a bug that was later fixed but in this situation I can't easily update to the newest version.

GATK • 349 views

Login before adding your answer.

Traffic: 1903 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6