Is this GT a mistake?
2
2
Entering edit mode
3.4 years ago
aostern ▴ 30

I started with a VCF that had this variant:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  [SAMPLE]
12      7045891 .       ACAGCAGCAGCAGCAGCAGCAG  ACAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG,ACAGCAG,ACAGCAGCAGCAGCAGCAG,ACAGCAGCAGCAGCAGCAGCAGCAG,A,ACAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAGCAG    7038.49 PASS    AC=0,1,0,0,0,0;AF=0,0.5,0,0,0,0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS   GT:AD:DP:GQ:PL  0/2:24,0,23,0,0,0,0:47:99:885,957,3825,0,2868,2797,957,3825,2868,3825,957,3825,2868,3825,3825,957,3825,2868,3825,3825,3825,957,3825,2868,3825,3825,3825,3825


I ran it through a pipeline (unfortunately, not publicly available) that's supposed to normalize variants. It came out as:

12      7045891 .       A       ACAGCAGCAGCAGCAG        7038.49 PASS    AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL  0/0:24,0,23,0,0,0,0:47:99:885,957,3825


The genotyope's gone from 0/2 to 0/0, which I think would mean a change from a deletion to a normal site. Is this variant being recalled? Or am I misinterpreting?

Update: So I looked through the source code, and it looks like the genotype is getting changed by this command:

bcftools norm --multiallelics - --check-ref x --fasta-ref=${fastaRef}${vcf} --output-type z --output ${o}  which splits the original variant into all of the following 12 7045891 . A ACAGCAGCAGCAGCAG 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825 12 7045891 . ACAGCAGCAGCAGCAG A 7038.49 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/1:24,0,23,0,0,0,0:47:99:885,0,2797 12 7045891 . ACAG A 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825 12 7045891 . A ACAG 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825 12 7045891 . ACAGCAGCAGCAGCAGCAGCAG A 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825 12 7045891 . A ACAGCAGCAGCAGCAGCAGCAG 7038.49 PASS AC=0;AF=0;AN=2;BaseQRankSum=0.113;ClippingRankSum=-0.367;DP=47;FS=2.374;GQ_MEAN=312.8;GQ_STDDEV=257.63;InbreedingCoeff=0.1964;MQ=60.51;MQ0=0;MQRankSum=0.286;NCC=0;POSITIVE_TRAIN_SITE;QD=8.44;ReadPosRankSum=0.946;SOR=0.893;VQSLOD=1.77;culprit=FS GT:AD:DP:GQ:PL 0/0:24,0,23,0,0,0,0:47:99:885,957,3825  (Aha! The second variant in the list, with a genoytpe 0/1, is what I would expect.) But after this command bcftools norm -d both${vcf} --output-type z --output \${o}


the biallelic variants are winnowed down to the single homozygous ref insertion (pasted in the second code block). (The bcftools manual page says "-d: If a record is present multiple times, output only the first instance") Why is the 0/1 variant being thrown out? Could this possibly be purposeful?

Is this in some way standard practice? Assuming the goal of the pipeline is to normalize variants and split multiallelic ones, is this a mistake? Is this a sensical way of recasting one variant as another?

alignment vcf • 1.1k views
ADD COMMENT
1
Entering edit mode
3.4 years ago
kirannbishwa01 ★ 1.3k

Well, whatever pipeline you used must be filtering the genotypes. If you look at the AD tag of FORMAT field you have 24 reads supporting allele 0 - reference allele and 23 reads supporting allele number 2 - the second alternate allele. So, the pipeline you used must be retesting this alternate allele (with or without bam files) to see if the GT 0/2 is a confident one. Looks like this allele failed to pass and it was updated as reference genotype 0/0.

Also, there is only one sample in your input VCFs so other non required data also seemed to have trimmed off.

I think you need to check into the filtering parameters of your pipeline, and how Reference vs. Alternate allele-genotypes are tested. I am inclined to say that this behaviour should be normal - but check the details to make sure.

ADD COMMENT
0
Entering edit mode

Thank you so much for your response. Does my update shine any more light on the situation?

ADD REPLY
0
Entering edit mode

Read the bcftools documentation, and the corresponding output for the given sub commands. All you have to do is read into the mechanistic details of what the tools does for the given subcommand.

ADD REPLY
0
Entering edit mode

I've read the documentation—I now understand why each command has the effect it does. But my question stands: is this in some way standard practice? Assuming the goal of the pipeline is to normalize variants and split multiallelic ones, is this a mistake?

ADD REPLY
0
Entering edit mode
3.4 years ago
kirannbishwa01 ★ 1.3k

Standard practice differs depending upon what is the purpose of your data analyses: - Are you doing hard filtering? - What is the threshold while doing hardfiltering? - Are you will to assign the variants as reference if the confidence for alternate is low. Vs. Are you willing to leave the low confident variants as ./. - no variant When the GT at the site is of low confidence.

So, I cannot tell if this is a mistake without really getting more details. But, as explained earlier I am inclined to say it may be normal. But, what you are looking for is inside the mechanistic details of how the filtration of the GT is done by your pipeline. So, please check to make sure.

ADD COMMENT

Login before adding your answer.

Traffic: 1826 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6