Mismatched REF alleles when annotating a VCF file
10 weeks ago
nanodano ▴ 10

I'm trying to annotate the IDs in a VCF of WG data using another VCF of genotype data. Both are aligned to the same reference, hg19. One thing to note is that the REF and ALT alleles disagree between the two files at some positions, and I don't know why. See here:

POphased chr22 (genotype data VCF)

22      17075353        rs5747999       A       C       .       PASS    .       GT      0|1     1|1     1|0     0|1     1|1
22      17203103        rs2845380       A       G       .       PASS    .       GT      1|1     1|1     1|1     1|1     1|1
22      17282666        rs5994022       G       A       .       PASS    .       GT      1|1     1|1     1|1     1|1     1|1


Peek at AGR chr22 (WG data VCF)

22      17075353        .       C       A       .       .       .       GT      0|1     0|0     1|1     1|0     1|1
22      17203103        .       A       G       .       .       .       GT      1|1     1|1     1|1     1|1     1|1
22      17282666        .       G       A       .       .       .       GT      1|1     0|1     1|1     1|1     1|1


I used the following bcftools command to annotate overlapping positions across the data:

bcftools annotate -c ID -a SA_POtest.recode.vcf.gz -o annot.vcf AGR_test.recode.vcf.gz


This seemed to work, but only for SNPs that have the same REF allele in both files (i.e. no mismatch), which is a very small subset of the SNPs available in the genotype data (1126 / 21640). Looking at the same positions in the new file, you find the following pattern: rsIDs are present where the REF/ALT alleles match and missing where they are mismatched.

Peek at annot.vcf

22  17075353    .   C   A   .   .   .   GT  0|1 0|0 1|1 1|0 1|1
22  17203103    rs2845380   A   G   .   .   .   GT  1|1 1|1 1|1 1|1 1|1
22  17282666    rs5994022   G   A   .   .   .   GT  1|1 0|1 1|1 1|1 1|1


How can I fix the mismatched REF/ALT alleles?

rsID annotate vcf vcftools bcftools • 243 views
10 weeks ago

How would you explain that the same position lists a different reference allele in each file?

The reference at 17075353 is either A or C; you can't have both.

Why would you want it annotated with that rsID if it does not match the rsID?
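One way to settle which file carries the true reference allele at a site such as 17075353 is to look up the base in the hg19 sequence itself. The following is a minimal sketch, not a definitive tool: `check_ref` is a hypothetical helper name, and it assumes an uncompressed FASTA holding a single sequence (chr22 only) with a fixed line width, plus an uncompressed VCF. It only checks single-base REF alleles.

```shell
#!/usr/bin/env bash
# check_ref REF.fa FILE.vcf   (hypothetical helper, for illustration only)
# Prints CHROM, POS, VCF REF, FASTA base and "ok"/"mismatch" for each SNP.
# Assumes REF.fa contains exactly one sequence with fixed-width lines.
check_ref() {
    awk -v OFS='\t' '
        NR == FNR {                          # first file: the FASTA
            if ($0 ~ /^>/) next              # skip the sequence header
            if (!width) width = length($0)   # line width taken from first sequence line
            line[++n] = toupper($0)
            next
        }
        /^#/ || length($4) != 1 { next }     # skip VCF headers and non-SNP REF alleles
        {
            row  = int(($2 - 1) / width) + 1 # FASTA line holding this 1-based position
            col  = ($2 - 1) % width + 1      # offset within that line
            base = substr(line[row], col, 1)
            print $1, $2, $4, base, (base == toupper($4) ? "ok" : "mismatch")
        }
    ' "$1" "$2"
}

# Example call (chr22.fa is an assumed file name):
#   check_ref chr22.fa <(gzip -dc AGR_test.recode.vcf.gz)
```

If bcftools is available, `bcftools norm --check-ref w -f <reference.fa>` reports REF mismatches in a similar way and is probably the more robust route; the sketch above just makes the logic explicit.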


I suspect they (REF/ALT) got switched somewhere during a Snakemake phasing pipeline or during file format conversion. Both data sets were aligned to the same reference, which, to my understanding, means they shouldn't have such large discrepancies.
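The suspicion that REF and ALT were switched can be tested directly by comparing the two files position by position. Below is a rough sketch under stated assumptions: `classify_alleles` is a hypothetical helper name, both inputs are uncompressed VCFs, and sites are matched on CHROM and POS only. Each shared site is labelled `match`, `swapped` (REF/ALT exchanged) or `other`.

```shell
#!/usr/bin/env bash
# classify_alleles ANNOT.vcf TARGET.vcf   (hypothetical helper, for illustration only)
# For every CHROM:POS present in both files, print the ID from the first file,
# the REF/ALT pair from each file, and how the two pairs relate.
classify_alleles() {
    awk -v OFS='\t' '
        /^#/ { next }                        # skip header lines in both files
        NR == FNR {                          # first file: remember ID, REF, ALT per site
            key = $1 ":" $2
            id[key] = $3; ref[key] = $4; alt[key] = $5
            next
        }
        {
            key = $1 ":" $2
            if (!(key in ref)) next          # site absent from the first file
            if (ref[key] == $4 && alt[key] == $5)      status = "match"
            else if (ref[key] == $5 && alt[key] == $4) status = "swapped"
            else                                       status = "other"
            print $1, $2, id[key], ref[key] "/" alt[key], $4 "/" $5, status
        }
    ' "$1" "$2"
}

# Example call with the files from the question:
#   classify_alleles <(gzip -dc SA_POtest.recode.vcf.gz) <(gzip -dc AGR_test.recode.vcf.gz) \
#       | cut -f6 | sort | uniq -c
```

If most of the unannotated sites come out as `swapped`, that would support the switched-allele explanation. Note that a swapped site also has its genotypes coded against the opposite allele, so copying the rsID alone would not make the two files comparable.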