check rsID of mismatched SNPs by bcftools fixref plugin
1
0
Entering edit mode
4.7 years ago

I used bcftools fixref plugin to match reference allele with GRCh37. As the result shown below, I have got 18901 unresolved SNPs.

1. Is there anyway to see the rsID of these unresolved SNPs?
2. Why are there unresolved SNPs?
3. What can I do to make these SNPs's reference allele match those in GRCh37?

Result of bcftools fixref

Thank you for any feedback.

In case you can't see the picture attached above, please go to https://ibb.co/nf1Sby

bcftools plugin fixref mismatch • 3.0k views
0
Entering edit mode

Have you not seen the blatant warning given in relation this the operation of this command?

[source: https://samtools.github.io/bcftools/howtos/plugin.fixref.html]

Can you elaborate on where you obtained your ensembl.bcf flle or how it was produced? Was it even aligned to hg19 / GRCh37?

0
Entering edit mode

Hi Kevin. Thanks for your comments. I used this unsafe command because when I used the

# Swap the alleles
bcftools +fixref broken.bcf -Ob -o fixref.bcf -- -d -f /path/to/reference.fasta -i All_20151104.vcf.gz


command, the mismatch rate achieved 65%. I think this is because some SNPs in my genotype file are on the reserve strand.

I started to produce the bcf file from PLINK bfiles. I first used PLINK1.9 to recode my bfiles to vcf file, then compressed it to vcf.gz, renamed chromosome 23 -> X etc, and used bcftools sort ensembl.vcf.gz -Ob -o ensembl.bcf and bcftools index ensembl.bcf commands to produce the bcf file and its index. I guess using bcftools +fixref test.bcf -Ob -o output.bcf -- -f ref.fa -m flip -d command can help deal with the SNPs on reserve strand directly.

Update: I used snpflip.py (https://github.com/biocore-ntnu/snpflip) to see the rsID of the unresolved SNPs. It turns out the 18902 SNPs are ambiguous ones and seems it is okay to drop them.

0
Entering edit mode
4.7 years ago

Hey, if you produced the data from PLINK, then it was possibly from a genotyping microarray study, in which case, yes, some of the alleles will be reported on separate strands. For certain protocols, these SNPs are actually removed from analyses, as I allude to here:

Exclude variants not on the coding strand NB This step is only for microarray studies where the probes may only target one strand or the other (sense or non-sense)

[source: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format]

I have to ask: do you definitively have to correct these sites or could you nevertheless proceed with analysis using all sites and just note the issue?

1
Entering edit mode

Hi Kevin. Thanks for your answer. I decide to drop those ambiguous sites and proceed analysis.