Question: Recoding bim to vcf file with forcing allele-A1
gravatar for marcus.tutert
7 months ago by
marcus.tutert0 wrote:


I am currently trying to format some of my bim files I have to run imputation on the sanger imputation server. For this purpose I need to convert all my SNP to forward strand and to be matched with ref variant on grch37.

I used the tool by the Mccarthy group online (HRC Imputation Preparation and Checking) which gave me a list of plink commands to run on my original files to get them in the right format. I did that and I got the updated versions of the bfiles as I needed. I looked at the files to make sure that things were changed as they should have been and they were. However, when I make the change from the bfile to the vcf for some reason my ref and alt alleles keep getting switched around and I have no clue why/how to make it stay the way I want it.

For instance, when I run the imputation on my new vcf I get this error:

Reference allele mismatch at 1:1118275 .. REF_SEQ:'C' vs VCF:'T'

When I check my updated bfiles, I have the following:

1       rs61733845      0       1118275 C       T

Furthermore, when I check which plink commands the Mccarthy tool suggested me to use on my original bfiles to do the ref/alt allele switching this (61733845) was one of the rsID's that it said needed to be flipped which indicates to me that the bfiles are fine but I am getting screwed over on the conversion to the vcf.

Even which I have tried to use plinks command to force the alleles to be as given by a reference text file when I am doing the recoding (like the one that is outputting by Mccarthys tools) with the commands (tried all of these(


--a1-allele [filename] {A1 allele col. number} {variant ID col.} {skip}
  (aliases: --reference-allele, --update-ref-allele)

--a2-allele [filename] {A2 allele col. number} {variant ID col.} {skip

I am still having no luck and I am still getting this reference allele mismatch at this position.

Any help would be greatly appreciated!!

plink genetics imputation • 482 views
ADD COMMENTlink modified 7 months ago by Kevin Blighe41k • written 7 months ago by marcus.tutert0
gravatar for Kevin Blighe
7 months ago by
Kevin Blighe41k
Guy's Hospital, London
Kevin Blighe41k wrote:

If your data is sourced from a genotyping array, then variants will be reported on both the forward and negative strand. You can look to the white [tech] papers for the array on the manufacturer's website in order to infer which genotypes these may be.

If your data is NGS-derived, then you'll have a nasty mixture of variants that do and don't match the reference genome, depending on which reference genome was used during alignment and variant calling that produced the VCF in the first place.

Note that, given a VCF, you can simply remove any variants not matching the reference with:

bcftools norm -m-any MyVariants.vcf | \
bcftools norm --check-ref x --fasta-ref human_g1k_v37.fasta -Ov > MyVariants.refchecked.vcf

The first part before the pipe splits multi-allelic sites. The second part then checks each REF allele in your VCF. If it does not match the reference genome, human_g1k_v37.fasta, then the variant site is instructed to be removed from the VCF with --check-ref x

There are other options for --check-ref, such as:

  • s, change the REF base to match reference genome
  • w, simply issue a warning
  • e, call error and exit

Take a look at Step 3, here ( Produce PCA bi-plot for 1000 Genomes Phase III - Version 2 ), to see how you can download human_g1k_v37.fasta.


NB - there's another guy/gal suffering from this issue, here: Help with Sanger Imputation Service

ADD COMMENTlink written 7 months ago by Kevin Blighe41k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2368 users visited in the last hour