Question

Recoding bim to vcf file with forcing allele-A1

0

Entering edit mode

5.7 years ago

marcus.tutert • 0

Hi,

I am currently trying to format some of my bim files I have to run imputation on the sanger imputation server. For this purpose I need to convert all my SNP to forward strand and to be matched with ref variant on grch37.

I used the tool by the Mccarthy group online (HRC Imputation Preparation and Checking) which gave me a list of plink commands to run on my original files to get them in the right format. I did that and I got the updated versions of the bfiles as I needed. I looked at the files to make sure that things were changed as they should have been and they were. However, when I make the change from the bfile to the vcf for some reason my ref and alt alleles keep getting switched around and I have no clue why/how to make it stay the way I want it.

For instance, when I run the imputation on my new vcf I get this error:

Reference allele mismatch at 1:1118275 .. REF_SEQ:'C' vs VCF:'T'

When I check my updated bfiles, I have the following:

1       rs61733845      0       1118275 C       T

Furthermore, when I check which plink commands the Mccarthy tool suggested me to use on my original bfiles to do the ref/alt allele switching this (61733845) was one of the rsID's that it said needed to be flipped which indicates to me that the bfiles are fine but I am getting screwed over on the conversion to the vcf.

Even which I have tried to use plinks command to force the alleles to be as given by a reference text file when I am doing the recoding (like the one that is outputting by Mccarthys tools) with the commands (tried all of these(

--keep-allele-order
--real-ref-alleles

--a1-allele [filename] {A1 allele col. number} {variant ID col.} {skip}
  (aliases: --reference-allele, --update-ref-allele)

--a2-allele [filename] {A2 allele col. number} {variant ID col.} {skip

I am still having no luck and I am still getting this reference allele mismatch at this position.

Any help would be greatly appreciated!!

imputation plink genetics • 3.4k views

ADD COMMENT • link updated 5.6 years ago by Kevin Blighe 87k • written 5.7 years ago by marcus.tutert • 0

0

Entering edit mode

Hi, I got the same problem. That is because when you were converting plink to vcf, probably using plink --recode vcf, "The A2 allele is saved as the reference and normally flagged as not based on a real reference genome ('PR' INFO field value)." See plink explanation here. That's saying the --recode vcf command switched the alleles in your updated bim file, where A1 is set as ref allele.

The solution is:

plink --bfile your_data --recode vcf --a2-allele Force-Allele1-DATA.QC-1000G.txt

Force-Allele1-DATA.QC-1000G.txt is the file that saves RsID and ref_allele (which can be generated by HRC checking tool)

ADD REPLY • link 4.9 years ago by Tao ▴ 530

score 0 · Answer 1 · 2018-09-09

If your data is sourced from a genotyping array, then variants will be reported on both the forward and negative strand. You can look to the white [tech] papers for the array on the manufacturer's website in order to infer which genotypes these may be.

If your data is NGS-derived, then you'll have a nasty mixture of variants that do and don't match the reference genome, depending on which reference genome was used during alignment and variant calling that produced the VCF in the first place.

Note that, given a VCF, you can simply remove any variants not matching the reference with:

bcftools norm -m-any MyVariants.vcf | \
bcftools norm --check-ref x --fasta-ref human_g1k_v37.fasta -Ov > MyVariants.refchecked.vcf

The first part before the pipe splits multi-allelic sites. The second part then checks each REF allele in your VCF. If it does not match the reference genome, human_g1k_v37.fasta, then the variant site is instructed to be removed from the VCF with --check-ref x

There are other options for --check-ref, such as:

s, change the REF base to match reference genome
w, simply issue a warning
e, call error and exit

Kevin

NB - there's another person suffering from this issue, here: Help with Sanger Imputation Service