Correct way to convert plink files to vcf assigning the correct reference allele
3.7 years ago
irieljoerin ▴ 40

Hi all, I am trying to perform ancestry analysis on SNP microarray data from admixed populations, and I need to convert plink format files (bed or ped, I have both) to vcf format. I was wondering which is the correct way to do this conversion so that the final vcf keeps the correct reference allele. I have read some posts, but they seem to be pretty old. I have also read the plink website looking for clues, but I'm not sure how to proceed. Could anybody give me some guidance or advice on how to achieve this? I would be very grateful! Thanks in advance.

SNP plink vcf • 4.1k views
3.7 years ago

You need a source of correct reference alleles: either a .fa file for the same reference genome, or another VCF with the reference alleles you want. With a .fa file, you can then use plink 2.0's --ref-from-fa flag; with a VCF, use --ref-allele (documented at the same link).
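As a rough sketch of the two routes above (not a tested recipe for this dataset), the commands might look like the following. The function names, the `_ref` output suffix, and the file names in the usage note are hypothetical. The FASTA route assumes this plink2 build accepts `--ref-from-fa` together with `--export vcf` in one run; if it does not, write an intermediate fileset first and export from that. The VCF route assumes an uncompressed VCF whose ID column matches the variant IDs in the .bim; the trailing `4 3 '#'` (REF column, ID column, skip header lines starting with '#') follows the pattern I remember from the plink documentation and should be checked against the current docs.

```shell
# Route 1: take REF alleles from a FASTA, then write a VCF.
# Assumes plink2 is on PATH and the FASTA is the same genome build as the .bim.
bed_to_vcf_from_fa() {
  prefix=$1   # plink fileset prefix (prefix.bed / .bim / .fam)
  fasta=$2    # reference genome FASTA
  plink2 --bfile "$prefix" --fa "$fasta" --ref-from-fa --export vcf --out "${prefix}_ref"
}

# Route 2: take REF alleles from another VCF.
# Assumption: column 4 = REF, column 3 = variant ID, '#' lines are skipped.
bed_to_vcf_from_vcf() {
  prefix=$1   # plink fileset prefix
  refvcf=$2   # uncompressed VCF carrying the REF alleles you want
  plink2 --bfile "$prefix" --ref-allele "$refvcf" 4 3 '#' --export vcf --out "${prefix}_ref"
}
```

For example, `bed_to_vcf_from_fa mydata reference.fa` would be expected to write `mydata_ref.vcf` plus a log file.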

Note that if you care about REF/ALT allele order, you should use plink 2.0 instead of 1.x whenever possible, since plink 1.x switches allele order to major/minor unless explicitly told otherwise.


Hi chrchang523, thank you for your answer. I ran:

./plink2 --bfile cromXplus_int2 --ref-from-fa human_g1k_v37_decoy.fasta --make-bed --out cromXplus_int2_fa

and it printed the following:

PLINK v2.00a3LM 64-bit Intel (27 Jul 2020)  www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang  GNU General Public License v3
Logging to cromXplus_int2_fa.log.
Options in effect:
  --bfile cromXplus_int2
  --make-bed
  --out cromXplus_int2_fa
  --ref-from-fa human_g1k_v37_decoy.fasta

Start time: Tue Aug 25 16:06:28 2020
Warning: Filename-argument form of --ref-from-fa is deprecated. Use --fa to specify the .fa file instead.
7861 MiB RAM detected; reserving 3930 MiB for main workspace.
Using up to 4 compute threads.
479 samples (239 females, 216 males, 24 ambiguous; 479 founders) loaded from cromXplus_int2.fam.
2804 variants loaded from cromXplus_int2.bim.
Note: No phenotype data present.
--ref-from-fa: 629 variants changed, 2139 validated.
Writing cromXplus_int2_fa.fam ... done.
Writing cromXplus_int2_fa.bim ... done.
Writing cromXplus_int2_fa.bed ... done.
End time: Tue Aug 25 16:07:09 2020

How should I interpret this result: "629 variants changed, 2139 validated"? What happened to the remaining 36 variants? And what does the warning "Filename-argument form of --ref-from-fa is deprecated. Use --fa to specify the .fa file instead." mean? Thanks in advance for your clarification!

Iriel


Possible explanations for the 36 remaining variants:

  • Contigs not present in or named differently in the .fasta file (how is the pseudoautosomal region encoded?)
  • Indels
  • Strand-flips (arguably the most annoying)
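The 36 comes from the log: 2804 variants loaded, minus 629 changed, minus 2139 validated. A minimal sketch for checking the first two causes in the list is given here; it assumes a standard six-column .bim (chromosome, variant ID, cM, position, allele 1, allele 2) with uppercase alleles and an uncompressed FASTA. The function names are hypothetical. plink2 may normalize some chromosome codes on its own (for example 23 versus X), so treat the output as leads to inspect rather than a definitive list. Strand flips cannot be detected this way, since that requires looking up the reference base at each position.

```shell
# Chromosome codes used in the .bim that do not appear as FASTA sequence names.
contigs_missing_from_fasta() {
  bim=$1
  fasta=$2
  tmp_bim=$(mktemp)
  tmp_fa=$(mktemp)
  # Column 1 of the .bim holds the chromosome code.
  awk '{print $1}' "$bim" | sort -u > "$tmp_bim"
  # FASTA sequence name = first word after '>' on each header line.
  grep '^>' "$fasta" | sed 's/^>//' | awk '{print $1}' | sort -u > "$tmp_fa"
  # Lines present only in the .bim list.
  comm -23 "$tmp_bim" "$tmp_fa"
  rm -f "$tmp_bim" "$tmp_fa"
}

# IDs of variants whose two alleles are not both single A/C/G/T bases
# (indels, I/D array codes, missing alleles coded as 0, and so on).
non_snp_variants() {
  awk '$5 !~ /^[ACGT]$/ || $6 !~ /^[ACGT]$/ {print $2}' "$1"
}
```

Running `contigs_missing_from_fasta cromXplus_int2.bim human_g1k_v37_decoy.fasta` and `non_snp_variants cromXplus_int2.bim | wc -l` would show how much of the gap these two causes could explain.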

As for the warning, it should go away if you tweak the --ref-from-fa part of the command to "--fa human_g1k_v37_decoy.fasta --ref-from-fa".
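Spelled out in full, the adjusted command from this thread would look roughly like this. It is wrapped in a function (the name is hypothetical) only so the pieces are easy to see; it assumes `plink2` is on PATH, whereas the original post ran `./plink2` from the current directory.

```shell
# Same run as in the question, with the FASTA passed via --fa
# and --ref-from-fa used without a filename argument.
rerun_ref_from_fa() {
  plink2 --bfile cromXplus_int2 \
         --fa human_g1k_v37_decoy.fasta --ref-from-fa \
         --make-bed \
         --out cromXplus_int2_fa
}
```

The "changed" and "validated" counts should be unaffected, since only the way the FASTA is passed differs.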
