Question: How to load 1000 genome vcf and tbi file into R
11 months ago, lxiao63 wrote:

Dear all,

I am very much a beginner in genetic data analysis. I have recently been trying to learn how to perform GWAS in R by following the article "A guide to genome-wide association analysis and post-analytic interrogation". For the SNP imputation step, the authors demonstrate with SNP data on Chr16: they use the read.pedfile function in the snpStats package to load the "chr16_1000g_CEU.ped" and "" files into R (files publicly available from

I would like to find 1000 Genomes SNP data for other chromosomes. From, I found vcf.gz and vcf.gz.tbi files associated with each chromosome. For example, for chromosome 16, I found "ALL.chr16.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz" and "ALL.chr16.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi".

My questions are:

  1. Are the vcf.gz and vcf.gz.tbi files for Chr16 I found equivalent to the "chr16_1000g_CEU.ped" and "" files the authors provided? If yes, I may just download SNP data for other chromosomes for my own GWAS.

  2. I understand that the vcf.gz file contains genotype information and the vcf.gz.tbi file contains position information. I tried to load the two files I downloaded from the 1000 Genomes webpage into R, but failed. I also tried an 8-year-old Biostars post (Loading 1000 Genomes Vcf Files In R), but it did not work. My guess is that the vcf.gz file is analogous to the "chr16_1000g_CEU.ped" file in the paper and the vcf.gz.tbi file is analogous to the "" file. However, I could not find a way to convert vcf.gz to .ped and vcf.gz.tbi to .info before loading them into R, nor a method that loads vcf.gz and vcf.gz.tbi directly into R. Any solution is welcome.

Thanks, Patrick Lv

11 months ago, Sam wrote:
  1. No. They are in different formats.
  2. If you want to work with the ped file, you will need to convert the VCF files using PLINK. Or you can directly download the data here

Thank you Sam.

  1. Is there a way to convert vcf.gz and vcf.gz.tbi to .ped and .info with R? I have no previous experience with PLINK, though of course I will learn it if I have to.

  2. I downloaded SNP data from the webpage you suggested. Taking Chr16 as an example, I downloaded the 1kg_phase1_chr16.tar.gz file, decompressed it to a .tar file, and then extracted that into 3 files in .bed, .bim, and .fam format. I then used the code below to load the three files into R:

    library(snpStats)
    path <- "D:/Downloads"  # forward slashes: "\D" is not a valid escape in an R string
    snps <- read.plink(file.path(path, "1kg_phase1_chr16"), na.strings = "-9")

However, I got the following error message: Error in .rowNamesDF<-(x, value = value) : duplicate 'row.names' are not allowed. In addition: Warning message: non-unique value when setting 'row.names': ‘.’

The above code worked well when I was loading the bed/bim/fam files for the GWAS paper I mentioned earlier but not for those downloaded from your recommended page. It would be highly appreciated if anyone could help with this error. Thank you.

written 11 months ago by lxiao63

It'd be much easier to use PLINK than to use R for this type of analysis.
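If you do want to stay in R, the Bioconductor package VariantAnnotation can read a bgzipped VCF through its tabix index. The following is only a minimal sketch under assumptions: load_vcf_region is a hypothetical helper name, the genome label "hg19" and the region coordinates in the usage comment are placeholders, and it presumes VariantAnnotation, Rsamtools, GenomicRanges and IRanges are installed. Whole-chromosome files are large, so it reads one region at a time.

```r
# Sketch only. `load_vcf_region` is a hypothetical helper, not part of any package.
load_vcf_region <- function(vcf_path, seqname, start, end, genome = "hg19") {
  # The .tbi file is the tabix index of the .vcf.gz file; it lets R fetch
  # one genomic region without scanning the whole chromosome.
  tabix <- Rsamtools::TabixFile(vcf_path, index = paste0(vcf_path, ".tbi"))

  # Region to read; `seqname` must match the chromosome naming used in the VCF.
  region <- GenomicRanges::GRanges(seqname, IRanges::IRanges(start, end))
  param <- VariantAnnotation::ScanVcfParam(which = region)

  vcf <- VariantAnnotation::readVcf(tabix, genome = genome, param = param)

  # Returns a list with $genotypes (a snpStats SnpMatrix) and $map
  # (variant annotation, including an `ignore` flag for unsupported variants).
  VariantAnnotation::genotypeToSnpMatrix(vcf)
}

# Usage (assumed chromosome name "16" and an arbitrary 1 Mb window):
# vcf_file <- "ALL.chr16.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz"
# res <- load_vcf_region(vcf_file, seqname = "16", start = 1, end = 1e6)
# res$genotypes
```

The result is a SnpMatrix, the same class that read.plink and read.pedfile produce, so it should slot into a snpStats-based workflow. Check the chromosome naming in your own file before relying on seqname = "16".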

The error message suggests that some of the SNPs have duplicated names (e.g. ".", which is typical when you convert VCF to PLINK format). You will need to do some preprocessing beforehand. Might want to look into the PLINK manual page
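One way to do that preprocessing without leaving R is to rewrite the .bim file so that every variant ID is unique before calling read.plink. This is a minimal sketch using base R only; make_bim_ids_unique is a hypothetical helper, and the chr:pos naming for "." IDs is an assumption for illustration, not a PLINK or snpStats convention.

```r
# Hypothetical helper: write a copy of a PLINK .bim file with unique variant IDs.
make_bim_ids_unique <- function(bim_in, bim_out) {
  # Read every column as character so alleles such as "T" are not parsed as TRUE.
  bim <- read.table(bim_in, header = FALSE, colClasses = "character",
                    col.names = c("chr", "id", "cm", "bp", "a1", "a2"),
                    quote = "", comment.char = "")

  # Replace missing IDs (".") with "chr:pos" (an assumed naming scheme).
  missing_id <- bim$id == "."
  bim$id[missing_id] <- paste(bim$chr[missing_id], bim$bp[missing_id], sep = ":")

  # Any IDs that still collide get a numeric suffix: "16:100", "16:100_1", ...
  bim$id <- make.unique(bim$id, sep = "_")

  write.table(bim, bim_out, quote = FALSE, sep = "\t",
              row.names = FALSE, col.names = FALSE)
  invisible(bim)
}

# Usage with the files from the post above (requires snpStats):
# stem <- file.path("D:/Downloads", "1kg_phase1_chr16")
# make_bim_ids_unique(paste0(stem, ".bim"), paste0(stem, ".unique.bim"))
# snps <- snpStats::read.plink(bed = paste0(stem, ".bed"),
#                              bim = paste0(stem, ".unique.bim"),
#                              fam = paste0(stem, ".fam"),
#                              na.strings = "-9")
```

Writing to a new .bim file leaves the original untouched; read.plink accepts the bed, bim and fam paths separately, so the .bed and .fam files do not need to be copied or renamed.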

written 11 months ago by Sam