Question

How to load 1000 Genomes VCF and TBI files into R


4.2 years ago

lxiao63 • 0

Dear all,

I am very much a beginner in genetic data analysis. I have recently been trying to learn how to perform GWAS in R by following the article "A guide to genome-wide association analysis and post-analytic interrogation". For the SNP imputation step, the authors used SNP data on Chr16 for demonstration. They used the read.pedfile function in the snpStats package to load the "chr16_1000g_CEU.ped" and "chr16_1000g_CEU.info" files into R (files publicly available from https://www.mtholyoke.edu/courses/afoulkes/Data/GWAStutorial/).

I wish to find 1000 Genomes SNP data for other chromosomes. At ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/, I found vcf.gz and vcf.gz.tbi files for each chromosome. For example, for chromosome 16, I found "ALL.chr16.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz" and "ALL.chr16.integrated_phase1_v3.20101123.snps_indels_svs.genotypes.vcf.gz.tbi".

My questions are:

1. Are the vcf.gz and vcf.gz.tbi files for Chr16 that I found equivalent to the "chr16_1000g_CEU.ped" and "chr16_1000g_CEU.info" files the authors provided? If yes, I could simply download SNP data for the other chromosomes for my own GWAS.
2. As I understand it, the vcf.gz file contains genotype information and the vcf.gz.tbi file contains position information. I tried to load the two files I downloaded from the 1000 Genomes site into R, but I failed. I also tried an 8-year-old Biostars post (Loading 1000 Genomes Vcf Files In R), but it did not work. My guess is that the vcf.gz file is analogous to "chr16_1000g_CEU.ped" in the paper and the vcf.gz.tbi file is analogous to "chr16_1000g_CEU.info". However, I could not find a way to convert vcf.gz to .ped and vcf.gz.tbi to .info before loading them into R, nor a method that loads vcf.gz and vcf.gz.tbi directly into R. Any solution is welcome.

Thanks, Patrick Lv

SNP R gene • 2.7k views


score 1 · Answer 1 · 2020-02-16


4.2 years ago

Sam ★ 4.7k

No. They are in different formats.
If you want to work with the ped file, you will need to convert them using PLINK. Or you can directly download the data here
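
To make the format difference concrete: a VCF has one row per variant and one genotype column per sample, whereas a .ped file has one row per sample. The .tbi file is only a tabix index for fast region lookups, not a marker-information file. The base R sketch below parses a tiny made-up VCF and turns the genotypes into non-reference allele counts. The variants, the sample names S1/S2 and the helper count_alt are invented for illustration. It is not meant for a full chromosome file, where PLINK (or, as far as I know, readVcf from the Bioconductor package VariantAnnotation) is the more practical route.

```r
# Tiny made-up VCF (two variants, two samples) standing in for a real file.
vcf_lines <- c(
  "##fileformat=VCFv4.1",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2",
  "16\t100\trs1\tA\tG\t100\tPASS\t.\tGT\t0|0\t0|1",
  "16\t200\t.\tC\tT\t100\tPASS\t.\tGT\t1|1\t0|1"
)

# Drop the "##" meta lines; keep the "#CHROM" header line and the records.
body <- vcf_lines[!grepl("^##", vcf_lines)]
body[1] <- sub("^#", "", body[1])
vcf <- read.table(text = body, header = TRUE, sep = "\t", quote = "",
                  comment.char = "", colClasses = "character")

# Hypothetical helper: count non-reference alleles in a GT string,
# e.g. "0|0" -> 0, "0|1" -> 1, "1|1" -> 2; missing calls ("./.") -> NA.
count_alt <- function(gt) {
  vapply(strsplit(gt, "[|/]"), function(a) {
    if (any(a == ".")) NA_real_ else sum(a != "0")
  }, numeric(1))
}

fixed_cols  <- c("CHROM", "POS", "ID", "REF", "ALT",
                 "QUAL", "FILTER", "INFO", "FORMAT")
sample_cols <- setdiff(names(vcf), fixed_cols)

# Variants in rows, samples in columns (the transpose of a .ped layout).
dosage <- sapply(vcf[sample_cols], count_alt)
rownames(dosage) <- paste(vcf$CHROM, vcf$POS, sep = ":")
print(dosage)
```

Note that the second variant has "." as its ID, which is common in VCF files and matters once the data are converted to PLINK format.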



Thank you Sam.

1. Is there a way to convert vcf.gz and vcf.gz.tbi to .ped and .info with R? I have no previous experience with PLINK, although I will learn it if I have to.
2. I downloaded SNP data from the webpage you suggested. Taking Chr16 as an example, I downloaded the 1kg_phase1_chr16.tar.gz file, decompressed it to a .tar file, and then extracted three files in .bed, .bim, and .fam format. I then used the code below to load the three files into R:

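library(snpStats)
# Forward slashes avoid backslash-escape problems in R strings on Windows
path <- "D:/Downloads"
# read.plink takes the common file stem of the .bed/.bim/.fam trio
snps <- read.plink(file.path(path, "1kg_phase1_chr16"), na.strings = "-9")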

However, I got the following error message:

Error in `.rowNamesDF<-`(x, value = value) : duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique value when setting 'row.names': ‘.’

The above code worked well when I loaded the bed/bim/fam files for the GWAS paper I mentioned earlier, but not for the files downloaded from your recommended page. I would greatly appreciate any help with this error. Thank you.

4.2 years ago by lxiao63 • 0


It'd be much easier to use PLINK than to use R for this type of analysis.

The error message suggests that some of the SNPs have duplicated names (e.g. ".", which is typical when you convert VCF to PLINK format). You will need to do some preprocessing beforehand. You might want to look into the PLINK manual page.
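
If you want to stay in R for that preprocessing step, one possible approach is to rewrite the variant IDs in the .bim file so that every ID is unique before calling read.plink. A .bim file has six columns (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2), and the IDs can be changed without touching the .bed file as long as the row order stays the same. The sketch below uses a made-up four-row table. The column names and the helper fix_ids are my own choices for illustration, not anything snpStats or PLINK defines. I believe PLINK 1.9 also documents a --set-missing-var-ids option for the same purpose, so check the manual if you go that way.

```r
# Made-up .bim content: chromosome, variant ID, genetic distance (cM),
# base-pair position, allele 1, allele 2. Column names are assumptions;
# a real .bim file has no header line.
bim <- data.frame(
  chr = c("16", "16", "16", "16"),
  id  = c("rs1", ".", ".", "rs1"),
  cm  = 0,
  pos = c(100L, 200L, 300L, 400L),
  a1  = c("A", "C", "G", "T"),
  a2  = c("G", "T", "A", "C"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: give "." IDs a chr:pos name, then make any
# remaining duplicates unique by appending _1, _2, ...
fix_ids <- function(bim) {
  missing_id <- bim$id == "."
  bim$id[missing_id] <- paste(bim$chr[missing_id], bim$pos[missing_id],
                              sep = ":")
  bim$id <- make.unique(bim$id, sep = "_")
  bim
}

fixed <- fix_ids(bim)
print(fixed$id)

# Write the result in .bim layout (tab-separated, no header, no quotes).
# A temporary file is used here; with real data, keep a backup of the
# original .bim before replacing it.
out_file <- tempfile(fileext = ".bim")
write.table(fixed, out_file, sep = "\t", quote = FALSE,
            row.names = FALSE, col.names = FALSE)
```

With a real dataset you would read the .bim with read.table(..., header = FALSE), apply the same renaming, write it back under the original file stem, and then rerun read.plink.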

4.2 years ago by Sam ★ 4.7k