Question: Loading 1000 Genomes VCF Files In R
8.5 years ago by
Paul750
United States
Paul750 wrote:

Hi,

I'm looking to get genotypes for SNPs in a particular region of the genome from CEU and YRI HapMap individuals. I need more SNPs than just those genotyped for HapMap, and the 1000 Genomes Project has recently released this SNP call data, generated from sequence data, in VCF format.

This data is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/

The uncompressed file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) is 11 GB. I'm wondering if anyone knows the best way to load this and extract the genotypes from the region I need. Is there a tool in R, or anything else anyone could suggest, for loading and dealing with this kind of data?

Thanks,

Paul

genome hapmap snp • 6.6k views

That file does not give you genotypes. The file containing the genotype is going to be half a terabyte uncompressed, I guess.

Reply written 8.5 years ago by lh331k

It appears you are correct! Have the genotypes not been released yet?

Reply written 8.5 years ago by Paul750

SNP data is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20100804/supporting/

Reply written 8.5 years ago by Paul750

Those are from a previous release, based on 629 individuals.

Reply written 8.5 years ago by Laura1.7k
8.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum124k wrote:

"I'm wondering if anyone has any idea of the best way to load this and extract the genotypes from the region I need"

See tabix

Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface. After indexing, tabix is able to quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". Fast data retrieval also works over network if URI is given as a file name and in this case the index file will be downloaded if it is not present locally.
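
For illustration, here is a minimal Python sketch of what a region query in the "chr:beginPos-endPos" format selects. It is a plain linear scan, not tabix's indexed lookup, and the function names parse_region and lines_in_region are hypothetical. It assumes single-position records (as for SNPs) with the chromosome in column 1 and the 1-based position in column 2, as in VCF.

```python
def parse_region(region):
    """Split a 'chr:beginPos-endPos' string into (chrom, begin, end)."""
    chrom, _, span = region.rpartition(":")
    begin, _, end = span.partition("-")
    return chrom, int(begin), int(end)


def lines_in_region(lines, region):
    """Yield data lines whose position falls inside the region (inclusive).

    Assumes tab-delimited lines with the chromosome in column 1 and a
    1-based position in column 2. Header lines starting with '#' are skipped.
    """
    chrom, begin, end = parse_region(region)
    for line in lines:
        if line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if fields[0] == chrom and begin <= int(fields[1]) <= end:
            yield line
```

Here `lines` can be any iterable of text lines, for example a handle from `gzip.open(path, "rt")` on a bgzip-compressed file. The scan reads every line, which is exactly the cost that tabix's index avoids on a file of this size.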


See this thread for more info actually.

Reply written 8.5 years ago by Paul750 (modified 10 weeks ago by RamRS25k)
8.4 years ago by
Angel0
United States
Angel0 wrote:

I am wondering what the difference is between this file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) and those named by chromosome?


Please ask a new question.

Reply written 8.4 years ago by Pierre Lindenbaum124k
8.4 years ago by
Fede10
Fede10 wrote:

"sites.vcf" doesn't contain any data about the genotypes ;)
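
One way to check this for any VCF file is to look at its "#CHROM" header line: a sites-only file stops after the eight fixed columns (through INFO), while a file with genotypes adds a FORMAT column followed by one column per sample. A minimal Python sketch, with the hypothetical helper name sample_names:

```python
def sample_names(header_line):
    """Return the sample names from a VCF '#CHROM' header line.

    An empty list means the file is sites-only: it has the eight fixed
    columns but no FORMAT column and no per-sample genotype columns.
    """
    columns = header_line.rstrip("\n").split("\t")
    if columns[0] != "#CHROM":
        raise ValueError("not a VCF column header line")
    # Columns 1-8 are fixed, column 9 is FORMAT, samples start at column 10.
    return columns[9:]
```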

By the way, is anyone else currently seeing "incorrect" data in some .tbi files? (A size of 90 bytes is way too small...)


That problem has not been fixed.

Reply written 8.4 years ago by Laura1.7k
7.2 years ago by
zhanxw20
United States
zhanxw20 wrote:

You can try using vcf2geno (http://cran.r-project.org/web/packages/vcf2geno/index.html). It takes a tabix-indexed VCF file and extracts genotypes for you.
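
To show what extracting genotypes from a single VCF data line involves, here is a minimal Python sketch. It is not vcf2geno's interface; the function name genotypes is hypothetical, and it reads only the GT sub-field declared in the FORMAT column.

```python
def genotypes(data_line, samples):
    """Map each sample name to its GT string (e.g. '0|1') for one VCF line.

    `samples` is the list of sample names from the '#CHROM' header line,
    in the same order as the genotype columns.
    """
    fields = data_line.rstrip("\n").split("\t")
    format_keys = fields[8].split(":")
    gt_index = format_keys.index("GT")  # raises ValueError if GT is absent
    calls = {}
    for name, column in zip(samples, fields[9:]):
        calls[name] = column.split(":")[gt_index]
    return calls
```

This only works on a file that actually carries genotype columns, so a sites-only VCF has nothing for it to read.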
