Loading 1000 Genomes VCF Files in R
12.8 years ago
Paul ▴ 760

Hi,

I'm looking to get genotypes for SNPs in a particular region of the genome from CEU and YRI HapMap individuals. I need more SNPs than just those genotyped for HapMap, and the 1000 Genomes Project has recently released SNP call data, generated from sequence data, in VCF format.

This data is here: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20101123/interim_phase1_release/

The uncompressed file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) is 11 GB. Does anyone know the best way to load it and extract the genotypes for the region I need? Is there a tool in R, or anything else, that anyone could suggest for loading and working with this kind of data?

Thanks,

Paul

genome hapmap snp • 8.4k views

That file does not give you genotypes. I'd guess the file containing the genotypes is going to be about half a terabyte uncompressed.


It appears you are correct! Have the genotypes not been released yet?


Those are from a previous release, based on 629 individuals.

12.8 years ago

"I'm wondering if anyone has any idea of the best way to load this and extract the genotypes from the region I need"

See tabix

Tabix indexes a TAB-delimited genome position file in.tab.bgz and creates an index file in.tab.bgz.tbi when region is absent from the command-line. The input data file must be position sorted and compressed by bgzip which has a gzip(1) like interface. After indexing, tabix is able to quickly retrieve data lines overlapping regions specified in the format "chr:beginPos-endPos". Fast data retrieval also works over network if URI is given as a file name and in this case the index file will be downloaded if it is not present locally.
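As a rough illustration of the region query described above, here is a minimal Python sketch that builds a `tabix` call for a "chr:beginPos-endPos" region and pulls the GT field for each sample out of the returned VCF lines. It assumes the `tabix` executable is on your PATH and that the file is bgzip-compressed, indexed, and actually contains genotype columns. The function names and the sample names in the usage note are made up for illustration; they are not part of tabix or of any 1000 Genomes file.

```python
import subprocess


def region_string(chrom, start, end):
    """Format a region as tabix expects it: chr:beginPos-endPos."""
    return f"{chrom}:{start}-{end}"


def tabix_command(vcf, chrom, start, end):
    """Build the tabix command line; -h also prints the VCF header lines."""
    return ["tabix", "-h", vcf, region_string(chrom, start, end)]


def parse_genotypes(vcf_lines):
    """Yield (chrom, pos, id, {sample: GT}) for each data line.

    Lines without FORMAT/sample columns (a sites-only VCF) yield an
    empty genotype dict.
    """
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        genotypes = {}
        if len(fields) > 9 and "GT" in fields[8].split(":"):
            gt_index = fields[8].split(":").index("GT")
            genotypes = {
                sample: column.split(":")[gt_index]
                for sample, column in zip(samples, fields[9:])
            }
        yield fields[0], int(fields[1]), fields[2], genotypes


def fetch_region(vcf, chrom, start, end):
    """Run tabix (assumed to be installed) and parse what it prints."""
    result = subprocess.run(
        tabix_command(vcf, chrom, start, end),
        check=True, capture_output=True, text=True,
    )
    return list(parse_genotypes(result.stdout.splitlines()))
```

For example, `fetch_region("my_genotypes.vcf.gz", "1", 100000, 200000)` (a hypothetical file name) would return one tuple per variant in that window, so only the lines overlapping the region are ever read into memory rather than the whole file.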


See this thread for more info actually.

12.8 years ago
Angel • 0

What is the difference between this file (ALL.wgs.phase1.projectConsensus.snps.sites.vcf) and those named by chromosome?


Please ask a new question.

12.8 years ago
Fede ▴ 10

"sites.vcf" doesn't contain any data about the genotypes ;)
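One way to confirm this for any VCF without loading it is to look at the `#CHROM` header line: a VCF always has eight fixed columns, and genotype data, when present, adds a FORMAT column followed by one column per sample. The sketch below is only an illustration (the function name is hypothetical) and reads no further than the header.

```python
def has_genotype_columns(vcf_lines):
    """Return True if the #CHROM header lists FORMAT plus at least one sample."""
    for line in vcf_lines:
        if line.startswith("##"):
            continue  # meta-information lines come before the column header
        if line.startswith("#CHROM"):
            columns = line.rstrip("\n").split("\t")
            # 8 fixed columns + FORMAT + at least one sample = more than 9
            return len(columns) > 9 and columns[8] == "FORMAT"
        break  # reached a non-header line without seeing #CHROM
    return False
```

It accepts any iterable of text lines, so an open file handle works; for a compressed file, something like `gzip.open(path, "rt")` from the standard library could be passed in. Because it returns at the header, it stays cheap even on an 11 GB file.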

By the way, is anyone else seeing "incorrect" data in some of the .tbi files right now? (A size of 90 bytes is way too small...)
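A quick sanity check for a suspicious index file might look like the sketch below. It relies on the tabix index being a BGZF (gzip-compatible) file whose decompressed contents begin with the magic bytes `TBI\1`; the function name is hypothetical, and passing this check does not prove the index is complete, only that it is not obviously the wrong kind of file.

```python
import gzip
import os
import zlib

TABIX_MAGIC = b"TBI\x01"  # first four bytes of a decompressed .tbi file


def check_tabix_index(path):
    """Return (size_in_bytes, magic_ok) for a candidate .tbi file."""
    size = os.path.getsize(path)
    try:
        # BGZF is a series of gzip members, so the gzip module can read it
        with gzip.open(path, "rb") as handle:
            magic_ok = handle.read(4) == TABIX_MAGIC
    except (OSError, EOFError, zlib.error):
        # not gzip data at all, or truncated/corrupt compressed data
        magic_ok = False
    return size, magic_ok
```

If the magic check fails, or the reported size is only a few dozen bytes, re-downloading the index or rebuilding it locally with tabix is probably the safer route.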


That problem has not been fixed.

11.6 years ago
zhanxw ▴ 20

You can try vcf2geno: http://cran.r-project.org/web/packages/vcf2geno/index.html It takes a tabix-indexed VCF file and extracts genotypes for you.
