Question

How To Download Genotype Data From The 1000Genomes Project For A Single Gene?

12.7 years ago

User 7433 ▴ 170

Hi there,

I need to download genotype data from the 1000genomes project for a single gene. I understand there is a tool called the 'data slicer' that allows you to take a chunk from a VCF file to access only what you need.

But...the trouble is I have absolutely no idea how to do this, or even download or open a VCF file (what is a VCF file?!). Is there anyone out there kind enough to help me do this? I need someone to explain in as SIMPLE terms as possible because I am really unsure of how to do this - and the thing is, this is crucial to my thesis - I need this data so bad!

genome vcf • 8.8k views

ADD COMMENT • link updated 12.7 years ago by Jorge Amigo 14k • written 12.7 years ago by User 7433 ▴ 170

Thanks for the response.

Okay, so to start with, how do I locate the VCF file that contains the population genotype data for the CYP3A4 gene, which is located on chromosome 7?

Once located, how do I download and open a VCF file so that the population genotypes are visible and transferable, say to Excel, for me to then incorporate with my own data?

And what do you mean by 'query it through tabix'?

x

ADD REPLY • link 12.7 years ago by User 7433 ▴ 170

score 6 · Answer 1 · 2011-08-03

Next-generation sequencing technologies have come with several bioinformatics advances, one of them being VCF files, a fast and efficient way of making large variant data accessible by allowing it to be indexed. They are easily queried by tools like tabix, which retrieves portions of these presumably huge VCF files without your having to download them locally.
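Since the question asks what a VCF file actually is, here is a minimal sketch of its layout. The two variants, their IDs and the sample names SAMPLE_A and SAMPLE_B are made up for illustration, and the helper names make_example_vcf and genotype_columns are hypothetical. Only the general structure follows the VCF format: "##" meta lines, one "#CHROM" header line, then one tab-separated row per variant with one genotype column per individual.

```shell
#!/bin/sh
# make_example_vcf (hypothetical helper): print a tiny, made-up VCF.
# Real 1000genomes files have the same layout, with far more rows and samples.
make_example_vcf() {
    printf '##fileformat=VCFv4.1\n'
    printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE_A\tSAMPLE_B\n'
    printf '7\t99354600\trs_example1\tA\tG\t.\tPASS\t.\tGT\t0|0\t0|1\n'
    printf '7\t99360000\trs_example2\tC\tT\t.\tPASS\t.\tGT\t1|1\t0|0\n'
}

# genotype_columns (hypothetical helper): skip the "##" meta lines and keep
# position, reference allele, alternate allele and the two sample columns.
genotype_columns() {
    awk -F'\t' -v OFS='\t' '!/^##/ { print $2, $4, $5, $10, $11 }'
}

make_example_vcf | genotype_columns
```

In the genotype columns, 0 stands for the reference allele and 1 for the alternate allele, so 0|1 is an individual carrying one copy of each.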

I guess that in order to meet this requirement, which is so crucial to your thesis (if this is part of your thesis, you'll definitely have to read something about 1000genomes, VCF files, NGS and SAMtools at least, in order to describe the whole process), you will have to look for the 1000genomes VCF file you are interested in (probably the latest chr7 genotypes release), and then query it through tabix (you should read the short yet useful description of this software to understand what it does and how to use it) using the gene's coordinates, which you may already know (since the latest 1000genomes data is hg19 based, these would be 99354583-99381811 from what I see at NCBI).

Once you download tabix from the appropriate SAMtools SourceForge section, you'll be able to obtain the data you need by running this simple command on your Linux console (note that you won't have to download the entire ~1GB chr7 file, as tabix will retrieve only the data you need):

tabix -p vcf ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr7.phase1.projectConsensus.genotypes.vcf.gz 7:99354583-99381811
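To get from that command to something you can open in Excel, a rough sketch along these lines might help. The helper names fetch_region and vcf_to_table and the output file name are hypothetical, the URL is the one quoted above and may have moved since this was written, and the region is written as 7:... on the assumption that the 1000genomes files label the chromosome "7" rather than "chr7" (worth checking against the file you actually use). The -h option asks tabix to print the header lines too, so the sample names come along with the genotypes.

```shell
#!/bin/sh
# URL as quoted in the answer; it may no longer be valid.
VCF_URL="ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr7.phase1.projectConsensus.genotypes.vcf.gz"
# CYP3A4 on hg19; assumes the file names the chromosome "7", not "chr7".
REGION="7:99354583-99381811"

# fetch_region (hypothetical helper): stream the header plus the records
# that overlap one region, without downloading the whole file.
fetch_region() {
    tabix -h "$1" "$2"
}

# vcf_to_table (hypothetical helper): drop the "##" meta lines and keep the
# "#CHROM" column header and the variant rows, which are already tab-separated.
vcf_to_table() {
    grep -v '^##'
}

# Usage (needs tabix installed and network access):
#   fetch_region "$VCF_URL" "$REGION" | vcf_to_table > CYP3A4_genotypes.tsv
```

The resulting file is plain tab-separated text, so Excel should be able to import it with one variant per row and one individual per column.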