How To Download Genotype Data From The 1000Genomes Project For A Single Gene?
11.2 years ago
User 7433 ▴ 160

Hi there,

I need to download genotype data from the 1000genomes project for a single gene. I understand there is a tool called the 'data slicer' that allows you to take a chunk from a VCF file to access only what you need.

But...the trouble is I have absolutely no idea how to do this, or even download or open a VCF file (what is a VCF file?!). Is there anyone out there kind enough to help me do this? I need someone to explain in as SIMPLE terms as possible because I am really unsure of how to do this - and the thing is, this is crucial to my thesis - I need this data so bad!

genome vcf • 8.2k views

Thanks for the response..

Okay - so to start with, how do I locate the VCF file that contains the population genotype data for the CYP3A4 gene, which is located on chromosome 7?

Once located - how do I download and open a VCF file so that the population genotypes are visible and transferable, say to Excel, for me to then incorporate with my own data?

And what do you mean by 'query it through tabix'?

x

11.2 years ago

Next-generation sequencing technologies have brought several bioinformatics advances with them, one being the VCF (Variant Call Format) file: a plain-text, tab-delimited format that lists one variant per line and, once compressed and indexed, makes large variant datasets fast to access. These files are easily queried with tools like tabix, which retrieves just a portion of these presumably huge VCF files without your having to download them locally.

To meet this crucial thesis requirement (and if this is part of your thesis, you will definitely have to read up on 1000genomes, VCF files, NGS and SAMtools at least, in order to describe the whole process), you will have to find the 1000genomes VCF file you are interested in (probably the latest chr7 genotypes release) and then query it through tabix (read the short yet useful description of this software to understand what it does and how to use it), using the gene's coordinates, which you may already know (since the latest 1000genomes data is hg19 based, these would be 99354583-99381811 from what I see at NCBI).

Once you download tabix from the appropriate SAMtools SourceForge section, you'll be able to obtain the data you need by running this simple command in your Linux console. Note that you won't have to download the entire ~1GB chr7 file, as tabix will retrieve only the data you need. The -h option keeps the header lines (including the one with the sample names), and the region is written as 7:start-end because these 1000genomes files name the chromosome 7, without a chr prefix:

tabix -h ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20101123/interim_phase1_release/ALL.chr7.phase1.projectConsensus.genotypes.vcf.gz 7:99354583-99381811
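
Since you asked about getting the genotypes into Excel: a VCF is just tab-separated text, so a small filter is enough to turn tabix output into a table a spreadsheet can open. Below is a minimal sketch, not a definitive recipe. The function name vcf_to_table and the three-line sample record are made up for illustration (they are not real 1000genomes data); only awk and printf are assumed to be installed.

```shell
#!/bin/sh
# vcf_to_table (hypothetical helper): read VCF text on stdin and write a
# tab-delimited table on stdout that a spreadsheet can open.
vcf_to_table() {
    # Skip the "##" meta-information lines, strip the leading "#" from the
    # "#CHROM ..." column-header line, and pass variant lines through as-is.
    awk '/^##/ { next } { sub(/^#/, ""); print }'
}

# Demo on a made-up VCF fragment (illustrative only, not real genotype data).
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1\n7\t99360000\t.\tA\tG\t100\tPASS\t.\tGT\t0|1\n' | vcf_to_table
```

To use it on the real data, pipe the output of the tabix query into vcf_to_table and redirect the result to a file (say CYP3A4.tsv, a name picked here only as an example), then open that file in Excel as tab-delimited text. Each row is one variant, and each column after FORMAT holds the genotype of one sample.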

You are too kind Jorge!
