Question: How To Filter VCF Files From 1000 Genomes Release V3.2010-11 (Alternative Source)?
user56290 (United States) wrote 6.9 years ago:

I want to use VCF files from WGS to arrive at pharmacogenomics clinical recommendations (relevant to a single patient, not a population).

I decided to use VCF as the standard for input data and 1000 Genomes as the test population. I believe the files I need are here: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ (if not, please comment on that)

The problem is that the files are too big. For example, the chromosome 6 data for all populations is 9 GB. The data for all chromosomes would be 80+ GB.

Example of chr6 file: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/release/20110521/ALL.chr6.phase1_release_v3.20101123.snps_indels_svs.genotypes.vcf.gz

Is there any alternative source where the 1000 Genomes data is available in a different form (for example, as smaller subsets)?

What I would like to do:

  • Use only calls with SNPSOURCE=EXOME
  • Keep only known SNPs (those within dbSNP)
  • Filter out all indels
  • Reduce the number of genomes (e.g., 1 patient, or no more than 50 patients)

Is the only way to download all 80 GB and let it crunch for a long time (and, for me, also improve my Linux knowledge; I am a Windows, SQL and R person)? Any advice is greatly appreciated.

P.S. It seems all genomic data is kept in files. I am good with large databases and could do what I need much more easily in a database. After all, a VCF file is like a database table.
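
To make the table analogy concrete, here is a minimal Python sketch of the four filters listed above, treating each VCF data line as a tab-separated row. It is a sketch under assumptions, not a tested pipeline: `filter_vcf_lines` is a hypothetical helper name, an ID starting with `rs` is used as a stand-in for "known to dbSNP", `SNPSOURCE` is assumed to be an INFO key whose value may be a comma-separated list, and any record whose REF or ALT is not a single base is treated as an indel or structural variant and dropped.

```python
from typing import Iterable, Iterator, List, Optional, Set

BASES = {"A", "C", "G", "T"}


def filter_vcf_lines(lines: Iterable[str], keep_samples: Set[str]) -> Iterator[str]:
    """Yield VCF lines that pass the filters, keeping only the chosen sample columns."""
    keep_idx: Optional[List[int]] = None
    for raw in lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            # Meta-information lines are passed through unchanged.
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Columns 0-8 are fixed (CHROM..FORMAT); sample columns start at 9.
            keep_idx = list(range(9)) + [
                i for i, name in enumerate(fields) if i >= 9 and name in keep_samples
            ]
            yield "\t".join(fields[i] for i in keep_idx)
            continue
        if keep_idx is None:
            raise ValueError("data line seen before the #CHROM header line")
        vid, ref, alt, info = fields[2], fields[3], fields[4], fields[7]
        # Assumption: an rs identifier in the ID column means the site is in dbSNP.
        if not vid.startswith("rs"):
            continue
        # Drop indels and SVs: REF and every ALT allele must be a single base.
        if ref not in BASES or any(a not in BASES for a in alt.split(",")):
            continue
        # Assumption: SNPSOURCE is an INFO key with a comma-separated value list.
        info_map = dict(
            item.split("=", 1) if "=" in item else (item, "")
            for item in info.split(";")
        )
        if "EXOME" not in info_map.get("SNPSOURCE", "").split(","):
            continue
        yield "\t".join(fields[i] for i in keep_idx)
```

For a downloaded file, the lines could come from `gzip.open(path, "rt")`, where `path` is whichever chromosome file was fetched. Whether a call with `SNPSOURCE=LOWCOV,EXOME` should count as an exome call is a judgment call; this sketch keeps it.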

vcf 1000genomes • 2.2k views
Laura (Cambridge UK) wrote 6.9 years ago:

You could use tabix to stream these files from the FTP site and filter out the sites you don't want. You could also reduce the number of individuals if you wanted to.

We have more info about how to use tabix in our FAQ: http://www.1000genomes.org/faq/how-do-i-get-sub-section-vcf-file
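
As a rough illustration of the streaming idea, the sketch below wraps a `tabix -h <url> <region>` call in Python so that only one region of the remote file is pulled. It assumes `tabix` is installed and on the PATH and that the remote file has a `.tbi` index next to it; `tabix_command` and `stream_region` are hypothetical helper names, not part of any library.

```python
import subprocess
from typing import Iterator, List


def tabix_command(vcf_url: str, region: str) -> List[str]:
    """Build the tabix call; -h also prints the VCF header lines."""
    return ["tabix", "-h", vcf_url, region]


def stream_region(vcf_url: str, region: str) -> Iterator[str]:
    """Yield header and data lines for one region without downloading the whole file."""
    proc = subprocess.Popen(
        tabix_command(vcf_url, region),
        stdout=subprocess.PIPE,
        text=True,
    )
    assert proc.stdout is not None
    try:
        for line in proc.stdout:
            yield line.rstrip("\n")
    finally:
        # Always close the pipe and reap the child, even if the caller stops early.
        proc.stdout.close()
        code = proc.wait()
    if code != 0:
        raise RuntimeError(f"tabix exited with status {code}")
```

A call such as `stream_region(url, "6:31000000-31100000")`, with `url` set to the chr6 file from the question, would fetch only that window; the coordinates are an arbitrary example, and the chromosome is assumed to be named `6` rather than `chr6` in these files. tabix typically saves the index file in the working directory on first use. Each yielded line can then be filtered or cut down to a few sample columns before anything is written to disk.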

hershman (Cambridge, Boston MA, United States) wrote 6.9 years ago:

I was unable to find exome calls from the 1000 Genomes Project about a month back. One option to avoid downloading the files is to play with them on Amazon.

thamathpanda (science) wrote 6.9 years ago:

VCFtools bro

A database would probably be slower fyi.
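
For anyone wondering what a VCFtools run for this might look like, here is a hedged sketch that only builds the command line from Python. It uses the `--gzvcf`, `--remove-indels`, `--keep`, `--snps`, `--recode`, `--recode-INFO-all` and `--out` options; the file names are placeholders, and `vcftools_command` is a hypothetical helper. Filtering on `SNPSOURCE=EXOME` is not covered here and would need a separate step.

```python
from typing import List, Optional


def vcftools_command(
    gz_vcf: str,
    samples_file: str,
    out_prefix: str,
    snp_ids_file: Optional[str] = None,
) -> List[str]:
    """Build a vcftools call that drops indels and keeps only the listed samples."""
    cmd = [
        "vcftools",
        "--gzvcf", gz_vcf,        # bgzipped or gzipped VCF input
        "--remove-indels",        # exclude sites that contain an indel
        "--keep", samples_file,   # text file with one sample ID per line
    ]
    if snp_ids_file is not None:
        # Optional: keep only sites whose ID is listed (e.g. a list of rs numbers).
        cmd += ["--snps", snp_ids_file]
    cmd += [
        "--recode",               # write a new VCF with the filters applied
        "--recode-INFO-all",      # carry all INFO fields over to the output
        "--out", out_prefix,      # output is written to <out_prefix>.recode.vcf
    ]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)` once vcftools is installed. This still works on a local file, so it pairs naturally with a region-level download rather than the full 80 GB.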
