Question: Where to find 1000 Genomes phase 3 whole-genome data and how to select only the European population
Opal wrote (2.2 years ago):

Hello:

I am trying to download whole-genome data from 1000 Genomes phase 3 and extract only the EUR populations (GBR, TSI, FIN, IBS, CEU). I used this file from the FTP site:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz

but apparently it is not the file I need. The error message says:

Error: No samples in .vcf file.

My question is: where do I get the whole-genome 1000 Genomes phase 3 data? I also checked the Data Slicer from Ensembl GRCh37. It allows population selection, but the maximum region that can be extracted is 2.5 Mb, so I could not get the whole-genome data that way even if I succeeded in downloading a whole-genome dataset from the above FTP site (assuming one exists).

Opal

regmkbl wrote (2.2 years ago):

The file that you want to download is 1.8 gigabytes. It will take a while to download, depending on your connection. Ensure that it downloads completely before trying to use it.

To view the data in a vcf.gz file, use zcat or bcftools view, or simply decompress it with gunzip.
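A quick way to see whether a downloaded file carries genotypes at all is to count the columns on its #CHROM header line. The sketch below assumes a standard VCF layout (8 fixed columns, then FORMAT and one column per sample); the function name count_samples and the local file name are only illustrative.

```shell
# Count the sample columns in a gzip- or bgzip-compressed VCF.
# The #CHROM header line has 8 fixed columns; files with genotypes add
# a FORMAT column plus one column per sample.
count_samples() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { n = NF - 9; print (n > 0 ? n : 0); exit }'
}

# Illustrative local file name; adjust to whatever you downloaded.
vcf="ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz"
if [ -f "$vcf" ]; then
    echo "samples in $vcf: $(count_samples "$vcf")"
fi
```

A result of 0 means the file lists variant sites only, which would be consistent with a "No samples" error from downstream tools.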


You can also download the data on a per-chromosome basis:

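prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"
suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Download each chromosome's VCF together with its tabix index (.tbi).
# Note: the chrX file on the server may carry a different version tag than v5a;
# check the directory listing if that download fails.
for chr in {1..22} X; do
    wget "${prefix}${chr}${suffix}" "${prefix}${chr}${suffix}.tbi"
done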

You can then merge those into a single file or keep them separate. Either way, then download the 1000 Genomes PED file, which you can use for obtaining IDs for the purposes of filtering:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped ;
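As a rough sketch of the filtering step, the PED file can be reduced to a list of EUR sample IDs and passed to bcftools view. This assumes the PED file is tab-delimited with a header line, the individual ID in column 2 and the population code in column 7 (check with head -1 before relying on it). The function name eur_ids and the output file names are made up for illustration.

```shell
# Print the IDs of individuals whose population is one of the five EUR codes.
# Assumes a tab-delimited PED with a header row, ID in column 2 and
# population in column 7.
eur_ids() {
    awk -F'\t' 'NR > 1 && $7 ~ /^(GBR|TSI|FIN|IBS|CEU)$/ { print $2 }' "$1"
}

ped="20130606_g1k.ped"
vcf="ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Only runs when bcftools and both input files are present.
if command -v bcftools >/dev/null 2>&1 && [ -f "$ped" ] && [ -f "$vcf" ]; then
    eur_ids "$ped" > eur_samples.txt
    # --force-samples turns unknown IDs into warnings, in case the PED
    # lists individuals that are absent from the VCF.
    bcftools view -S eur_samples.txt --force-samples -Oz -o EUR.chr22.vcf.gz "$vcf"
fi
```

The same call can be repeated per chromosome, or run once on a merged file.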

Note that on macOS, wget may not be installed. If you use Homebrew, you can install it with brew install wget.

Kevin


Hi Kevin,

I actually think the file

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz

is not the right file, because the error says 'No samples in .vcf file'.

The naming of this file is also different from the chromosome-specific files, such as the one below (sites.vcf.gz instead of genotypes.vcf.gz):

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

written 2.2 years ago by Opal

Hi Kevin,

I have downloaded the vcf.gz files for chromosomes 1-22 (I don't need X and Y), but they are pretty big files. Is there any way to concatenate them without unzipping? Also, could you elaborate on how to use the .ped file to extract only the EUR populations (GBR, CEU, TSI, IBS, FIN)?

Opal

written 2.2 years ago by Opal

Hey Opal. Yes, I would even recommend converting them to BCF (binary call format), which saves even more space. You can then use BCFtools to concatenate them, e.g., with bcftools concat.
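One possible way to do this, sketched under the assumption that the 22 autosome files sit in the current directory under their original names: build a file list in chromosome order, then let bcftools concat read the compressed files directly and write BCF. The names list_chr_files, chr_files.txt and ALL.autosomes.bcf are placeholders.

```shell
suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Print the per-chromosome file names in chromosome order (1..22).
list_chr_files() {
    for chr in $(seq 1 22); do
        echo "ALL.chr${chr}${suffix}"
    done
}

list_chr_files > chr_files.txt

# Only runs when bcftools is installed and the first input file exists.
if command -v bcftools >/dev/null 2>&1 && [ -f "$(head -n 1 chr_files.txt)" ]; then
    # -f reads input names from a file; -Ob writes compressed BCF.
    bcftools concat -f chr_files.txt -Ob -o ALL.autosomes.bcf
    bcftools index ALL.autosomes.bcf
fi
```

Concatenating in chromosome order keeps the output sorted, so it can be indexed afterwards.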

written 2.2 years ago by regmkbl