Question: Where to find 1000 Genomes phase 3 whole-genome data and select only the European population
Opal wrote (14 months ago):

Hello:

I am trying to download whole-genome data from 1000 Genomes phase 3 and extract only the EUR populations (GBR, TSI, FIN, IBS, CEU). I used this file from the FTP site:

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz

but apparently it is not the file I need. The error message says:

Error: No samples in .vcf file.

My question is: where do I get the whole-genome 1000 Genomes phase 3 data? I also checked the Data Slicer from Ensembl GRCh37. It allows population selection, but the largest genomic region it can extract is 2.5 Mb, so I could not use it on the whole genome even if I succeeded in downloading a whole-genome dataset from the above FTP site (assuming one exists).

Opal

Kevin Blighe wrote (14 months ago):

The file that you want to download is 1.8 gigabytes. It will take a while to download, depending on your connection. Ensure that it downloads completely before trying to use it.
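A quick completeness check can be sketched with gzip's built-in test mode. The helper name gz_ok is hypothetical. Note the limitation: a bgzip file cut off exactly at a block boundary can still pass this test, so treat a pass as necessary rather than sufficient.

```shell
# Hypothetical helper: succeed only if the gzip stream in "$1" can be read to its end.
# gzip -t decompresses in memory and checks the trailer; it writes no output file.
gz_ok() {
    gzip -t "$1" 2>/dev/null
}
```

For example, gz_ok file.vcf.gz && echo "looks complete" reports success only for a readable archive.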

To view data in a vcf.gz file, use zcat or bcftools view, or just decompress it (e.g., with gunzip).
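To see whether a given vcf.gz carries genotypes at all, one option is to count the sample columns on its #CHROM header line. This is a minimal sketch: the function name vcf_sample_count is hypothetical, and gzip -dc is used as a portable stand-in for zcat. A sites-only file has just the eight fixed columns and therefore reports 0 samples.

```shell
# Hypothetical helper: print the number of sample columns in a gzip/bgzip-compressed VCF.
# The first 9 columns of the #CHROM line are fixed (CHROM ... INFO, FORMAT);
# every column after those is one sample.
vcf_sample_count() {
    gzip -dc "$1" | awk -F'\t' '
        /^#CHROM/ { n = (NF > 9) ? NF - 9 : 0; print n; found = 1; exit }
        END { if (!found) print 0 }'
}
```

If bcftools is installed, bcftools query -l file.vcf.gz | wc -l gives the same count.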


You can also download the data on a per-chromosome basis:

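prefix="ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"
suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"

# Download each chromosome's genotype VCF together with its tabix index (.tbi).
# Note: the X chromosome file may use a different version tag in its name;
# check the FTP directory listing if that download fails.
for chr in {1..22} X; do
    wget "${prefix}${chr}${suffix}" "${prefix}${chr}${suffix}.tbi"
done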

You can then merge those into a single file or keep them separate. Either way, then download the 1000 Genomes PED file, which you can use for obtaining IDs for the purposes of filtering:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/working/20130606_sample_info/20130606_g1k.ped

Note that on macOS, wget may not be installed by default. You can install it with Homebrew: brew install wget
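One way the PED file could be used to obtain the IDs is sketched below. It assumes the file is tab-delimited with a header line, with the individual ID in column 2 and the population code in column 7. Check this with head -1 on your copy before relying on it. The function name eur_sample_ids is hypothetical.

```shell
# Hypothetical helper: print the individual IDs (column 2) of all PED rows
# whose population code (column 7, assumed) is one of the five EUR populations.
# NR > 1 skips the header line.
eur_sample_ids() {
    awk -F'\t' 'NR > 1 && $7 ~ /^(GBR|TSI|FIN|IBS|CEU)$/ { print $2 }' "$1"
}
```

Running eur_sample_ids 20130606_g1k.ped > eur_samples.txt writes one ID per line. The PED file may list individuals that are not present in the phase 3 VCFs, so expect a sample-aware tool to warn about unknown IDs.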

Kevin


Hi Kevin,

I actually think the file

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.wgs.phase3_shapeit2_mvncall_integrated_v5b.20130502.sites.vcf.gz

is not the right file, because it gives the error 'No samples in .vcf file'.

The naming of this file is also different from the chromosome-specific files, such as the one listed below (sites.vcf.gz instead of genotypes.vcf.gz):

ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Reply written 14 months ago by Opal

Hi Kevin,

I have downloaded the vcf.gz files for chromosomes 1-22 (I don't need X and Y). They are pretty big files, though. Is there any way to concatenate them without unzipping? Also, would you be able to elaborate on how to use the .ped file to extract only the EUR populations (GBR, CEU, TSI, IBS, FIN)?

Opal

Reply written 14 months ago by Opal

Hey Opal. Yes, you can concatenate them without unzipping by using BCFtools, e.g., bcftools concat. I would even recommend converting them to BCF (the binary version of VCF), which saves even more space.
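A minimal sketch of how the concatenation and the sample filtering might be combined is given below. It assumes bcftools is installed, that the per-chromosome files sit in the current directory under their original FTP names, and that a file with one sample ID per line already exists. The function name concat_and_subset and the file names in the usage line are hypothetical.

```shell
# Hypothetical helper: concatenate chromosomes 1-22 in numeric order, keep only
# the samples listed in "$1" (one ID per line), and write compressed BCF to "$2".
concat_and_subset() {
    samples=$1
    out=$2
    suffix=".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
    set --    # reuse the positional parameters as the ordered file list
    chr=1
    while [ "$chr" -le 22 ]; do
        set -- "$@" "ALL.chr${chr}${suffix}"
        chr=$((chr + 1))
    done
    # -Ou streams uncompressed BCF between the two steps; -Ob writes compressed BCF.
    # --force-samples turns IDs missing from the VCF into warnings instead of errors.
    bcftools concat -Ou "$@" |
        bcftools view -S "$samples" --force-samples -Ob -o "$out" -
}
```

Usage would look like concat_and_subset eur_samples.txt ALL.EUR.phase3.bcf. Building the list with an explicit counter keeps chr2 ahead of chr10, which a plain shell glob would not.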

Reply written 14 months ago by Kevin Blighe