1000 Genomes phase 3 sample numbers do not add up
4.5 years ago

Hi all!

tl;dr: I am confused about the origin of the 1000 Genomes phase 3 ALL-vcf-file. Jump to the last paragraph if you are only interested in the question without the background.

I tried to download the phase 3 vcf-files for a particular population of the 1000 Genomes Project. Let us assume I am interested in the Utah residents of European ancestry (CEU/CEPH). Since I want whole chromosomes and not just the regions the Data Slicer allows, I downloaded the ALL-vcf-files (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz), which include all samples, and then filtered them with a list of the samples belonging to the CEPH population. I obtained this list by going to https://www.internationalgenome.org/data-portal/sample, choosing the CEPH population, clicking "Download the list", and using the first column (the sample names).
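For reference, the filtering step described above can be sketched roughly as follows. This is only a minimal illustration, not necessarily the exact procedure used: the function name `subset_vcf_lines` and the output file name in the usage comment are made up for the example, and the code only drops sample columns, so INFO fields such as AC/AN are left as they were in the full file.

```python
from typing import Iterable, Iterator, Set

# A VCF data line has 9 fixed columns before the per-sample genotype columns:
# CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, FORMAT
N_FIXED = 9


def subset_vcf_lines(lines: Iterable[str], keep: Set[str]) -> Iterator[str]:
    """Yield VCF lines restricted to the sample columns named in `keep`.

    Meta lines (starting with '##') pass through unchanged. The '#CHROM'
    header line decides which column indices are kept for all data lines.
    Sample names in `keep` that are absent from the header are ignored.
    """
    cols = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        if line.startswith("##"):
            yield line
            continue
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            cols = list(range(N_FIXED)) + [
                i for i, name in enumerate(fields)
                if i >= N_FIXED and name in keep
            ]
        elif cols is None:
            raise ValueError("data line seen before the #CHROM header line")
        yield "\t".join(fields[i] for i in cols)


# Hypothetical usage (the output file name is an assumption):
#
#   import gzip
#   keep = {"NA06984", "NA06985"}  # sample names from the portal list
#   src = "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
#   with gzip.open(src, "rt") as fin, open("CEU.chr22.vcf", "w") as fout:
#       for out in subset_vcf_lines(fin, keep):
#           fout.write(out + "\n")
```

Because names missing from the header are silently ignored, the output can contain fewer samples than the list, which is how the mismatch described next shows up.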

Unfortunately, the ALL-vcf-file does not include all CEPH samples of Phase 3, even though its file name literally says phase3. The list of CEPH samples contains 183 Phase 3 samples, only 99 of which are present in the ALL-vcf-file (chromosome 22, for reference). The same happens with the Toscani population: the site lists 112 Phase 3 samples, but only 107 of them are included in the Phase 3 ALL-vcf-file. Interestingly, the sample counts in the ALL-vcf-file match the number of samples that the site lists not under Phase 3 but under "30x GRCh38", which brings me to the actual question:
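A count like the ones above can be reproduced with a small script along these lines. It is a sketch under assumptions: the helper names are invented for the example, the portal download is assumed to be tab-separated with one header row and the sample name in the first column, and the file name `igsr_samples.tsv` in the usage comment is hypothetical.

```python
import csv
from typing import Iterable, List, Tuple


def vcf_header_samples(lines: Iterable[str]) -> List[str]:
    """Return the sample names from the '#CHROM' header line of a VCF."""
    for line in lines:
        if line.startswith("#CHROM"):
            # Columns 0-8 are the fixed VCF fields; the rest are samples.
            return line.rstrip("\n").split("\t")[9:]
        if not line.startswith("#"):
            break  # reached data lines without finding a header
    raise ValueError("no #CHROM header line found")


def first_column(lines: Iterable[str], has_header: bool = True) -> List[str]:
    """Return the first column of a tab-separated sample list."""
    rows = csv.reader(lines, delimiter="\t")
    if has_header:
        next(rows, None)  # skip the column-name row
    return [row[0] for row in rows if row and row[0]]


def compare_samples(
    portal: Iterable[str], vcf: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Split the portal sample names into (present in VCF, missing from VCF)."""
    in_vcf = set(vcf)
    wanted = set(portal)
    return sorted(wanted & in_vcf), sorted(wanted - in_vcf)


# Hypothetical usage (the list file name is an assumption):
#
#   import gzip
#   src = "ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
#   with gzip.open(src, "rt") as fh:
#       vcf = vcf_header_samples(fh)
#   with open("igsr_samples.tsv", newline="") as fh:
#       portal = first_column(fh)
#   present, missing = compare_samples(portal, vcf)
#   print(len(portal), len(present), len(missing))
```

Only the header is read from the VCF, so this check does not need to decompress the whole chromosome file.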

Are the samples in ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz from Phase 3 or from "30x GRCh38"? I thought that Phase 3 was based on GRCh37.p13 (as stated in the Data Slicer (http://grch37.ensembl.org/Homo_sapiens/Tools/DataSlicer?db=core)), or am I mistaken here? If they really are from Phase 3, why are samples missing? And if not, where is the actual Phase 3 data? It is all pretty confusing.

I hope you guys can help me out, thanks in advance.

Cheers, Florin

genome 1000genomes vcf population