Question: what is the most complete vcf file of population allele frequencies that can be built/downloaded from public datasets?
5 months ago, 14134125465346445 (United Kingdom) wrote:

What is the most complete vcf file of population allele frequencies that can be built/downloaded from public datasets nowadays?

About five years ago, the answer was the latest release of the 12 CSHL HapMap populations that were part of the 1000 Genomes Project. These were:

CHB CHD MEX GIH TSI LWK ASW MKK HCB JPT CEU YRI

For example, the Ensembl project currently has these 12 populations available as VCF files here:

http://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/

CSHL-HAPMAP-HAPMAP-CHB.vcf.gz
CSHL-HAPMAP-HAPMAP-CHD.vcf.gz
CSHL-HAPMAP-HAPMAP-MEX.vcf.gz
CSHL-HAPMAP-HAPMAP-GIH.vcf.gz
CSHL-HAPMAP-HAPMAP-TSI.vcf.gz
CSHL-HAPMAP-HAPMAP-LWK.vcf.gz
CSHL-HAPMAP-HAPMAP-ASW.vcf.gz
CSHL-HAPMAP-HAPMAP-MKK.vcf.gz
CSHL-HAPMAP-HapMap-HCB.vcf.gz
CSHL-HAPMAP-HapMap-JPT.vcf.gz
CSHL-HAPMAP-HapMap-CEU.vcf.gz
CSHL-HAPMAP-HapMap-YRI.vcf.gz

Each SNP has an AF entry, so a single multi-population VCF with rsIDs, alleles and frequencies can be built from these files, with one frequency field per population (AF_CHB, AF_CHD, etc.).
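As a rough illustration of the merge described above, here is a minimal Python sketch. It assumes each per-population file is a standard VCF whose INFO column contains an `AF=` entry, and that the same variant carries the same chromosome, position, rsID and alleles in every file. The function names (`read_population_afs`, `merge_population_afs`, `open_vcf`) are made up for this example, and the output is a plain dictionary rather than a finished VCF.

```python
import gzip
from collections import defaultdict


def open_vcf(path):
    """Open a gzip-compressed VCF as text (hypothetical helper)."""
    return gzip.open(path, "rt")


def read_population_afs(lines):
    """Yield ((chrom, pos, rsid, ref, alt), af) for records with an AF entry.

    Assumes AF is stored as AF=<value> in the INFO column (column 8).
    The value is kept as a string, so a comma-separated AF for a
    multi-allelic record is passed through unchanged.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        chrom, pos, rsid, ref, alt = fields[:5]
        for item in fields[7].split(";"):
            key, sep, value = item.partition("=")
            if key == "AF" and sep:
                yield (chrom, pos, rsid, ref, alt), value
                break


def merge_population_afs(sources):
    """Merge per-population AFs into {variant_key: {"AF_<POP>": af}}.

    `sources` maps a population code (e.g. "CHB") to an iterable of VCF lines.
    """
    merged = defaultdict(dict)
    for pop, lines in sources.items():
        for key, af in read_population_afs(lines):
            merged[key]["AF_" + pop] = af
    return merged


if __name__ == "__main__":
    # Tiny made-up records, only to show the shape of the result.
    chb = ["#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
           "1\t100\trs1\tA\tG\t.\t.\tAF=0.25\n"]
    jpt = ["1\t100\trs1\tA\tG\t.\t.\tAC=3;AF=0.5\n"]
    for key, afs in merge_population_afs({"CHB": chb, "JPT": jpt}).items():
        info = ";".join(f"{name}={value}" for name, value in sorted(afs.items()))
        print("\t".join(key), info, sep="\t")
```

With real files, each value in `sources` would be something like `open_vcf("CSHL-HAPMAP-HAPMAP-CHB.vcf.gz")`. Variants missing from a population simply lack that `AF_<POP>` key, and holding everything in memory is only reasonable for files of modest size.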

The 1000 Genomes Project population documentation describes more than these 12 populations, but for the remaining ones I haven't seen equivalent per-population VCFs with AFs computed from the individuals within each population. I have only seen them for the original HapMap 12, marked with a * below:

### Populations and codes

*  CHB  Han Chinese           Han Chinese in Beijing, China
*  JPT  Japanese              Japanese in Tokyo, Japan
   CHS  Southern Han Chinese  Han Chinese South
   CDX  Dai Chinese           Chinese Dai in Xishuangbanna, China
   KHV  Kinh Vietnamese       Kinh in Ho Chi Minh City, Vietnam
*  CHD  Denver Chinese        Chinese in Denver, Colorado (pilot 3 only)

*  CEU  CEPH                  Utah residents (CEPH) with Northern and Western European ancestry
*  TSI  Tuscan                Toscani in Italia
   GBR  British               British in England and Scotland
   FIN  Finnish               Finnish in Finland
   IBS  Spanish               Iberian populations in Spain

*  YRI  Yoruba                Yoruba in Ibadan, Nigeria
*  LWK  Luhya                 Luhya in Webuye, Kenya
   GWD  Gambian               Gambian in Western Division, The Gambia
   MSL  Mende                 Mende in Sierra Leone
   ESN  Esan                  Esan in Nigeria

*  ASW  African-American SW   African Ancestry in Southwest US
   ACB  African-Caribbean     African Caribbean in Barbados
   MXL  Mexican-American      Mexican Ancestry in Los Angeles, California
   PUR  Puerto Rican          Puerto Rican in Puerto Rico
   CLM  Colombian             Colombian in Medellin, Colombia
   PEL  Peruvian              Peruvian in Lima, Peru

*  GIH  Gujarati              Gujarati Indian in Houston, TX
   PJL  Punjabi               Punjabi in Lahore, Pakistan
   BEB  Bengali               Bengali in Bangladesh
   STU  Sri Lankan            Sri Lankan Tamil in the UK
   ITU  Indian                Indian Telugu in the UK

I presume more public data is available nowadays from which a more complete VCF of per-population allele frequencies could be built, covering as many populations as possible, for use as a reference dataset alongside new data.

Thanks in advance.


Hello 14134125465346445!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/8838

This is typically not recommended as it runs the risk of annoying people in both communities.

Comment written 5 months ago by Pierre Lindenbaum
5 months ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

Try gnomAD? https://gnomad.broadinstitute.org/downloads
