What is the most complete VCF file of per-population allele frequencies that can be built or downloaded from public datasets today?
About five years ago, the answer was the latest release of the 12 CSHL HapMap populations that were part of the 1000 Genomes Project. These were:
CHB CHD MEX GIH TSI LWK ASW MKK HCB JPT CEU YRI
For example, the Ensembl project currently provides these 12 populations as VCF files here:
http://ftp.ensembl.org/pub/current_variation/vcf/homo_sapiens/

* CSHL-HAPMAP-HAPMAP-CHB.vcf.gz
* CSHL-HAPMAP-HAPMAP-CHD.vcf.gz
* CSHL-HAPMAP-HAPMAP-MEX.vcf.gz
* CSHL-HAPMAP-HAPMAP-GIH.vcf.gz
* CSHL-HAPMAP-HAPMAP-TSI.vcf.gz
* CSHL-HAPMAP-HAPMAP-LWK.vcf.gz
* CSHL-HAPMAP-HAPMAP-ASW.vcf.gz
* CSHL-HAPMAP-HAPMAP-MKK.vcf.gz
* CSHL-HAPMAP-HapMap-HCB.vcf.gz
* CSHL-HAPMAP-HapMap-JPT.vcf.gz
* CSHL-HAPMAP-HapMap-CEU.vcf.gz
* CSHL-HAPMAP-HapMap-YRI.vcf.gz
In each of these files, every SNP has an AF entry. From these, a single multi-population VCF with rsIDs, alleles and frequencies can be built, with one INFO field per population named AF_CHB, AF_CHD, etc.
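To make the merge concrete, here is a minimal sketch of how I would combine the per-population files. It assumes each file is a sites-only VCF whose INFO column carries a plain `AF=` key, and it matches records on chromosome, position, rsID, REF and ALT. The function names (`open_vcf`, `read_af`, `merge_population_afs`, `to_vcf_lines`) are my own, not part of any library.

```python
import gzip
from collections import defaultdict


def open_vcf(path):
    """Open a plain or gzip-compressed VCF as text."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)


def read_af(lines):
    """Yield ((chrom, pos, rsid, ref, alt), af_string) for each record with an AF key."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, rsid, ref, alt, _qual, _filt, info = line.rstrip("\n").split("\t")[:8]
        fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
        if "AF" in fields:
            # AF is kept as a string so multi-allelic values like "0.1,0.2" survive
            yield (chrom, int(pos), rsid, ref, alt), fields["AF"]


def merge_population_afs(pop_to_lines):
    """Map each variant key to {"AF_<POP>": af} across all populations."""
    merged = defaultdict(dict)
    for pop, lines in pop_to_lines.items():
        for key, af in read_af(lines):
            merged[key][f"AF_{pop}"] = af
    return merged


def to_vcf_lines(merged, pops):
    """Render the merged table as sites-only VCF lines (no trailing newlines)."""
    yield "##fileformat=VCFv4.2"
    for pop in pops:
        yield (f'##INFO=<ID=AF_{pop},Number=A,Type=Float,'
               f'Description="Allele frequency in {pop}">')
    yield "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO"
    # Note: chromosomes sort as strings here ("10" before "2")
    for key in sorted(merged):
        chrom, pos, rsid, ref, alt = key
        info = ";".join(f"{k}={v}" for k, v in sorted(merged[key].items()))
        yield f"{chrom}\t{pos}\t{rsid}\t{ref}\t{alt}\t.\t.\t{info}"
```

Usage would be something like `merge_population_afs({"CHB": open_vcf("CSHL-HAPMAP-HAPMAP-CHB.vcf.gz"), ...})` followed by writing out `to_vcf_lines(...)`. A SNP missing from one population simply has no `AF_<POP>` key for it, and a SNP whose REF/ALT are represented differently between files ends up as separate rows, so allele harmonisation would still need to be done beforehand.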
The 1000 Genomes Project population documentation describes more than these 12 populations, but for the remaining ones I haven't seen equivalent per-population VCFs with AFs computed from the individuals within each population. The only ones I have found are the original HapMap 12, marked with an asterisk (*) in the list below.
### Populations and codes

| HapMap | Code | Population | Description |
|---|---|---|---|
| * | CHB | Han Chinese | Han Chinese in Beijing, China |
| * | JPT | Japanese | Japanese in Tokyo, Japan |
| | CHS | Southern Han Chinese | Han Chinese South |
| | CDX | Dai Chinese | Chinese Dai in Xishuangbanna, China |
| | KHV | Kinh Vietnamese | Kinh in Ho Chi Minh City, Vietnam |
| * | CHD | Denver Chinese | Chinese in Denver, Colorado (pilot 3 only) |
| | CEU | CEPH | Utah residents (CEPH) with Northern and Western European ancestry |
| * | TSI | Tuscan | Toscani in Italia |
| | GBR | British | British in England and Scotland |
| | FIN | Finnish | Finnish in Finland |
| | IBS | Spanish | Iberian populations in Spain |
| * | YRI | Yoruba | Yoruba in Ibadan, Nigeria |
| * | LWK | Luhya | Luhya in Webuye, Kenya |
| | GWD | Gambian | Gambian in Western Division, The Gambia |
| | MSL | Mende | Mende in Sierra Leone |
| | ESN | Esan | Esan in Nigeria |
| * | ASW | African-American SW | African Ancestry in Southwest US |
| | ACB | African-Caribbean | African Caribbean in Barbados |
| | MXL | Mexican-American | Mexican Ancestry in Los Angeles, California |
| | PUR | Puerto Rican | Puerto Rican in Puerto Rico |
| | CLM | Colombian | Colombian in Medellin, Colombia |
| | PEL | Peruvian | Peruvian in Lima, Peru |
| * | GIH | Gujarati | Gujarati Indian in Houston, TX |
| | PJL | Punjabi | Punjabi in Lahore, Pakistan |
| | BEB | Bengali | Bengali in Bangladesh |
| | STU | Sri Lankan | Sri Lankan Tamil in the UK |
| | ITU | Indian | Indian Telugu in the UK |
I presume there is now more public data that can be used to build a more complete VCF file of per-population allele frequencies, covering as many populations as possible, to serve as a reference dataset when analysing new data.
Thanks in advance.