I need help to download some SNP data from HapMap
5.0 years ago

I am totally new to bioinformatics and would like to apply my knowledge of feature selection to the tag SNP selection problem.

To that end, I have read many papers and books to understand the main concepts and how the algorithms for this task work.

However, it has been very hard for me to understand how the huge amount of genomic data available online is organized: HapMap, 1000 Genomes, dbSNP and so on. I have downloaded gigabytes of data from different sources, but I really do not know what to do with them.


To be more specific in my question:

First, I would like to run a genetic algorithm (GA) to find tag SNPs based on a simple linkage disequilibrium coverage measure. The authors of this GA used the following data:

Three published SNP data sets were downloaded from HapMap (http://www.hapmap.org/) (NOT AVAILABLE ANYMORE) for evaluating the prediction accuracies achievable by our method:

• ENr112. 343 SNPs genotyped in population ASW (African ancestry in Southwest USA) on chromosome 2 from position 51512208 to 52012208. The genotypes of 83 individuals were used.

• ENr113. 380 SNPs genotyped in population ASW on chromosome 4 from position 118466103 to 118966103. The genotypes of 83 individuals were used.

• ENm013. 683 SNPs genotyped in population ASW on chromosome 7 from position 89621624 to 90736048. The genotypes of 83 individuals were used.

Besides that, the authors stated that

We represent a genotype g_i of length L by a sequence over {0, 1, 2}^L (since we assume only bi-allelic SNPs). 0 and 1 stand for the minor and major homozygous types respectively, meaning that the two alleles at that SNP locus are either 0 or 1; the major and minor alleles are determined by the frequency at which each allele appears in the population. A value of 2 denotes the heterozygous type for an SNP having two different alleles.
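For illustration, the encoding quoted above could be produced from raw allele calls roughly as follows. This is only a minimal sketch, not the authors' code: the function name `encode_snp` and the input layout (one two-letter genotype string per individual, for a single SNP) are assumptions, and missing calls are not handled.

```python
from collections import Counter

def encode_snp(genotypes):
    """Encode one bi-allelic SNP as 0 (minor homozygous),
    1 (major homozygous) or 2 (heterozygous).

    `genotypes` is a list of two-letter strings such as "AG",
    one per individual (hypothetical input layout).
    """
    # Count every allele across all individuals to find the major allele.
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) > 2:
        raise ValueError("SNP is not bi-allelic")
    major = counts.most_common(1)[0][0]

    codes = []
    for first, second in genotypes:
        if first != second:
            codes.append(2)   # heterozygous
        elif first == major:
            codes.append(1)   # homozygous for the major allele
        else:
            codes.append(0)   # homozygous for the minor allele
    return codes

print(encode_snp(["AA", "AG", "GG", "AA", "GA"]))  # [1, 2, 0, 1, 2]
```

Note that major and minor are decided here from the individuals passed in, so the result depends on which population sample is used.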

And also

To obtain phasing of genotypes, we used the GEVALT algorithm.

I believe they obtained the aforementioned data in genotype format (something with only 0, 1 and 2 values) and then phased it before applying it to the problem. The image below is an example of this process.

[Figure from the paper]


  1. How and where can I download these data?
  2. I know the HapMap website is no longer available due to a security issue, but the FTP site is still there. However, its folder structure is really hard to understand. Can someone guide me through downloading the data from there?

I hope someone can help me. Thanks!

5.0 years ago

Yes, as you mentioned, there was a security issue with NCBI's HapMap website, so it was taken down in 2016:

NCBI retiring HapMap Resource

June 16, 2016

A recent computer security audit has revealed security flaws in the legacy HapMap site that require NCBI to take it down immediately.

[source: https://www.ncbi.nlm.nih.gov/variation/news/NCBI_retiring_HapMap/]


One thing to note is that sequencing was not part of the HapMap projects: the samples were genotyped on Affymetrix microarrays. Sequencing was definitely part of the 1000 Genomes data, though.

I'm unsure why you have to rigidly adhere to the methods in that published manuscript (?). There are many ways to determine tagging SNPs. See my recent answer here: A: Tag SNPs - how to easily and effectively select them

If I were you, I would obtain the HapMap3 data in PLINK format, which is available from HERE. Then, following the first few steps of my tutorial (here: Produce PCA bi-plot for 1000 Genomes Phase III in VCF format ), you could also get the 1000 Genomes Phase III data in PLINK format. After that, you just merge the datasets and calculate tagging SNPs or output the data in a format like HaploView.
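As a rough idea of what "calculating tagging SNPs" builds on, here is a minimal sketch of the standard pairwise r² linkage disequilibrium statistic between two SNPs, computed from phased haplotypes. It is an assumption that this is the measure you want (the coverage measure in the manuscript may differ), and the function name `r_squared` and the 0/1 allele coding are only illustrative.

```python
def r_squared(hap_a, hap_b):
    """Pairwise r^2 between two bi-allelic SNPs.

    hap_a and hap_b hold the allele (0 or 1) carried at each SNP
    by the same phased haplotypes, in the same order.
    """
    n = len(hap_a)
    if n == 0 or n != len(hap_b):
        raise ValueError("need two non-empty haplotype lists of equal length")

    p_a = sum(hap_a) / n   # frequency of allele 1 at SNP A
    p_b = sum(hap_b) / n   # frequency of allele 1 at SNP B
    # Frequency of haplotypes carrying allele 1 at both SNPs.
    p_ab = sum(1 for a, b in zip(hap_a, hap_b) if a == 1 and b == 1) / n

    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0         # a monomorphic SNP: r^2 is undefined, report 0
    d = p_ab - p_a * p_b   # the LD coefficient D
    return d * d / denom

print(r_squared([0, 1, 0, 1], [0, 1, 0, 1]))  # perfectly correlated SNPs
print(r_squared([0, 0, 1, 1], [0, 1, 0, 1]))  # independent SNPs
```

A tagging approach would then typically keep a SNP as a tag if it has r² above some threshold (0.8 is a common choice) with the SNPs that it is meant to represent.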



Kevin, thanks for your answer! I'm interested in bio-inspired algorithms applied to the tag SNP selection problem. I cited this manuscript as an example since the others I have reviewed use basically the same input data format (something with 0, 1 and 2s). Do you know if I can find genotype data in this format, or if I have to download a VCF (for example) and infer the homozygous and heterozygous values?

Even if I download the data in VCF, PLINK or other formats as you suggested, I do not know how to filter them to a specific population and position. Also, most of the papers I've read consider the ENCODE regions from HapMap (ENm013, ENr113...), but I cannot find them to download.


It seems that you will just have to treat this as a big learning exercise. Obtaining and manipulating data into formats that you want for your analysis is a big part of bioinformatics. You will very rarely find data in the exact format that you want.

I would still go the PLINK route and, whilst I know all of the steps in order to merge data with PLINK, it's just something that you will have to learn. There are many posts on Biostars on these topics (just search via a search engine).

If you're looking at a published manuscript and it's in a reputable journal, then everything should be clearly stated in the methods. For example, searching for 1 minute, I was able to find this page, which has your ENCODE region co-ordinates: https://genome.ucsc.edu/ENCODE/regions.html

The middle co-ordinates for each are hg38. If you need to use hg19, then use the UCSC LiftOver tool.
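Once you have the co-ordinates, restricting a dataset to a region is mostly a matter of comparing positions. A minimal sketch, assuming PLINK `.bim` lines (chromosome, variant ID, genetic distance, base-pair position, allele 1, allele 2) and the three regions exactly as quoted in your question; the variant IDs in the example are made up, and you would need to confirm which genome build those positions refer to before using them.

```python
# Regions as quoted in the question (genome build is an assumption to verify).
REGIONS = {
    "ENr112": ("2", 51512208, 52012208),
    "ENr113": ("4", 118466103, 118966103),
    "ENm013": ("7", 89621624, 90736048),
}

def snps_in_region(bim_lines, chrom, start, end):
    """Return the IDs of variants on `chrom` with start <= position <= end."""
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        # .bim columns: 0 = chromosome, 1 = variant ID, 3 = base-pair position
        if fields[0] == chrom and start <= int(fields[3]) <= end:
            ids.append(fields[1])
    return ids

# Hypothetical .bim content, for illustration only.
example_bim = [
    "2 snpA 0 51512207 A G",
    "2 snpB 0 51600000 C T",
    "2 snpC 0 52012208 G A",
    "4 snpD 0 51600000 T C",
]
chrom, start, end = REGIONS["ENr112"]
print(snps_in_region(example_bim, chrom, start, end))  # ['snpB', 'snpC']
```

The resulting list of IDs can be written to a file and passed to PLINK to extract just those variants. Check how chromosomes are coded in your own file ("2" versus "chr2") before comparing.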

Trust that this helps

