I am totally new to bioinformatics and I would like to apply my knowledge of feature selection to the tag SNP selection problem.
To do that, I've read a lot of papers and books in order to understand the main concepts and how the algorithms for this task work.
However, it has been very hard for me to understand how the huge amount of genomic data available on the internet is organized: HapMap, 1000 Genomes, dbSNP and so on. I downloaded gigabytes of data from different sources, but I really do not know what to do with them.
--
To be more specific in my question:
First, I would like to run a Genetic Algorithm (GA) to find tag SNPs based on a simple linkage disequilibrium coverage measure. The authors of this GA used the following data:
Three published SNP data sets were downloaded from HapMap (http://www.hapmap.org/) (NOT AVAILABLE ANYMORE) for evaluating the prediction accuracies achievable by our method:
• ENr112. 343 SNPs genotyped in population ASW (African ancestry in Southwest USA) on chromosome 2 from position 51512208 to 52012208. The genotypes of 83 individuals were used.
• ENr113. 380 SNPs genotyped in population ASW on chromosome 4 from position 118466103 to 118966103. The genotypes of 83 individuals were used.
• ENm013. 683 SNPs genotyped in population ASW on chromosome 7 from position 89621624 to 90736048. The genotypes of 83 individuals were used.
Besides that, the authors stated that
We represent a genotype g_i of length L by a sequence over {0, 1, 2}^L (since we assume only bi-allelic SNPs). 0 and 1 stand for the minor and major homozygous types respectively, meaning that the two alleles at that SNP locus are either 0 or 1; the major and minor alleles are determined by the frequency at which each allele appears in the population. A value of 2 denotes the heterozygous type for an SNP having two different alleles.
And also
To obtain phasing of genotypes, we used the GEVALT algorithm.
I believe they obtained the aforementioned data in genotype format (something with 0, 1 and 2 values only) and then phased it before applying it to the problem. The image below is an example of this process.
So
- How and where can I download these data?
- I know the HapMap website is not available due to a security issue and that the FTP site still exists, but the FTP folders are really hard to understand. Can someone guide me through downloading the data from there?
I hope someone can help me. Thanks!
Kevin, thanks for your answer! I'm interested in bio-inspired algorithms applied to the tag SNP selection problem. I cited this manuscript as an example since the others I have reviewed use basically the same input data format (something with 0s, 1s and 2s). Do you know if I can find genotype data in this format, or do I have to download a VCF (for example) and infer the homozygous and heterozygous values?
Even if I download the data in VCF, PLINK or other formats as you suggested, I do not know how to filter them to a specific population and position. Also, most of the papers I've read consider the ENCODE regions from HapMap (ENm013, ENr113...), but I cannot find them to download.
It seems that you will just have to treat this as a big learning exercise. Obtaining and manipulating data into formats that you want for your analysis is a big part of bioinformatics. You will very rarely find data in the exact format that you want.
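To give an idea of the kind of conversion involved, here is a minimal Python sketch (not the authors' code) that turns VCF-style genotype strings for one bi-allelic SNP into the 0/1/2 scheme quoted in the question: 0 for homozygous minor, 1 for homozygous major, 2 for heterozygous. The function name `encode_snp` is made up for illustration, and the sketch assumes the major allele is simply the most frequent one among the individuals passed in, with missing calls returned as `None`.

```python
from collections import Counter


def encode_snp(genotypes):
    """Encode one bi-allelic SNP across individuals.

    genotypes: VCF-style GT strings such as "0/1" or "1|1".
    Returns a list where 0 = homozygous minor, 1 = homozygous major,
    2 = heterozygous and None = missing call.
    """
    # Split each genotype into its two alleles, ignoring phase.
    pairs = [gt.replace("|", "/").split("/") for gt in genotypes]

    # Count alleles over the called genotypes to find the major allele.
    # Assumption: a tie is broken arbitrarily (first allele seen).
    counts = Counter(a for pair in pairs for a in pair if a != ".")
    if not counts:
        return [None] * len(pairs)
    major = counts.most_common(1)[0][0]

    encoded = []
    for pair in pairs:
        if len(pair) != 2 or "." in pair:
            encoded.append(None)   # missing or non-diploid call
        elif pair[0] != pair[1]:
            encoded.append(2)      # heterozygous
        elif pair[0] == major:
            encoded.append(1)      # homozygous for the major allele
        else:
            encoded.append(0)      # homozygous for the minor allele
    return encoded
```

For instance, `encode_snp(["0/0", "0/1", "1/1"])` would be called once per SNP, using the genotype column entries of the individuals in your population of interest.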
I would still go the PLINK route and, whilst I know all of the steps in order to merge data with PLINK, it's just something that you will have to learn. There are many posts on Biostars on these topics (just search via a search engine).
If you're looking at a published manuscript and it's in a reputable journal, then everything should be clearly stated in the methods. For example, searching for 1 minute, I was able to find this page, which has your ENCODE region co-ordinates: https://genome.ucsc.edu/ENCODE/regions.html
The middle co-ordinates for each are hg38. If you need to use hg19, then use the UCSC LiftOver tool.
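Once you have co-ordinates in the same genome build as your genotype file, restricting the data to one region and one population is mostly bookkeeping. Below is a rough Python sketch for an uncompressed, tab-separated VCF; it is only an illustration under several assumptions. The function name `filter_vcf` is hypothetical, you would have to supply the population's sample IDs yourself (for example, from a sample panel file), and the chromosome name must match the file's own naming ("2" versus "chr2").

```python
def filter_vcf(lines, chrom, start, end, keep_samples):
    """Yield (position, variant_id, genotypes) for records in
    chrom:start-end, restricted to the samples in keep_samples."""
    columns = None
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            # Sample names start at the 10th column of the header line.
            columns = [i for i, name in enumerate(fields)
                       if i >= 9 and name in keep_samples]
            continue
        if columns is None:
            raise ValueError("no #CHROM header line before the first record")
        pos = int(fields[1])
        if fields[0] != chrom or not (start <= pos <= end):
            continue
        # When GT is present it is the first colon-separated entry
        # of each sample field.
        yield pos, fields[2], [fields[i].split(":")[0] for i in columns]
```

With the ENr112 interval from the question, a call would look like `filter_vcf(open_file, "2", 51512208, 52012208, asw_ids)`, where `open_file` and `asw_ids` are placeholders for your own file handle and set of sample IDs. The interval is only meaningful if it is given in the build that the VCF uses.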
Trust that this helps