Question: Dataset With SNPs Linked To Phenotype
9.5 years ago by
Trondheim, Norway
I'm looking for possible topics for my master's thesis in data mining. Because I'm also interested in bioinformatics, I was thinking about doing a GWAS. The problem is that I'm not very familiar with the databases that are available to bioinformaticians on the web.

My question is: where could I get a dataset of SNPs that are linked to some kind of phenotype? I would like to do a whole-genome study, not just one on a sample chromosome, so it would be much better if I could get SNPs for a smaller/simpler organism.

Thanks for any input.

best regards, Rok

gwas snp dataset • 7.0k views
9.5 years ago by
Copenhagen, Denmark
A few pointers to where you could start searching:

However, these are databases of digested GWAS results, which may not be the best starting point for data mining (depending on what you plan to do). Getting access to the actual raw data from human genome-wide association studies will be hard due to strict privacy rules.

9.5 years ago by
Boston, MA USA
Regardless of which organism and which set of phenotypes you choose, you need to look at what Andrew Johnson (of the Framingham Heart Study) did. His paper is here. I know the paper is a bit old (we're working with him in a minor way to update this work), but it is still extremely useful. In the work, he mined GWAS data for all kinds of signals below genome-wide thresholds of significance and collected all kinds of information that was curated manually. Signals at the same locus for the same or very nearly identical phenotypes in two different studies may not reach significance individually but could in a meta-analysis. Thus, his work allows one, for example, to focus where that meta-analysis could/should be done.

So, look to see what he did. Read the paper to see how he did that.
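To make the meta-analysis point concrete, here is a minimal sketch of a fixed-effect, inverse-variance-weighted meta-analysis for one SNP seen in two studies. This is a generic textbook method, not necessarily what Johnson's paper uses, and the effect sizes and standard errors are made-up numbers for illustration only. The 5e-8 cutoff is the conventional genome-wide significance threshold.

```python
import math

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def two_sided_p(beta, se):
    """Two-sided p-value for a normally distributed effect estimate."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2.0))


def inverse_variance_meta(betas, ses):
    """Fixed-effect meta-analysis: weight each study by 1 / SE^2."""
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled_beta, pooled_se, two_sided_p(pooled_beta, pooled_se)


# Hypothetical per-study results for the same SNP and phenotype.
betas = [0.12, 0.11]
ses = [0.025, 0.025]

per_study_p = [two_sided_p(b, s) for b, s in zip(betas, ses)]
pooled_beta, pooled_se, pooled_p = inverse_variance_meta(betas, ses)

for i, p in enumerate(per_study_p, start=1):
    print(f"study {i}: p = {p:.2e}, significant = {p < GENOME_WIDE_ALPHA}")
print(f"pooled:  p = {pooled_p:.2e}, significant = {pooled_p < GENOME_WIDE_ALPHA}")
```

With these invented inputs neither study passes the threshold alone, but the pooled estimate does, which is the situation described above. A real analysis would also need to check that both studies report the effect for the same allele and that the studies are not too heterogeneous.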

Now, in terms of a thesis project: building a database as you describe may not be enough. I would think long and hard about how you intend the data to be used. I feel that a strong database is one designed so that it is easily integrated with other forms of data, such as gene expression. Johnson's database was insufficient for our needs, which called for a disease-based approach to gene/SNP identification. So I used information and ideas from Robinson et al. (The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease) to build a hierarchical disease (i.e., phenotype) classification scheme that we superimposed onto the Johnson database. This simply serves as an idea of how you should design what you intend to build.

9.5 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
In terms of knowing where to search for information, I always find it very convenient to check the NAR 2010 Database Issue to keep track of current database efforts. By the way, there are other very interesting resources on the NAR homepage under the "THE JOURNAL" section on the right.

9.5 years ago by
Pune, India
A good database for this could be the Mouse Phenome Database at Jackson Labs. It includes a large number of phenotypic traits for inbred mouse strains as well as genomic data.

9.5 years ago by
London, UK
A good dataset is available from the Wellcome Trust Case Control Consortium, with data on thousands of individuals and associations with diseases like diabetes, tuberculosis, etc. However, to access the data, you have to make an official request and be supported by a Principal Investigator.

9.5 years ago by
United States
This is the FTP site for dbGaP public access data.

9.0 years ago by
I would start by reading GWAS papers to identify a recurrent problem instead of applying ML directly to a GWAS dataset. Personally, I have learned that research is more valuable if it answers questions that are causing problems for biologists. That said, you can go to the NHGRI website, where you will find curated SNP lists, and try to expand them using LD block analysis. There is also HuGENet, a collaborative effort that provides a list of SNPs related to phenotypes and traits.
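As a rough illustration of what expanding a curated SNP list by LD could look like, here is a small sketch. It assumes you already have genotypes coded as allele counts (0/1/2) per individual, and it approximates LD r^2 by the squared Pearson correlation of those counts. The SNP names, the genotype values, and the 0.8 cutoff are all invented for the example. Real LD block analysis is normally done with dedicated tools on phased reference panels.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (sxy * sxy) / (sxx * syy)


def expand_by_ld(lead_snps, genotypes, threshold=0.8):
    """For each lead SNP, list the other SNPs with r^2 at or above the threshold."""
    expanded = {}
    for lead in lead_snps:
        proxies = [
            snp
            for snp, calls in genotypes.items()
            if snp != lead and r_squared(genotypes[lead], calls) >= threshold
        ]
        expanded[lead] = sorted(proxies)
    return expanded


# Hypothetical genotypes: one list of allele counts per SNP, same individuals.
genotypes = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp_b": [0, 1, 2, 0, 1, 2, 0, 2],  # nearly identical to snp_a
    "snp_c": [2, 0, 1, 1, 0, 2, 1, 0],  # essentially uncorrelated with snp_a
}

proxies = expand_by_ld(["snp_a"], genotypes)
print(proxies)
```

In practice you would also restrict the comparison to SNPs within some physical distance of the lead SNP, since LD between distant markers is usually noise.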

Good luck, Radhouane

9.0 years ago by
I think the easiest and most readily accessible DB would be the Mouse Phenome Database at Jackson Labs.

Good luck, Ali

