Dataset With SNPs Linked To Phenotype
8
5
13.3 years ago
Rok ▴ 190

Hello!

I'm looking for possible topics for my master's thesis in data mining. Because I'm also interested in bioinformatics, I was thinking about doing a GWAS. The problem is that I'm not very familiar with the databases available to bioinformaticians on the web.

My question is: where could I get a dataset of SNPs that are linked, preferably, to some kind of phenotype? I would like to do a whole-genome study, not just a single chromosome, so it would be much better if I could get SNPs for a smaller/simpler organism.

Thanks for any input.

best regards, Rok

snp gwas dataset • 8.4k views
5
13.3 years ago

A few pointers to where you could start searching:

However, these are databases of digested GWAS results, which may not be the best starting point for data mining (depending on what you plan to do). Getting access to the actual raw data from human genome-wide association studies will be hard because of strict privacy rules.

5
13.3 years ago

Regardless of which organism and which set of phenotypes you choose, you need to look at what Andrew Johnson (of the Framingham Heart Study) did. His paper is here. I know the paper is a bit old (we're working with him in a minor way to update this work), but it is still extremely useful. In the work, he mined GWAS data for all kinds of signals below genome-wide thresholds of significance and collected all kinds of information that was curated manually. Signals at the same locus for the same or very nearly identical phenotypes in two different studies may not reach significance individually but could in a meta-analysis. Thus, his work allows one, for example, to focus where that meta-analysis could/should be done.
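To illustrate the point about sub-threshold signals, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis. It is not taken from Johnson's paper; the effect sizes and standard errors are made-up numbers, and 5e-8 is only the conventional genome-wide threshold. It shows how two studies that each miss the threshold can pass it once combined.

```python
import math

GENOME_WIDE_P = 5e-8  # conventional threshold, assumed here for illustration


def two_sided_p(beta, se):
    """Two-sided p-value for a normally distributed test statistic beta/se."""
    z = abs(beta / se)
    return math.erfc(z / math.sqrt(2))


def fixed_effect_meta(betas, ses):
    """Combine per-study effects using inverse-variance weights.

    Returns (pooled beta, pooled standard error, z-score, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    pooled_beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    pooled_se = math.sqrt(1.0 / sum(weights))
    z = pooled_beta / pooled_se
    return pooled_beta, pooled_se, z, two_sided_p(pooled_beta, pooled_se)


# Hypothetical numbers: the same SNP and phenotype in two independent studies.
betas = [0.10, 0.10]
ses = [0.02, 0.02]

single_p = two_sided_p(betas[0], ses[0])
combined = fixed_effect_meta(betas, ses)

print("single study p:", single_p, "significant:", single_p < GENOME_WIDE_P)
print("combined p:", combined[3], "significant:", combined[3] < GENOME_WIDE_P)
```

A fixed-effect model assumes the true effect is the same in both studies; if the studies are heterogeneous, a random-effects model would be the more cautious choice.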

So, look to see what he did. Read the paper to see how he did that.

Now, in terms of a thesis project: building a database as you describe may not be enough. I would think long and hard about how you intend the data to be used. I feel that a strong database is one designed so that it is easily integrated with other forms of data, such as gene expression. Johnson's database was insufficient for our needs of a disease-based approach to gene/SNP identification, so I used information and ideas from Robinson et al. (The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease) to build a hierarchical classification scheme for diseases (i.e., phenotypes) that we superimposed onto the Johnson database. This simply serves as an idea of how you should design what you intend to build.

2
13.3 years ago

In terms of knowing where to search for information, I always find it very convenient to check the NAR 2010 Database Issue to keep track of current database efforts. By the way, there are other very interesting resources on the NAR homepage under the "THE JOURNAL" section on the right.

2
13.3 years ago
Farhat ★ 2.9k

A good database for this could be the Mouse Phenome Database at Jackson Labs: http://phenome.jax.org/. It includes a large number of phenotypic traits for inbred mouse strains as well as genomic data.

1
13.3 years ago

A good dataset is available from the Wellcome Trust Case Control Consortium, with data on thousands of individuals and associations with diseases such as diabetes, tuberculosis, etc. However, to access the data, you have to make an official request and be supported by a Principal Investigator.

1
13.3 years ago
jvijai ★ 1.2k

This is the FTP site for dbGaP public-access data: ftp://ftp.ncbi.nlm.nih.gov/dbgap

0
12.8 years ago

I would read GWAS papers to identify a recurring problem instead of applying ML directly to a GWAS dataset. Personally, I have learned that research is more valuable if it answers questions that are causing problems for biologists. That said, you can go to the NHGRI website, where you will find curated SNP lists, and try to expand them using LD block analysis. There is also HuGENet, a collaborative effort that provides a list of SNPs related to phenotypes and traits.
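As a rough sketch of what expanding a curated SNP list by LD could look like: the code below computes r² between a lead SNP and its neighbours from genotype counts (0/1/2 copies of an allele) and keeps the neighbours above a cutoff. The SNP names, the genotypes, and the 0.8 cutoff are all hypothetical, and the squared Pearson correlation of unphased genotype counts is only an approximation of haplotype-based r².

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 coding)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return (sxy * sxy) / (sxx * syy)


def expand_by_ld(lead, genotypes, threshold=0.8):
    """Return the SNPs whose r^2 with the lead SNP is at least `threshold`."""
    lead_genotypes = genotypes[lead]
    return sorted(
        snp
        for snp, values in genotypes.items()
        if snp != lead and r_squared(lead_genotypes, values) >= threshold
    )


# Hypothetical genotypes for six individuals at three SNPs.
genotypes = {
    "snpA": [0, 1, 2, 0, 1, 2],  # lead SNP from a curated list
    "snpB": [0, 1, 2, 0, 1, 2],  # perfectly correlated with snpA
    "snpC": [2, 0, 1, 1, 2, 0],  # weakly correlated with snpA
}

proxies = expand_by_ld("snpA", genotypes)
print(proxies)
```

In practice the genotypes would come from a reference panel for the relevant population, and a dedicated tool would handle phasing and distance windows.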

Good luck, Radhouane

0
12.8 years ago
Ali • 0

I think the easiest and most readily accessible database would be the mouse database at Jackson Labs: http://phenome.jax.org/.

Good luck, Ali
