Question

Dataset With SNPs Linked To Phenotype

5

13.4 years ago

Rok ▴ 190

Hello!

I'm looking for possible things to do for my master's thesis in data mining. Because I'm also interested in bioinformatics, I was thinking about doing a GWAS. The problem is that I'm not very familiar with the databases that are available to bioinformaticians on the web.

My question is: where could I get a dataset of SNPs that are linked, preferably, to some kind of phenotype? I would like to do a whole-genome study, not just one on a sample chromosome, so it would be much better if it's possible to get SNPs for a smaller/simpler organism.

Thanks for any input.

best regards, Rok

snp gwas dataset • 8.4k views

Brad Chapman · Answer 1 · 2010-12-30

A few pointers to where you could start searching:

However, these are databases of digested GWAS results, which may not be the best starting point for data mining (depending on what you plan to do). Getting access to the actual raw data from human genome-wide association studies will be hard due to strict privacy rules.

score 5 · Answer 2 · 2010-12-30

Regardless of which organism and which set of phenotypes you choose, you need to look at what Andrew Johnson (of the Framingham Heart Study) did. His paper is here. I know the paper is a bit old (we're working with him in a minor way to update this work), but it is still extremely useful. In the work, he mined GWAS data for all kinds of signals below genome-wide thresholds of significance and collected all kinds of information that was curated manually. Signals at the same locus for the same or very nearly identical phenotypes in two different studies may not reach significance individually but could in a meta-analysis. Thus, his work allows one, for example, to focus where that meta-analysis could/should be done.
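To make the meta-analysis point concrete, here is a minimal sketch in Python of a fixed-effect, inverse-variance meta-analysis for one SNP. This is a generic textbook technique, not necessarily what Johnson's paper does, and the effect sizes and standard errors are made-up numbers chosen only for illustration. The threshold of 5e-8 is the conventional genome-wide significance cutoff.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # conventional genome-wide significance cutoff


def two_sided_p(z):
    """Two-sided p-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))


def fixed_effect_meta(studies):
    """Combine per-study (beta, standard_error) pairs for one SNP.

    Each study is weighted by the inverse of its variance.
    Returns (pooled_beta, pooled_se, p_value).
    """
    weights = [1.0 / se ** 2 for _, se in studies]
    total = sum(weights)
    pooled_beta = sum(w * beta for w, (beta, _) in zip(weights, studies)) / total
    pooled_se = math.sqrt(1.0 / total)
    return pooled_beta, pooled_se, two_sided_p(pooled_beta / pooled_se)


# Hypothetical effect estimates for the same SNP and phenotype in two studies.
studies = [(0.10, 0.02), (0.10, 0.02)]

single_p = [two_sided_p(beta / se) for beta, se in studies]
pooled_beta, pooled_se, pooled_p = fixed_effect_meta(studies)

for i, p in enumerate(single_p, start=1):
    print(f"study {i}: p = {p:.2e}, significant = {p < GENOME_WIDE_THRESHOLD}")
print(f"pooled:  p = {pooled_p:.2e}, significant = {pooled_p < GENOME_WIDE_THRESHOLD}")
```

With these invented inputs, neither study passes the threshold on its own, but the pooled estimate does, which is exactly the situation where a curated list of sub-threshold signals is useful.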

So, look to see what he did. Read the paper to see how he did that.

Now, in terms of a thesis project: building a database as you describe may not be enough. I would think long and hard about how you intend the data to be used. I feel that a strong database is one designed so that it integrates easily with other kinds of data, such as gene expression. Johnson's database was insufficient for our needs of a disease-based approach to gene/SNP identification, so I used information and ideas from Robinson et al. (The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease) to build a hierarchical disease (i.e., phenotype) classification scheme that we superimposed onto the Johnson database. This simply serves as an idea of how you should design what you intend to build.
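As a rough idea of what superimposing a phenotype hierarchy on a SNP table can look like, here is a small Python sketch. Everything in it is hypothetical: the term names, the parent links, and the SNP identifiers are placeholders, not entries from the Human Phenotype Ontology or the Johnson database. Terms may have several parents, so the hierarchy is stored as a mapping from each term to a list of parents.

```python
# Hypothetical phenotype hierarchy: term -> list of parent terms.
PARENTS = {
    "type 1 diabetes": ["diabetes mellitus"],
    "type 2 diabetes": ["diabetes mellitus"],
    "diabetes mellitus": ["metabolic disease"],
    "obesity": ["metabolic disease"],
    "metabolic disease": [],
}

# Hypothetical SNP-to-phenotype associations (placeholder identifiers).
ASSOCIATIONS = [
    ("snp_example_1", "type 1 diabetes"),
    ("snp_example_2", "type 2 diabetes"),
    ("snp_example_3", "obesity"),
]


def ancestors(term):
    """Return every term reachable by following parent links from `term`."""
    seen = set()
    stack = list(PARENTS.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(PARENTS.get(current, []))
    return seen


def snps_under(term):
    """SNPs annotated to `term` itself or to any of its descendants."""
    return sorted(
        snp
        for snp, phenotype in ASSOCIATIONS
        if phenotype == term or term in ancestors(phenotype)
    )


print(snps_under("diabetes mellitus"))
print(snps_under("metabolic disease"))
```

The benefit of the hierarchy is that a query at a broad disease category picks up SNPs that were only annotated to its more specific subtypes.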

score 2 · Answer 3 · 2010-12-30

13.4 years ago

Jorge Amigo 14k

In terms of knowing where to search for information, I always find it very convenient to check the NAR 2010 Database Issue to keep track of the databasing efforts around. By the way, there are other very interesting resources on NAR's homepage under the "THE JOURNAL" section at the right.

score 2 · Answer 4 · 2010-12-30

13.4 years ago

Farhat ★ 2.9k

A good database for this could be the Mouse Phenome Database at Jackson Labs: http://phenome.jax.org/. It includes a large number of phenotypic traits for inbred mouse strains, as well as genomic data.

score 1 · Answer 5 · 2010-12-30

13.4 years ago

Giovanni M Dall'Olio 28k

A good dataset is available from the Wellcome Trust Case Control Consortium, with data on thousands of individuals and associations with diseases like diabetes, tuberculosis, etc. However, to access the data, you have to make an official request and be supported by a Principal Investigator.

score 1 · Answer 6 · 2010-12-30

13.4 years ago

jvijai ★ 1.2k

This is the FTP site for dbGaP public-access data: ftp://ftp.ncbi.nlm.nih.gov/dbgap

score 0 · Answer 7 · 2011-06-28

I would read GWAS papers to find a recurring problem rather than applying ML directly to a GWAS dataset. Personally, I have learned that research is more valuable if it answers questions that are causing problems for biologists. That said, you can go to the NHGRI website, where you will find curated SNP lists, and try to expand them using LD block analysis. There is also HuGENet, a collaborative effort that provides a list of SNPs related to phenotypes and traits.
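For the LD expansion step, a minimal Python sketch follows. It is a simplification under stated assumptions: genotypes are coded as 0/1/2 allele dosages per individual, LD is approximated by the squared Pearson correlation between dosages (a common stand-in when phased haplotypes are not available), and the 0.8 cutoff, the SNP names, and the genotype values are all invented for illustration. A real analysis would normally use a dedicated tool and restrict comparisons to nearby SNPs.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length dosage vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a monomorphic SNP carries no correlation information
    return sxy * sxy / (sxx * syy)


def expand_by_ld(seed_snps, genotypes, threshold=0.8):
    """Add every SNP whose r^2 with any seed SNP meets the threshold."""
    expanded = set(seed_snps)
    for seed in seed_snps:
        for snp, dosages in genotypes.items():
            if snp not in expanded and r_squared(genotypes[seed], dosages) >= threshold:
                expanded.add(snp)
    return expanded


# Hypothetical dosages (0, 1 or 2 copies of the minor allele) for 8 individuals.
GENOTYPES = {
    "snp_a": [0, 1, 2, 0, 1, 2, 0, 1],
    "snp_b": [0, 1, 2, 0, 1, 2, 0, 2],  # differs from snp_a in one individual
    "snp_c": [2, 0, 1, 1, 0, 2, 1, 0],  # essentially uncorrelated with snp_a
}

print(sorted(expand_by_ld(["snp_a"], GENOTYPES)))
```

Starting from a curated seed list, this kind of expansion gives you the neighbouring SNPs that tag the same signal, which can then be carried into whatever mining step comes next.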

Good luck, Radhouane

score 0 · Answer 8 · 2011-07-01

12.9 years ago

Ali • 0

I think an easy and readily accessible DB would be the mouse database at Jackson Labs: http://phenome.jax.org/.

Good luck, Ali
