Question: Dataset With SNPs Linked To Phenotype
9.5 years ago by
Trondheim, Norway
Rok190 wrote:


I'm looking for possible topics for my master's thesis in data mining. Because I'm also interested in bioinformatics, I was thinking about doing a GWAS. The problem is that I'm not very familiar with the databases that are available to bioinformaticians on the web.

My question is: where could I get a dataset of SNPs linked to, preferably, some kind of phenotype? I would like to do a whole-genome study, not just one sample chromosome, so it would be much better if I could get SNPs from a smaller/simpler organism.

Thanks for any input.

Best regards, Rok

gwas snp dataset • 7.0k views
modified 9.5 years ago by Ali0 • written 9.5 years ago by Rok190
9.5 years ago by
Copenhagen, Denmark
Lars Juhl Jensen11k wrote:

A few pointers to where you could start searching:

However, these are databases of digested GWAS results, which may not be the best starting point for data mining (depending on what you plan to do). Getting access to the actual raw data from human genome-wide association studies will be hard because of strict privacy rules.

modified 9.5 years ago by Brad Chapman9.5k • written 9.5 years ago by Lars Juhl Jensen11k
9.5 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

Regardless of which organism and which set of phenotypes you choose, you need to look at what Andrew Johnson (of the Framingham Heart Study) did. His paper is here. I know the paper is a bit old (we're working with him in a minor way to update this work), but it is still extremely useful. In the work, he mined GWAS data for all kinds of signals below genome-wide thresholds of significance and collected all kinds of information that was curated manually. Signals at the same locus for the same or very nearly identical phenotypes in two different studies may not reach significance individually but could in a meta-analysis. Thus, his work allows one, for example, to focus where that meta-analysis could/should be done.

So, look to see what he did. Read the paper to see how he did that.
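To make the meta-analysis point concrete, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis for one SNP. This is not Johnson's method, only a standard textbook approach; the effect sizes and standard errors are made-up numbers, and the function names are hypothetical. It assumes a normal approximation and the conventional genome-wide threshold of 5e-8.

```python
from math import erfc, sqrt

# Conventional genome-wide significance threshold for human GWAS.
GENOME_WIDE_P = 5e-8


def two_sided_p(beta, se):
    """Two-sided p-value for an effect estimate under a normal approximation."""
    z = beta / se
    return erfc(abs(z) / sqrt(2))


def fixed_effect_meta(estimates):
    """Combine (beta, se) pairs with inverse-variance weights.

    Returns the pooled effect estimate and its standard error.
    """
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, estimates)) / total
    return beta, sqrt(1.0 / total)


# Hypothetical per-study results for the same SNP and phenotype: (beta, se).
studies = [(0.12, 0.025), (0.11, 0.024)]

individual_p = [two_sided_p(b, s) for b, s in studies]
combined_beta, combined_se = fixed_effect_meta(studies)
combined_p = two_sided_p(combined_beta, combined_se)

for i, p in enumerate(individual_p, start=1):
    print(f"study {i}: p = {p:.2e}, significant = {p < GENOME_WIDE_P}")
print(f"combined: p = {combined_p:.2e}, significant = {combined_p < GENOME_WIDE_P}")
```

With these invented inputs, neither study passes the threshold on its own, but the pooled estimate does, which is the situation described above.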

Now, in terms of a thesis project: building a database as you describe may not be enough. I would think long and hard about how you intend the data to be used. I feel that a strong database is one designed so that it is very easily integrated with other forms of data, such as gene expression. Johnson's database was insufficient for our needs of a disease-based approach to gene/SNP identification, so I used information and ideas from Robinson et al. (The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease) to build a hierarchical disease (i.e., phenotype) classification scheme that we superimposed onto the Johnson database. This simply serves as an idea of how you should design what you intend to build.

9.5 years ago by
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

In terms of knowing where to search for information, I always find it very convenient to check the NAR 2010 Database Issue to keep track of the databasing efforts around. By the way, there are other very interesting resources on NAR's homepage under the "THE JOURNAL" section at the right.

9.5 years ago by
Pune, India
Farhat2.9k wrote:

A good database for this could be the Mouse Phenome database at Jackson Labs. It includes a large number of phenotypic traits for inbred mouse strains as well as genomic data.

9.5 years ago by
London, UK
Giovanni M Dall'Olio27k wrote:

A good dataset is available from the Wellcome Trust Case Control Consortium, with data on thousands of individuals and associations with diseases like diabetes, tuberculosis, etc. However, to access the data, you have to make an official request and be supported by a Principal Investigator.

modified 9.0 years ago • written 9.5 years ago by Giovanni M Dall'Olio27k
9.5 years ago by
United States
jvijai1.2k wrote:

This is the FTP site for dbGaP public-access data.

9.0 years ago by
Radhouane Aniba760 wrote:

I would read GWAS papers and look for a recurrent problem instead of applying ML directly to a GWAS dataset. Personally, I have learned that research is more valuable if it answers questions that are causing problems for biologists. That said, you can go to the NHGRI website, where you will find curated SNP lists, and try to expand them using LD block analysis. There is also HuGENet, a collaborative effort that provides a list of SNPs related to phenotypes and traits.
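As a rough illustration of expanding a curated SNP list by LD, the sketch below keeps every SNP whose squared genotype correlation (r²) with a lead SNP reaches a cutoff. The genotype vectors (0/1/2 allele counts per individual), the SNP names, the function names, and the 0.8 cutoff are all assumptions for the example; real LD block analysis would normally use a dedicated tool and a reference panel.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic SNP carries no LD information
    return cov * cov / (var_x * var_y)


def expand_by_ld(lead, genotypes, threshold=0.8):
    """Return names of SNPs whose r² with the lead SNP is at least the threshold."""
    lead_vector = genotypes[lead]
    return sorted(
        name
        for name, vector in genotypes.items()
        if name != lead and r_squared(lead_vector, vector) >= threshold
    )


# Hypothetical toy genotypes for eight individuals.
genotypes = {
    "snp_lead": [0, 1, 2, 1, 0, 2, 1, 0],
    "snp_a": [0, 1, 2, 1, 0, 2, 1, 1],
    "snp_b": [2, 0, 1, 0, 2, 1, 0, 1],
    "snp_c": [0, 1, 2, 1, 0, 2, 1, 0],
}

print(expand_by_ld("snp_lead", genotypes))
```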

Good luck, Radhouane

9.0 years ago by
Ali0 wrote:

I think the easiest and most readily accessible DB would be the Mouse database at Jackson Labs.

Good luck, Ali
