Regardless of which organism and which set of phenotypes you choose, you need to look at what Andrew Johnson (of the Framingham Heart Study) did. His paper is here. I know the paper is a bit old (we're working with him in a minor way to update this work), but it is still extremely useful. In it, he mined GWAS data for signals that fall below genome-wide significance thresholds and combined them with a range of manually curated information. Signals at the same locus for the same, or very nearly identical, phenotypes in two different studies may not reach significance individually but could in a meta-analysis. His work therefore lets you, for example, focus on where such a meta-analysis could or should be done.
So, look to see what he did. Read the paper to see how he did that.
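To make the meta-analysis point concrete, here is a minimal sketch of a fixed-effect, inverse-variance meta-analysis in Python. This is a standard textbook approach, not necessarily what Johnson's paper does. The effect sizes and standard errors are made-up numbers, and the 5e-8 threshold is the conventional genome-wide cutoff, assumed here for illustration.

```python
import math

GENOME_WIDE_THRESHOLD = 5e-8  # conventional cutoff, assumed for illustration


def two_sided_p(beta, se):
    """Two-sided p-value for a normally distributed test statistic beta/se."""
    z = abs(beta / se)
    # erfc keeps precision for very small p-values, unlike 1 - cdf.
    return math.erfc(z / math.sqrt(2))


def fixed_effect_meta(betas, ses):
    """Combine per-study effects using inverse-variance weights."""
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    return beta, se, two_sided_p(beta, se)


# Hypothetical signal at one locus, reported by two independent studies.
betas = [0.10, 0.10]
ses = [0.02, 0.02]

for b, s in zip(betas, ses):
    p = two_sided_p(b, s)
    print(f"single study: p = {p:.2e}, significant = {p < GENOME_WIDE_THRESHOLD}")

beta_meta, se_meta, p_meta = fixed_effect_meta(betas, ses)
print(f"meta-analysis: p = {p_meta:.2e}, significant = {p_meta < GENOME_WIDE_THRESHOLD}")
```

With these invented numbers, neither study passes the threshold on its own, but the combined estimate does. That is exactly the kind of locus a catalogue of sub-threshold signals helps you find.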
Now, in terms of a thesis project: building a database as you describe may not be enough. I would think long and hard about how you intend the data to be used. I feel that a strong database is one designed so that it integrates easily with other types of data, such as gene expression. Johnson's database was insufficient for our needs, which called for a disease-based approach to gene/SNP identification, so I used information and ideas from Robinson et al. (The Human Phenotype Ontology: a tool for annotating and analyzing human hereditary disease) to build a hierarchical disease (i.e., phenotype) classification scheme that we superimposed onto the Johnson database. Take this simply as one idea of how you might design what you intend to build.
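As a rough illustration of superimposing a phenotype hierarchy onto association records, here is a small Python sketch. It is not our actual schema, and the terms are not real HPO identifiers. The term names, SNP IDs and p-values are all invented. The idea is just that each record gets indexed under its own phenotype and every ancestor term, so a query on a broad disease class pulls in the narrower phenotypes too.

```python
from collections import defaultdict

# Hypothetical child -> parents map. Lists allow a term to have several
# parents, as in an ontology that is a DAG rather than a strict tree.
PARENTS = {
    "myocardial infarction": ["coronary artery disease"],
    "coronary artery disease": ["cardiovascular disease"],
    "hypertension": ["cardiovascular disease"],
    "cardiovascular disease": [],
}

# Made-up association records: (snp_id, phenotype, p_value).
ASSOCIATIONS = [
    ("snp_A", "myocardial infarction", 3e-6),
    ("snp_B", "hypertension", 2e-7),
    ("snp_C", "coronary artery disease", 8e-6),
]


def ancestors(term, parents=PARENTS):
    """Return the term itself plus every term above it in the hierarchy."""
    seen = set()
    stack = [term]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def index_by_term(associations, parents=PARENTS):
    """Index each record under its phenotype and all ancestor terms."""
    index = defaultdict(list)
    for record in associations:
        for term in ancestors(record[1], parents):
            index[term].append(record)
    return index


INDEX = index_by_term(ASSOCIATIONS)

# Querying the broad class returns records annotated to any narrower term.
for snp_id, phenotype, p_value in INDEX["cardiovascular disease"]:
    print(snp_id, phenotype, p_value)
```

In a real database the same idea would more likely live in a term table plus an ancestor (closure) table that you join against, but the lookup logic is the same.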