In our current data set, we have phenotype data on nearly one million individuals, stored in a PostgreSQL database. For about 500 of those individuals, we also have 65K SNP chip genotypes, each currently kept in its own text file. We are acquiring more chips quickly and are looking for a way to organize the genotypes efficiently, so that we can select individuals from the database that match certain phenotypic traits and then build models with their genotypes.
Does anyone have any experience with this? If we just store the sequence of alleles in the database, one SNP per column is both impossible (PostgreSQL can't support 65K columns) and unreasonable. We could use a long table where each record corresponds to one SNP in one individual, but that table would be massive (already ~500 × 65K rows, and growing). Alternatively, we could use an array type and condense the allele sequence into something that fits in a single row; in that design, each row would hold an individual's ID and an array of all their alleles. I have also come across a paper (http://www.sciencedirect.com/science/article/pii/S1476927107000059) in which each row corresponds to a specific SNP and holds an array of that SNP's genotypes across all individuals. Do any of these methods make sense, or does someone have a better recommendation?
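To make the "condense the allele sequence" option concrete, here is a minimal sketch of one way it could work, assuming biallelic SNPs coded as 0/1/2 copies of the alternate allele (plus a missing code). Each call needs only 2 bits, so a 65K-SNP chip packs into roughly 16 KB per individual, small enough for a single binary column (e.g. bytea). The function names and the 2-bit coding are my own illustration, not something from a specific library:

```python
def pack_genotypes(calls):
    """Pack genotype calls (0, 1, 2, or 3 = missing) into bytes, 4 calls per byte."""
    packed = bytearray((len(calls) + 3) // 4)
    for i, call in enumerate(calls):
        if not 0 <= call <= 3:
            raise ValueError(f"call out of 2-bit range: {call}")
        # call i occupies bits (2*(i%4)) .. (2*(i%4)+1) of byte i//4
        packed[i // 4] |= call << (2 * (i % 4))
    return bytes(packed)

def unpack_genotypes(packed, n_calls):
    """Inverse of pack_genotypes: recover the list of 2-bit calls."""
    return [(packed[i // 4] >> (2 * (i % 4))) & 0b11 for i in range(n_calls)]
```

With this layout, a 65,000-SNP chip packs into 16,250 bytes, and the SNP order is fixed by a separate map (chip manifest), so the database row only needs the individual's ID and the packed blob.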
Or should we not bother storing the genotypes in the database at all, and instead use the database to point to the text files stored elsewhere?
Sorry if my terminology is confusing or incorrect; I am a computer scientist just getting acquainted with bioinformatics!