SNP array data storage and management with RDBMS (postgres)
3
1
Entering edit mode
4.2 years ago
bluepoint ▴ 10

In our current data set, we have phenotype data on nearly one million individuals; we store this information in a Postgresql database. For about 500 of the individuals, we have 65k SNP chips that are all stored in their own text files. We are quickly obtaining more chips and looking for ways to efficiently organize these so that we can quickly select individuals from the DB that match certain phenotypic traits and try to build models from with their genotypes.

Does anyone have any experience with this? If we are just storing the sequence of alleles in the database, it would be impossible (postgres can't support 65K columns) and unreasonable to store one SNP per column. We could have a table where each record corresponds to a SNP and an individual, but then the table would be massive. Alternatively, we could try to use an array type and condense down the allele sequence to something that could fit in there (in this situation, each row would be an id for an individual and an array of all they alleles). I have also come across a paper http://www.sciencedirect.com/science/article/pii/S1476927107000059 where they have each row correspond to a specific SNP and that row has an array of the genotypes for the individuals. Do any of these methods make sense or does someone have a good recommendation?

Should we not even bother storing it in the database itself, but rather use the database to point to the text file somewhere else?

Sorry if my terminology is confusing and incorrect, I am a computer scientist just getting acquainted with bioinformatics!

Thank you!

SNP rdbms microarray • 1.2k views
1
Entering edit mode
4.2 years ago

I am not absolutely sure if I understood your data structure correctly, by I would go for a solution with "a table where each record corresponds to a SNP and an individual". Even though such a table will become large, Postgres should be able to handle this. In particular, if you set up unique index over the SNP and individual. Also, this format provides the most flexibility in terms of querying and extending the database (e.g., which individuals have a particular SNP).

I strongly vote against using array data types. Updates of arrays will become a pain if you need to add another individual. Additionally, you have to manage the individual-to-array-index by yourself. Postgres is able to ensure referential integrity if you store data in long-table format.

Actually, I use the long-table format to store micro-array expression data in a PostgreSQL database (and other kinds of data, but for now mirco-array is the largest data set). For example, the following selection queries the normalized intensities of all probes measuring EGFR expression in all samples. The query requires less than <300ms for the first execution and ~25ms for subsequent executions (including changing the gene name)

select ds.name, probeset.probeset, symbol.accession, gcrma.gcrma
from bioinfo_hgu133_gcrmas as gcrma, bioinfo_hgu133_probesets as probeset, bioinfo_hgu133_mappings as mapping, bioinfo_genomic_genesymbols as symbol, bioinfo_hgu133_datasets as dataset, bioinfo_datasets as ds
where gxp.probeset_id = probeset.id
and probeset.id = mapping.probeset_id
and mapping.genesymbol_id = symbol.id
and gxp.hgu133_dataset_id = dataset.id
and dataset.dataset_id = ds.id
and symbol.accession = 'EGFR'


The query runs on 792 individual datasets (i.e., micro arrays), >50k probesets per dataset mapped to 90k genes. The total number of individual measurements (gcrma.gcrma) is 38 million.

0
Entering edit mode

Hi- I agree with the idea of "a table where each record corresponds to a SNP and an individual" (i.e. the normalized format in database parlance, right?). I also agree about avoiding arrays.

However, I doubt postgres or any RDBMS can cope with the situation. 65000 chips with each, say, 1 million SNPs makes a table of 65 billion rows! With appropriate indexes you could quickly access a given SNP in a given individual but range queries like "give me the SNPs on chr1 between position x and y) or any join operation will take forever. (Of course one could cut down the number of genotypes to store by excluding those that are equal to the reference?)

0
Entering edit mode

I agree, that the size will be extremely large and I am also not 100% sure if the database will be able to manage it efficiently without careful database design. At least, PostreSQL states that there are no row count limits. ;-) Depending on the query, the runtime will be large for sure - but runtime will likely be long for any approach in accessing such a huge data set.

I also think that a prior data reduction is a key step. For example, we do SNP6 analysis but do not store individual probe intensities but only pre-calculated gene copy number information. In particular, storing the raw SNP data in the database does not make sense if you want to further process it. I do not think that any available tool will be able to access a custom database ;-)

As mentioned, I am not sure about the actual data size. Probably, the OP can specify which of the following is correct:

1. 65000 chips for 500 individuals with (lets say) 1mio SNPs per chip ~> 65000 mio data points/rows
2. 500 individuals/chips with 65000 SNPs per chip ~> 32,5 mio data points/rows

I am not sure if it makes sense to have 65000 chips for 500 individuals. The would mean each individual was analyzed 130 times on average... but its hard to know without more information from the OP I guess...

PS: Yes - it is normalized format :)

0
Entering edit mode
4.2 years ago

Depends on the speed you want, your budget and how often such search is needed.

I can help you with this. How may I contact you? You can email me at pon.petr@gmail.com

Thank you

0
Entering edit mode

There are several approaches:

1) reduce number of mutations per sample by removing "not interesting data" and use sql database

2) use commercial database (i.e. SAP HANA)

3) store samples or groups in separate files (splitting by chromosomes is also an option)

4) use in-memory databases for well-designed indexes

Many more ways.

I love big bio data analysis:)

0
Entering edit mode

Oh, and if you have microarray data than map and per file pairs is a good way to go. Plus you can use plink program for a lot of different ways to analyze and process your data

0
Entering edit mode
4.2 years ago

I don't have direct experience with the problem you describe, so I just throw some thoughts...

My gut feeling is that storing genotype data of that size in a traditional RDMBS is not feasible and maybe not even necessary. I think an RDMBS would be useful if multiple users need to access relatively few SNPs at a time. Also an RDMBS is great to enforce referential integrity.

I think I would store the sample metadata in a RDMBS, these data is relatively small and should benefit from enforcing referential integrity. The massive genotype data would stay either in indexed files (e.g. tabix), or using a dedicated framework like gemini or a non-relational database.

It's an interesting problem, if you get a chance post here your experience on how you go about it!