Question: How to design an SQL relational database for gene expression data?
gravatar for enricoferrero
4.9 years ago by
United Kingdom
enricoferrero780 wrote:


I have to design a relational database to store the results of gene expression experiments (both microarrays and RNA-seq), using MySQL.

I have a number of experiments, and for each experiment several contrasts/comparisons. For each contrast, I have thousands of genes with fold changes and p-values that I want to store. The database will be queried to retrieve genes showing differential expression across comparisons and across experiments. 

What's the best design for such a scenario? I'm thinking to create one table per experiment but I'm not sure this is the best way to go about it. I'd be curious to hear from people who have experience in designing relational databases or who use SQL to query databases containing gene expression data.

Thank you!

ADD COMMENTlink modified 4.9 years ago by ilya.shamovsky20 • written 4.9 years ago by enricoferrero780

Using a database system to store genes with just p-val and fold-change seems a bit overkill. What's the size of your data? If it is less than few hundred megabytes, you could probably just load the entire thing into memory to serve to the user via a simple webserver framework like or node.js.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by Damian Kao15k

Thanks for the suggestion Damian.

That's not what I need though. I'm oversimplifying the situation in my question. In reality I'll have several columns to store for each comparison and experiment. Besides, this will be part of a larger projects and has to be a relational database.

ADD REPLYlink written 4.9 years ago by enricoferrero780
gravatar for ilya.shamovsky
4.9 years ago by
United States
ilya.shamovsky20 wrote:

You can have a look at the table organization for CummeRbund. It's for one experiment only, but you can easily add experiment  on top of that, something similar to this.

ADD COMMENTlink written 4.9 years ago by ilya.shamovsky20
gravatar for Jean-Karim Heriche
4.9 years ago by
EMBL Heidelberg, Germany
Jean-Karim Heriche21k wrote:

When designing databases, I try to follow a business object model. This usually means having a table per entity/class (e.g. a gene table, an experiment table...)and mapping their attributes to columns. I then apply normalization constraints, typically trying to achieve third normal form. I denormalize when it makes sense (e.g. faster queries). I would start simple with just core entities and their relationships. I also use MySQL Workbench to help with the design.

ADD COMMENTlink written 4.9 years ago by Jean-Karim Heriche21k
gravatar for pld
4.9 years ago by
United States
pld4.8k wrote:

If this is for a single large experiment where the points of comparison are fixed the column approach could work, otherwise I would suggest adding an 'experiment' component.

What I've done in the past is have experiment, condition and fold change tables. Each experiment has several conditions, then you make two keys for the fold change table which refer to conditions. In the fold change table I made keys for the experiment and the two conditions. The exact implementation is a bit more complex, it was for microarray data so I built in the array's probes and so on into this. That way it was possible to pull fold change by probe id as well as the raw intensity values for each instance of that probe spotted on the array.

This made it relatively easy to just dump the fold changes from limma/etc into a single table using a python script to insert the relevant rows into the experiment and condition tables.

The idea was similar to what you are saying, the plan was to be able to pull fold changes for a given gene across individual experiments to make meta-analysis easier.

There is probably a better way, but you want to avoid adding tables for each experiment or comparison.

ADD COMMENTlink written 4.9 years ago by pld4.8k

Thanks Joe, this is exactly the sort of answer I was looking for!

I'm not sure I completely understand what you mean with the two keys for the fold change table. Do you care to explain with an example? Or, even better, can you show me what the tables would look like? Cheers!

ADD REPLYlink written 4.9 years ago by enricoferrero780

You have one key per condition, a fold change is derived from two conditions. Maybe key isn't the right term.

Say in my comparison table I have two rows:

Experiment        Condition        Description

foo               foo.1            <text description of condition>

foo               foo.2            <text description of condition>

A row in the fold change column may look like

Condition1 Condition2 Marker FC    log2FC pval   fdr.pval

foo.1      foo.2      geneX  0.25  -2     0.0001 0.001

It isn't the best approach, I had to join the condition table twice. It isn't properly normalized but I was trying to keep a happy mixture.

ADD REPLYlink modified 4.9 years ago • written 4.9 years ago by pld4.8k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2054 users visited in the last hour