Question

Forum:Fundamentals Of Designing Biological Databases?

12

Entering edit mode

13.7 years ago

anon1234z ▴ 130

I'm currently a postdoc with a solid background in bioinformatics, and I've gotten by for the last few years by storing everything I use in flat files. So far it hasn't been a problem, but I can't help feeling like I should know more about databases so that I can use them where appropriate.

Sure, I've used databases. I can write basic SQL statements (select * from blah) or insert and delete rows. I've used this to design uber-simple web apps in the past. Where I'm deficient is understanding how to design them well. What columns should be indexed? What's all this stuff about normalization? Why would I prefer one database type over another? (MySQL, SQLlite, MongoDB, etc). Which are most commonly used for bioinformatics work, and in what subdomains (genomics, proteomics, MD, etc)?

I'd like the community's help in finding some good tutorials that will get an experienced coder and bioinformatician up to speed in the minimum amount of time possible. Bonus points if they specifically address the types of big-data problems that we're all facing in the new high-throughput world of bioinformatics.

database • 17k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 13.7 years ago by anon1234z ▴ 130

7

Entering edit mode

Your question would be a lot easier to answer (and research as well) if you reduced its scope. The expression "biological databases" is too generic, one would most certainly use different techniques when storing intervals vs storing a reads or microarray data etc

ADD REPLY • link 13.7 years ago by Istvan Albert 101k

0

Entering edit mode

The most important question is what problems you want to solve. Learning purely for the purpose of learning will not teach you much. Are you going to design a database targeting at NAR database issue? If not, do not waste your time on SQL.

ADD REPLY • link 13.7 years ago by lh3 33k

score 8 · Answer 1 · 2010-11-10

8

Entering edit mode

13.7 years ago

Gww ★ 2.7k

There are some outstanding tips this article on stackoverflow. It's a nice summary of common mistakes people make when designing databases.

ADD COMMENT • link 13.7 years ago by Gww ★ 2.7k

score 7 · Answer 2 · 2010-11-10

7

Entering edit mode

13.7 years ago

Andrea_Bio ★ 2.8k

Here is a link to a wiki book called Bioinformatics data management which has explains ER theory and normalisation and has some exercises

As for indexes, that really does depend on your database. As a basic example if you have a database storing millions of SNPs and you have a table SNPs with fields like chromosome and locus representing the location of the SNP, and you might want to do a query like 'select * from SNPs where chromosome=10 and locus >1000000'. Here you want to index those fields so the DBMS can go straight to the SNPs on chromosome 10, and find the SNPs which are after base 1 million, rather than having to wade through all the other SNPs found 'before' that. As a rule of thumb, if you think you were doing the query manually and you would find it easier to have the field indexed, then index it. Some people say you sholud index every primary key, foreign key and column in a where clause.

ADD COMMENT • link 13.7 years ago by Andrea_Bio ★ 2.8k

0

Entering edit mode

Indexing primary keys is mandatory at least in the databases I'm familiar with. Indexing all foreign keys is just wrong advice. Indexes should be added only if they increase performance. Databases usually have rich set of tools to measure the performance.

ADD REPLY • link 13.7 years ago by Jussi ▴ 180

0

Entering edit mode

"Indexing all foreign keys is just wrong advice". Yes. Especially, if one works with MyISAM table (the most popular mySQL engine) where a foreign key has no meaning for the engine.

ADD REPLY • link 13.7 years ago by D. Puthier ▴ 350

0

Entering edit mode

That depends on your database. I never said it applied to every database. From the mysql user manual:

"InnoDB requires indexes on foreign keys and referenced keys so that foreign key checks can be fast and not require a table scan. "

ADD REPLY • link 13.7 years ago by Andrea_Bio ★ 2.8k

0

Entering edit mode

I'm exaggerating a bit here, but it depends on the application, not the database engine. For example, it does not make any sense indexing foreign keys on data warehouse application to speed up foreign key checks because the data is mostly read only. And when the data changes, full scan may be the fastest method for checking referential integrity.

ADD REPLY • link 13.7 years ago by Jussi ▴ 180

0

Entering edit mode

I didn't think the question author would be writing data warehouse applications given their introductory question on databases. I apologise for being too simplistic. I expect the author will have learnt a bit more about indexes from our discussion so all it not lost.

ADD REPLY • link 13.7 years ago by Andrea_Bio ★ 2.8k

score 7 · Answer 3 · 2010-11-10

I would start by reading about the design of some successful databases used in biology. For example:

BioSQL (MySQL) which is "a generic relational model covering sequences, features, sequence and feature annotation, a reference taxonomy, and ontologies"
The Ensembl Computing Architecture (MySQL), Genome Research paper describing Ensembl
The Integrated Microbial Genomes database (Oracle)
search at PubMed for terms such as "bioinformatics database design".

Most resources currently employ SQL; it's very early days for so-called "NoSQL" deployment in biology. Search Google for "bioinformatics mongodb", for example: you will find a few blog posts, some by BioStar members, exploring the topic.

score 4 · Answer 4 · 2010-11-09

Each database design is a new problem and there is no universal way to design it. I guess StackOverflow would be a better place to answer your question:

about normalization: http://stackoverflow.com/questions/1258743 (...)
about indexes: http://stackoverflow.com/questions/403393 (...)
choosing an engine: http://stackoverflow.com/questions/831849 (...)
etc...

score 4 · Answer 5 · 2010-11-10

The database applications fall into two main categories: Online Transaction processing (OLTP) and Data warehousing (DW). Most of the advice given here are related to OLTP only and should not applied to DW. I'm not touching the OLTP and modeling since other people have already covered those.

Designing a DW is quite different from OLTP. If you have "big data" problems, you are probably building a DW for that. Then you should

Choose the database engine based on your requirements. "Big data" often requires Hash based algorithms: hash join, hash group by, index hash joins, etc.
Design the physical structure based on your requirements: rows that are queried together should be close to each other. Use Clustered Indexes (MS SQL), Index organized tables, cluster organized tables (Oracle), etc. to control that.
Do not try to make all queries to use indexes, full scanning a table is by far the most efficient access method if you process large portion of the rows.
Limit the amount of data processed by queries, use partitioning
Use parallel full scans if you have enough disk throughput
Use materialized views to speed up large queries (example).
Try to minimize the size of the storage: use table compression, index key compression, bitmap indexes, function based indexes (example), etc.
Understand the database engine you are using and use the features it provides
Validate the tips you read on the internet, those are often wrong or out of context. Including this one.

score 3 · Answer 6 · 2010-11-10

Concerning database modeling the easiest way to get something with few redondancy (that is "normalized") and reasonable performance is to use the MERISE method (I think this is the french equivalent to E/A modelling). You will have to draw a conceptual data model (MCD) (a diagram with entities and associations) that will also include cardinalities between those entities (one to one, one to many, many to many). Based on the diagram you can create a MLD (logical model of data). The MLD is almost the definitive database shema. To create a MLD there are few rules. The three most important to my knowledge are:

if a many-to-many relationship exists between entities A and B, a new table C is created that will contain the two PK a,b of A and B (thus they will be foreign keys). The PK of C will be a composite of a and b.
if a one-to-many relationship exists between entities A and B the PK of B is copied in A and becomes a FK.
if a one-to-one relationship exists, A and B should be part of the same entity.

From the MLD the next step is to adapt it to your own database system (mySQL, PostgreSQL, ORACLE, DB2). This last step is the MPD (physical model of data).

You can use AnalseSI or Power AMC as a tool to go from MCD>MLD>MPD. However for a more sophisticated MPD, you should draw MPD from MLD using a dedicated tool such as mysql-workbench (if working with mySQL). I have discovered the MERISE method ~ 5 years ago. It has know become extremly simple to modelize anything. However one should concider that this are the canonical rules but sometimes you can decide not to follow them because your database shema would not fit with your data structure (you would have to create many perl scripts to make them fit with the database model.

Hope this will give you some clues.

Sorry but some terms (MLD, MCD,..) are translated from french. i'm not sure about their english translation.

score 3 · Answer 7 · 2010-11-10

Just to reiterate some of the comments - scope is everything for a bioinformatics database.

If the DB does everything, its likely that it won't do anything.

Normalization is useful, but like most programming maxims you have to use it only where appropriate and break it where it kills performance. The truth about any relational data system is that you can't have to many joins. If you have more than 20 tables in your database and you are probably going to have performance problems. If you use more than maybe 4-5 tables in any given query it could take hours to get a response back.

Bioinformatics people are often 5-7 years out of date with these ideas. MongoDB and other new data retrieval systems are great - they will be useful for NGS applications and no doubt there will be better ones coming up soon.

But one of the best performance features they have is that they do not let developers have many tables. After that, giving up ACID compliance in exchange for speed is usually the second bump they have.

Mysql may give the same performance as MongoDB when you don't use joins.

For starters, learn what a star schema is and its variants, read up a little bit on RDBMS data mining and you will get an idea of how data retrieval performance works. I also think MongoDB is a great place to look next because of its auto scaling abilities. Not sure its a great place to store sequence data you want to search though.

score 1 · Answer 8 · 2011-03-14

You've gotten some good pointers on where to find more info. I work in industry, so my DB design problems and constraints are probably a little different than yours. But my general advice is:

Learn from the general database design resources out there. Biology data is different, but not THAT different. I have watched many people struggle to solve database design problems that have been solved in other domains, just because they were too arrogant to admit that they could learn something from a non-scientific application. I've also watched people develop databases with staggeringly stupid designs because they were just ignorant of relational theory. C.J. Date has an O'Reilly book that is a pretty good and brief intro to relational theory.
Be suspicious of anyone who tells you that you "always" have to break normalization because of performance issues. That just isn't true. It depends on the size of your tables, the details of what you are doing with them, the details of your hardware, and the options provided by your database software. I tend to think that people who make these sorts of blanket statements are either lazy or unaware of the risks associated with denormalizing.
If you do decide to denormalize, make sure you aren't hiding ignorance about the structure of the data. I generally capture the normalized data model first, and then document denormalizations and why they were done. It makes the database easier to maintain. There are tools out there that will help you keep track of this stuff. Some of them are even free (e.g., Oracle Data Modeler).

score 0 · Answer 9 · 2010-11-09

0

Entering edit mode

13.7 years ago

Anon1234Y ▴ 160

Why don't you start by using google? Or why don't you start taking a course? In every university you will find courses that will teach you the basics. For instance, that mysql, etc. are database management systems and no "types of databases"... Are you even sure you need an relational database?

ADD COMMENT • link 13.7 years ago by Anon1234Y ▴ 160

2

Entering edit mode

+1 Giving so many down votes is not a good way to encourage a newcomer even if (s)he is using a temporary ID. I actually think this answer is good. At least (s)he questioned the necessity of using SQL, which is better than letting anon1234z pursue a wrong direction.

ADD REPLY • link 13.7 years ago by lh3 33k

1

Entering edit mode

The advice is sound but formulating it in a more 'diplomatic' way would avoid the appearance of it being a little bit antagonistic

ADD REPLY • link 13.7 years ago by Istvan Albert 101k

0

Entering edit mode

+1 for speaking out the truth.

ADD REPLY • link 13.7 years ago by lh3 33k

0

Entering edit mode

Needlessly inflammatory and doesn't address the question at all. Also, respondent should spend 30 seconds learning the difference betwween Mysql, SQLite, and MongoDB to find out that they are representatives of different approaches to data storage. Maybe Google would help?

ADD REPLY • link 13.7 years ago by David Quigley 11k

0

Entering edit mode

I might go against the downvoting trend here for a second and +1 this :) I agree the "google" suggestion is less than useful, but I think the basic point is sound - when coming at databases from a very novice level, finding a basic course at a university might be a good place to start. I think jumping straight into articles about normalisation, etc., without some context will be a bit daunting. As Istavan implies, the question is too broad to be treated as a standard Q&A.

ADD REPLY • link 13.7 years ago by Bio_X2Y ★ 4.4k

score 0 · Answer 10 · 2018-01-22

0

Entering edit mode

6.5 years ago

bioinfo8 ▴ 230

Hi all,

I am looking for ways to create a biological database. I have some data in MS-Access and kind of confused on which approach I should follow to create database. I would like to have an updated opinion from the biostars community on the different ways of creating a database, front and back-end options, using R or Python and publishing on institute website/github.

Please guide.

Thanks.

ADD COMMENT • link 6.5 years ago by bioinfo8 ▴ 230

0

Entering edit mode

You should post your own question making sure that you define the scope better e.g. what kind of data do you want to store ? What would be the access pattern ?

ADD REPLY • link 6.5 years ago by Jean-Karim Heriche 27k

0

Entering edit mode

Jean-Karim Heriche I would like to store information on a protein family among different organisms including their names, gene information, sequence information, structure etc. Would you please elaborate more on access pattern?

ADD REPLY • link 6.5 years ago by bioinfo8 ▴ 230

0

Entering edit mode

What JKH said: "You should post your own question". Furthermore, this sounds like a very basic database problem, somehow not related to bioinformatics.... see my comments A: Fundamentals Of Designing Biological Databases?

ADD REPLY • link 6.5 years ago by Pierre Lindenbaum 163k