Once I did a bioinformatics project using MongoDB, but I'm kind of confused which database software is more suitable for bioinformatics web development. Sure, MySQL, relational database, is great. However, what's the difference between the performance of these two kinds of database, relational and non-relational? PS: Does it depends on whether we use PHP or not?
This is a very general question, in particular what is bioinformatics web development? It really depends on the application case and requirements, and certainly the type of data to store. Note: Database Management Systems (DBMS) are a back-end technology meaning they are used for storing and retrieving the data in a reliable and performant way. web development refers to front-end technology, that is the presentation of the application to the user and how the application is interacting with the user. Deciding on the technology should be the last step in the design process of a software.
Some questions to guide the decision process:
- How is your data structured, can the data be represented in relational format?
- Have you modeled the data using UML/ER diagrams?
- Volume of your data?
- Do you update/read often or rarely?
- Do you update/read in parallel?
- Do you need transactions, atomic actions, ACID, to ensure consistency?
- Do you need foreign keys, indexing, stored procedures, views?
- Other performance boosting required?
- Replication required?
- Do you need a well-defined Query language (SQL, XPATH) or can you hardcode a few queries?
- can you setup a DB server in client server infrastructure?
- do you need remote connections (CLI, TCP, web-services) to the database?
- which authentication/authorization model do you use for your application?
- which backup strategy do you prefer/policies are in place at the site of deployment?
- Do language bindings to preferred frame-work/coding tools/languages exist at the appropriate level of abstraction?
- Which support model do you prefer? (Some might prefer a commercial support model)
- PostgreSQL (most feature-rich, covers practically everything named above, richest SQL, including Document database/NoSQL functions, will work most of the time)
- MySQL, MariaDB, etc. (very similar in feature richness, maybe a bit higher performance)
- SQLite (file based SQL database management, no server required, great for testing, should fit most bioinformatics needs, unless many concurrent updates, easy to share as a file - no dump required, easiest way to start and to learn SQL)
- Oracle (if you need a commercial DBMS)
- Large variable matrices (microarrays, RNA-seq data, other measurements) or array-shaped data with variable dimensions or loosely defined structure are not such a good fit for Relational Databases, HDF5 or similar formats could be a reasonable alternative, possibly keeping metadata in MongoDB.
Micheal has given and asked some really excellent questions, so I'll just share what we've been doing.
Our approach is developing and deploying events driven web applications using serverless technology (https://aws.amazon.com/lambda/ ) and couple these applications with various serverless database services (https://aws.amazon.com/dynamodb/, https://aws.amazon.com/rds/aurora/, and https://cloud.google.com/bigquery/ ). The idea behind this approach is to reduce the group's burden in maintaining running servers so we can focus on the development and analytics.