We are currently setting up a private cloud in our research department. Our goal is to improve the handling, organization and analysis of our various data types and structures (approx. 350TB in total). I now have to choose a database management system that is capable of handling a data lake efficiently. Since I am not an expert in databases, I would really appreciate some constructive comments on my thoughts. Here we go:
- Given that we deal with semi-structured or rather unstructured data, I will use a NoSQL database
- I would choose between MongoDB, Cassandra and Apache HBase. With MongoDB I get a JSON-based document structure, although its MapReduce implementation remains slow (and is deprecated from version 5 onwards!) and memory hogging is still an issue. MongoDB splits sharded data into chunks of 64MB, so I am a bit worried about the performance when feeding in data points of 100GB to 150GB in size. Just like HBase, its replication is based on a master-slave principle, which could be error-prone in the event of a server failure. Cassandra, on the other hand, stores data in columns and rows with a SQL-like syntax and masterless ring replication. With HBase, I could rely on HDFS, ZooKeeper and the rest of the Apache crew.
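To make the document-vs-column comparison concrete, here is a minimal sketch of how sample metadata might look in MongoDB; the database/collection names, field names and the local connection string are made-up placeholders, and this assumes the pymongo driver is installed. The rough Cassandra equivalent is only indicated as a comment.

```python
# Minimal sketch, assuming a local MongoDB instance and the pymongo driver;
# "research", "samples" and all field names are hypothetical placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["research"]

# One self-describing JSON/BSON document per sample -- no schema migration
# needed when a new data type (e.g. PET scans) shows up later.
doc = {
    "sample_id": "S-0001",
    "species": "Homo sapiens",
    "assays": [
        {"type": "WGS", "file": "/lake/bam/S-0001.bam", "size_gb": 120},
        {"type": "MRI", "file": "/lake/mri/S-0001.nii.gz"},
    ],
    "lab_report": {"date": "2021-03-02", "status": "reviewed"},
}
db.samples.insert_one(doc)

# A rough Cassandra/CQL counterpart would be a fixed, query-driven table, e.g.:
#   CREATE TABLE samples_by_id (
#       sample_id text, assay_type text, file_path text, size_gb int,
#       PRIMARY KEY (sample_id, assay_type));
# i.e. flat columns and rows rather than nested documents.
```

Note that even in the document case only the *path* to the 100GB+ file appears in the database; a single BSON document is capped at 16MB, so the raw data itself would have to live elsewhere anyway.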
In general I am leaning towards MongoDB, but I am not 100% convinced. What would you say?
Thanks in advance for some critical comments!
For this forum, I guess you'd need to connect a bit more to the bioinformatics part. In general I'm not sure you're targeting the best community here; though there is undoubtedly one or another NoSQL database specialist around, you would probably find many more elsewhere.
For example, I have no more than beginner's experience with NoSQL databases (specifically MongoDB). However, I would argue that most data in bioinformatics is actually quite structured and fits relatively well into a SQL database (called variants, genome annotation such as genes, transcripts and proteins, etc.). The most unstructured part is often the metadata for sample specimens, and even that can be structured with comparatively little effort. In my opinion, the only thing a NoSQL database buys me is that it saves me from schema updates.
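Just to illustrate what I mean by "quite structured", a toy relational schema for variants and samples could look like the sketch below (stdlib sqlite3 for brevity; all table and column names are invented for the example):

```python
# Toy sketch of a relational layout for called variants and annotation,
# using Python's stdlib sqlite3; names and values are purely illustrative.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sample  (sample_id TEXT PRIMARY KEY, tissue TEXT, collected DATE);
CREATE TABLE gene    (gene_id TEXT PRIMARY KEY, symbol TEXT, chrom TEXT,
                      start_pos INTEGER, end_pos INTEGER);
CREATE TABLE variant (sample_id TEXT REFERENCES sample(sample_id),
                      chrom TEXT, pos INTEGER, ref TEXT, alt TEXT,
                      genotype TEXT);
""")
con.execute("INSERT INTO sample VALUES ('S-0001', 'liver', '2021-03-02')")
con.execute("INSERT INTO variant VALUES ('S-0001', 'chr7', 55181378, 'C', 'T', '0/1')")

# Typical question: how many variants were called for this sample?
print(con.execute("SELECT COUNT(*) FROM variant WHERE sample_id = 'S-0001'").fetchone())
```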
What are the "data points of 100GB to 150GB in size" you want to feed in? JSON or binary BAM files?
To protect against server failure, you can set up a sharded cluster with replica sets, so you don't have to worry too much about failures.
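From the client side, the failover story looks roughly like the sketch below. It assumes a three-node replica set named "rs0" on hosts db1-db3 (all placeholders) and only covers the replica-set part, not sharding.

```python
# Hedged sketch of writing to a MongoDB replica set with pymongo;
# hosts and replica set name "rs0" are assumptions, not a real deployment.
from pymongo import MongoClient, WriteConcern

client = MongoClient(
    "mongodb://db1:27017,db2:27017,db3:27017/?replicaSet=rs0",
    retryWrites=True,  # retry a write once if the primary steps down mid-operation
)
samples = client["research"].get_collection(
    "samples", write_concern=WriteConcern(w="majority")
)

# The write is only acknowledged once a majority of members have it, so a
# single server failure does not lose it; the driver fails over to the new primary.
samples.insert_one({"sample_id": "S-0002", "status": "ingesting"})
```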
Thanks for the reply and your opinion. The reason why I am referring to a data lake is our data variety: we process genome files (BAM), MRI and PET scans, histological images and lab reports. That is also why I think NoSQL is appropriate. In fact, I had genome files (BAM) in mind when I wrote about data points of 100GB to 150GB in size. Concerning your suggestion of a sharded cluster with replica sets: I also found this when studying MongoDB in a bit more detail. HBase with HDFS also creates replicas while writing data in a sharded system (if I got that correctly). I just wanted to hear an opinion on which seems more appropriate for my use case.
Your task and location almost make me believe we're working in the same company. As Istvan Albert stated below, I'd recommend going with links to your data lake, too.
Also, I'm pretty sure you'll end up with performance issues when mixing big blobs and documents in the same data store; see for example this anecdotal SO question. Maybe the CephFS mentioned there could be useful to you, too.
I am already using CephFS for a different project and I am very happy with its performance. When it comes to distributed file systems, MooseFS is also an option. This is actually a very good idea: I could try to couple CephFS/MooseFS with my database management system of choice. Let's see how to implement this. Thanks!
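One possible way to do the coupling, sketched below under the assumption that the large BAMs stay on the distributed filesystem (mounted at a made-up path like /mnt/cephfs) while MongoDB only keeps the metadata plus a path and checksum; function and field names are hypothetical.

```python
# Sketch: register a BAM that already lives on CephFS/MooseFS in MongoDB,
# storing only its path, size and checksum -- not the 100-150GB file itself.
import hashlib
from pathlib import Path
from pymongo import MongoClient

def register_bam(bam_path: str, sample_id: str) -> None:
    """Record a BAM file from the distributed filesystem in the metadata store."""
    path = Path(bam_path)
    md5 = hashlib.md5()
    with path.open("rb") as fh:  # stream in 64MB blocks, never load the whole file
        for block in iter(lambda: fh.read(64 * 1024 * 1024), b""):
            md5.update(block)
    MongoClient("mongodb://localhost:27017")["research"]["files"].insert_one({
        "sample_id": sample_id,
        "path": str(path),                 # e.g. /mnt/cephfs/bam/S-0001.bam
        "size_bytes": path.stat().st_size,
        "md5": md5.hexdigest(),
        "type": "BAM",
    })

# register_bam("/mnt/cephfs/bam/S-0001.bam", "S-0001")
```

The analysis tools then read the BAM straight from the filesystem, and the database is only queried to find out which files exist for a sample and whether they are intact.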