I’m trying to compile information on how big labs (e.g. 40+ people) or small organisations that produce large amounts of NGS data manage the data storage issue. As a small organisation, we generate 10-15 TB of HTS data every year from large genomics and metagenomics projects. We mostly depend on an external HPC cluster service provider to store our TBs of data and run our analyses there, with a few more TB on a local server. We have both "active" and "cold" (no longer actively used) datasets from our research projects, and we try to keep these datasets for at least 7 years.
Since many of us are dealing with NGS datasets that run to TBs per project, I was wondering:
- how/where do you store your "active" and "cold" HTS data? Do you have an in-house server with TBs of storage, use a cloud provider (e.g. Amazon), or something else?
- is it cost-effective to keep "cold" data in cloud storage (e.g. Amazon, Azure) and build a small local server for working with "active" data? Any idea of the cost involved in building a small cluster/server with TBs of storage? (See the sketch after this list for the kind of archiving setup I have in mind.)
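To make the second question concrete: what I had in mind for the "cold" side is pushing finished project directories into an archival object-storage tier, which is typically an order of magnitude cheaper per TB-month than standard object storage (at the price of slow retrieval). Below is a minimal sketch of what that could look like with boto3 against S3 Glacier Deep Archive; the bucket name, key prefix and local path are placeholders, not our actual setup, and credentials are assumed to be configured separately.

```python
# Minimal sketch: archive a directory of "cold" project folders to S3 Glacier Deep Archive.
# Bucket, prefix and local path below are hypothetical placeholders; AWS credentials are
# assumed to be configured already (e.g. via ~/.aws/credentials or environment variables).
import pathlib

import boto3

BUCKET = "my-lab-ngs-archive"           # hypothetical bucket name
PREFIX = "cold"                         # hypothetical key prefix for archived projects
COLD_DIR = pathlib.Path("/data/cold")   # local directory holding finished projects

s3 = boto3.client("s3")

for path in sorted(COLD_DIR.rglob("*")):
    if not path.is_file():
        continue
    key = f"{PREFIX}/{path.relative_to(COLD_DIR)}"
    # DEEP_ARCHIVE is the cheapest S3 storage class; retrieval takes hours,
    # which is usually acceptable for data kept mainly for the 7-year retention period.
    s3.upload_file(str(path), BUCKET, key, ExtraArgs={"StorageClass": "DEEP_ARCHIVE"})
    print(f"archived {path} -> s3://{BUCKET}/{key}")
```

The "active" data would then live only on the local server/HPC scratch while under analysis, and old projects would be restored from the archive on the rare occasions they are revisited.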
I have talked to a few colleagues, and it sounds like everybody manages somehow but is looking for better options. Since many of us are struggling to store HTS data with some form of backup, are there any cost-effective solutions? Many organisations are investing heavily in eResearch (i.e. data science), and many of us already know that storage of genomics data is a big issue for researchers across organisations and needs more attention.
It would be good to share our ideas here and see whether we can adopt each other’s approaches.