Forum: NGS Data storage solutions for small organisations or big labs
bioinfo (New Zealand) wrote, 6 days ago:

Hi all,

I’m trying to compile information on how big labs (e.g. 40+ people) or small organisations that produce large amounts of NGS data manage data storage. As a small organisation, we produce 10-15 TB of HTS data every year from large genomics and metagenomics projects. We mostly depend on an external HPC cluster provider to store our data alongside the analysis we run there, plus a few more TB on a local server. We have both 'active' and 'cold' (no longer actively used) datasets from our research projects, and we try to retain these datasets for at least 7 years.

Since many of us are dealing with NGS data from projects that run to many TB, I was wondering:

  • How/where do you store your "active" and "cold" HTS data? Do you have an in-house server with tens of TB of storage, online cloud storage (e.g. Amazon), or other options?
  • Is it cost-effective to use cloud-based storage (e.g. Amazon, Azure) for "cold" data and to build a small local server for working with "active" data? Any idea of the cost involved in building a small cluster/server with tens of TB of storage?

I have talked to a few colleagues, and it sounds like everybody manages somehow but is looking for better options. Since many of us are struggling to store HTS data with some form of backup, I was wondering whether there are any cost-effective solutions. Many organisations are investing heavily in eResearch (i.e. data science), and many of us already know that genomics data storage is a big issue for researchers across organisations and needs more attention.

Perhaps we can share our ideas here and learn from each other's approaches.

genomax (United States) wrote, 6 days ago:

For us, active data is always on high-performance local cluster storage. We are a bigger organization/sequencing center and have access to plenty of storage (not infinite, but adequate for ~6 months of output, i.e. hundreds of TB). We also use a large Quantum tape library solution that is presented as a storage partition: data copied there automatically goes to tape. We keep the tapes for 3 years.

You can consider cold storage on Google or AWS. While cold cloud storage is cheap to keep, you will incur retrieval (egress) charges, which can be expensive. You can also consider converting data to uBAM or CRAM (if a reference is available) to save space in general.
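
If you go the CRAM route, it is easy to script; a minimal sketch (assumes samtools is installed, and the file and reference names are placeholders, not anything from this thread):

    # Minimal sketch: convert an aligned, coordinate-sorted BAM to CRAM with samtools.
    # File names and the reference path are placeholders.
    import subprocess

    bam = "sample.bam"      # coordinate-sorted BAM
    ref = "reference.fa"    # the exact reference the BAM was aligned to
    cram = "sample.cram"

    # samtools view -C writes CRAM; -T supplies the reference used for compression
    subprocess.run(["samtools", "view", "-C", "-T", ref, "-o", cram, bam], check=True)

    # Sanity check before deleting the BAM: compare read counts
    n_bam = subprocess.run(["samtools", "view", "-c", bam],
                           capture_output=True, text=True, check=True).stdout.strip()
    n_cram = subprocess.run(["samtools", "view", "-c", cram],
                            capture_output=True, text=True, check=True).stdout.strip()
    assert n_bam == n_cram, "read counts differ; keep the BAM"

Bear in mind that CRAM needs the same reference for decoding, so keep the reference (or a record of where to obtain it) with the archive.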

If the data is going to be published, you will eventually want to submit it to SRA/ENA/DDBJ anyway, so a copy is stored there. There is a facility to embargo it until publication (or for at least a year, I think), so you are covered.

i.sudbery (Sheffield, UK) wrote, 1 day ago:

I think this is an important, and unsolved, problem.

For "burning hot" (upto 2 months lifespan), we use the centrally managed HPC scratch space, which has 600TB of space (lustre, connected by infiniband directly to compute nodes), but is shared across the entire institution (about 1000 users).

For "hot" data, we use a high-performance, cluster-out storage cluster (NetApp) run and managed by the institution on which we buy space at $300/TB/year for mirrored and backed up storage we currently have 20TB here and I expand it whenever I have spare cash lying around.

For cold data we use the cloud (where legally allowed). My institution has an agreement with the cloud provider under which there is no limit on the amount of space we can use, but there is a daily up/down bandwidth limit of 1 TB per research group.
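
A daily cap like that is easy enough to respect with a small script; a minimal sketch of the idea (the tool choice of rclone, the remote name and the paths are illustrative assumptions, not our actual setup):

    # Illustrative sketch: push cold project directories to cloud storage while
    # staying under a ~1 TB/day transfer cap. rclone, the remote name and the paths
    # are assumptions for illustration only.
    import subprocess
    from pathlib import Path

    DAILY_CAP = 1 * 1024**4             # ~1 TB in bytes
    COLD_ROOT = Path("/archive/cold")   # placeholder local staging area
    REMOTE = "cloud:lab-archive"        # placeholder rclone remote

    def dir_size(path: Path) -> int:
        return sum(f.stat().st_size for f in path.rglob("*") if f.is_file())

    sent = 0
    for project in sorted(p for p in COLD_ROOT.iterdir() if p.is_dir()):
        size = dir_size(project)
        if sent + size > DAILY_CAP:
            break                       # stop for today; rerun tomorrow
        # rclone copy skips files already present on the remote, so reruns are cheap
        subprocess.run(["rclone", "copy", str(project), f"{REMOTE}/{project.name}"],
                       check=True)
        sent += size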

As noted by @genomax, in the longest term we rely on SRA/ENA to store our raw data. Our biggest problem is that raw data is generally only a small fraction of the data associated with a project: a project with 100 GB of raw data can easily produce over 1 TB of analysis products. Some of these can be safely discarded, but it's hard to know which, and others have to be retained for record-keeping purposes. It's this intermediate "grey" data that really poses a problem.
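
One thing that at least shows where the grey data is accumulating is a quick per-extension usage report over a project tree, so the big regenerable intermediates (.sam files, unsorted BAMs, etc.) stand out; a minimal sketch (the path is a placeholder):

    # Sketch: summarise disk usage by file extension under one project directory,
    # to spot large regenerable intermediates (e.g. .sam, unsorted .bam).
    from collections import Counter
    from pathlib import Path

    usage = Counter()
    for f in Path("/data/projects/example").rglob("*"):
        if f.is_file():
            usage[f.suffix or "(no ext)"] += f.stat().st_size

    for ext, size in usage.most_common(15):
        print(f"{ext:>12}  {size / 1024**3:8.1f} GiB")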

harold.smith.tarheel (United States) wrote, 6 days ago:

My decidedly heterodox position (probably due to my foundational training in molecular biology) is to store the 'cold' data as library DNA in a -80 °C freezer. DNA is a technologically stable and incredibly information-dense medium: a small freezer could easily accommodate the equivalent of petabytes to exabytes of data at a fraction of the price of digital media. Plus, storage costs for most cold datasets are wasted, in the sense that they will never be reanalyzed, which makes re-sequencing the few reusable ones cost-effective.
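
To put rough numbers on the trade-off (all of these figures are illustrative assumptions, not our facility's prices), the comparison looks something like this:

    # Back-of-envelope comparison with purely illustrative numbers (not real quotes):
    # keep every cold dataset on disk for 7 years vs. re-sequence the few that are
    # ever revisited from the archived library.
    storage_cost_per_tb_year = 50.0   # assumed cold-storage price, $/TB/year
    years_retained = 7
    dataset_size_tb = 1.0             # assumed size of one cold dataset
    reseq_cost = 2000.0               # assumed cost to re-sequence one library
    reanalysis_fraction = 0.05        # assumed fraction of cold datasets ever reused

    keep_on_disk = storage_cost_per_tb_year * years_retained * dataset_size_tb
    reseq_expected = reanalysis_fraction * reseq_cost

    print(f"store for 7 years:            ${keep_on_disk:.0f} per dataset")
    print(f"re-sequence on demand (exp.): ${reseq_expected:.0f} per dataset")

With those assumed numbers the on-demand route wins, but the balance obviously shifts with your prices and how often cold data actually gets revisited.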

But I've found that most users of our sequencing facility are strongly opposed to this suggestion - I would be interested to hear feedback from the Biostars community.


I agree in principle with your solution, but it may only be viable for an individual lab. Sequencing facilities deal with tens of thousands of samples, and storing libraries at -80 °C for years quickly becomes unwieldy.

written 6 days ago by genomax

I think that DNA should be stable long term at room temperature under proper storage conditions.

written 6 days ago by Jean-Karim Heriche

You'd be surprised how easy it is. 40K+ clones (two whole-genome RNAi libraries for C. elegans) fit in a couple freezer racks, and we retrieve samples from those regularly (much more frequently than cold datasets).

written 6 days ago by harold.smith.tarheel

I guess it depends on your use case - I've never had a dataset I haven't gone back to at least once a year.

written 1 day ago by i.sudbery
colindaven (Hannover Medical School) wrote, 1 day ago:

We have similar levels of data generation, about 20-30 TB a year at present. We can't use cloud storage for legal reasons.

  • Hot data: 24 TB SSD, on the compute cluster.
  • Warm data: 100-150 TB online, on the local SLURM cluster; a mix of scratch partitions and snapshotted NetApp.
  • Cold data: 10 TB tape (backed up from warm to tape every 6-12 months).

We also back up to two local 60 TB MooseFS storage pools, run on a) 60 TB of internal RAIDs spread across 6 workstations and b) 60 TB of very large (6-8 TB) external hard disks spread across 3 workstations. Slow, but it seems stable, and it maintains two redundant copies of each chunk. It is open-source software as well, so definitely a very cheap option, but there is no deduplication.

I am starting to look at ZFS because of its snapshotting and deduplication properties.
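
For the snapshot side, rolling dated snapshots with simple retention are easy to script on top of the zfs CLI; a minimal sketch (the dataset name and retention count are placeholders):

    # Minimal sketch: take a dated ZFS snapshot and prune old ones. The dataset name
    # and the 30-snapshot retention are placeholders; needs root or zfs-allow rights.
    import subprocess
    from datetime import date

    DATASET = "tank/ngs"   # placeholder dataset
    KEEP = 30              # number of dated snapshots to retain

    subprocess.run(["zfs", "snapshot", f"{DATASET}@{date.today().isoformat()}"],
                   check=True)

    # List snapshots oldest-first and destroy any beyond the retention window
    out = subprocess.run(
        ["zfs", "list", "-t", "snapshot", "-H", "-o", "name", "-s", "creation",
         "-r", DATASET],
        capture_output=True, text=True, check=True)
    snaps = [s for s in out.stdout.splitlines() if s.startswith(f"{DATASET}@")]
    for old in snaps[:-KEEP]:
        subprocess.run(["zfs", "destroy", old], check=True)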
