Question: Medium-Sized Data Backup Strategies
7
written 8.1 years ago by Niallhaslam (2.2k), Dublin:

Hi all,

I'm interested to hear people's perspectives on backing up the contents of their group's workstations, laptops, etc. in some kind of co-ordinated way, hopefully with something that could be applied to a medium-sized group. My group is a mix of bioinformatics and wet-lab scientists. The work in the lab (i.e. the raw data) is backed up on servers. The big analysis work is all done on clusters and HPC, so that is well backed up as well. The gap that exists is for the likes of group presentations, small analyses and smaller projects. Code is backed up using version control.

I know there is a lot of chat about NGS and the storage requirements there, but this is a different problem, one that has probably been solved before. Still, I feel it's worth revisiting to see if anyone has found an easier solution. In previous jobs I've just rsync'd /home/ to an offsite computer and forgotten about it, but in a heterogeneous work environment it's not possible to do this for the whole group, which is what I would like to do. Like I say, the mission-critical stuff is pretty recoverable, but I would like to improve the recoverability of the rest.
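
For context, this is roughly the sort of single-machine rsync job I mean, run nightly from cron. A minimal sketch; the host name "backuphost", the destination path and the exclude patterns are placeholders, not anything we actually run:

    #!/bin/sh
    # Nightly push of /home/ to an offsite machine over SSH.
    # "backuphost" and /backups/... are hypothetical names.
    rsync -az --delete \
        --exclude '.cache/' --exclude '*.tmp' \
        /home/ backuphost:/backups/$(hostname)/home/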

First off, the specs: 10-15 users, each with say 100 GB of random datasets, code, analyses, presentations, papers, manuscripts, etc. that may lurk on a mix of Windows, Mac and Linux laptops and desktops. Has anyone any experience setting up a Network Attached Storage (NAS) system, for example, where all users could read/write to a central NAS server? Any pitfalls?

What backup systems have others in place for disaster recovery of their workstations/laptops?

Has anyone successfully implemented a group-based strategy for a heterogeneous work environment in a university setting (i.e. without paying through the nose)?

Currently I'm looking at buying a NAS, placing it in a building at the other end of campus and filling it with traditional hard drives. Synology and Drobo would be examples of what I mean.

I should say I back up my own stuff daily, but this is about having something stable for the group, who aren't as paranoid about data plans. I went to uni in Southampton when one of the Comp Sci buildings went down: http://www.ecs.soton.ac.uk/podcasts/video.php?id=46

modified 8.0 years ago by Giovanni M Dall'Olio (26k) • written 8.1 years ago by Niallhaslam (2.2k)

The moment you've had a failure is not the time to find out that your backups weren't functioning correctly.

No matter what you choose, test it periodically.

Backups are easy, restores are hard.

written 8.1 years ago by Gareth Palidwor (1.6k)

I just wanted to add that I haven't accepted an answer yet, as all of the suggested solutions solve different aspects of the problem very well.

written 8.0 years ago by Niallhaslam (2.2k)
8
written 8.1 years ago by Istvan Albert ♦♦ (79k), University Park, USA:

I have found Dropbox to be an ideal and non-intrusive way to back up relatively small, fragmented datasets distributed over a wide variety of platforms.

Each user sets up their own Dropbox instance and makes sure to save the data that needs to be backed up into a path that Dropbox monitors. For larger datasets a dedicated solution is needed, but those never work well for lots of small fragmented pieces of information.
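
One low-effort way of getting existing directories into the monitored path, assuming a standard ~/Dropbox folder on a Mac or Linux machine (the directory names below are made up), is to move them in and leave a symlink behind:

    # Move a project into the synced folder, then symlink it back
    # so existing paths and habits keep working; Dropbox then
    # backs it up like everything else under ~/Dropbox.
    mkdir -p ~/Dropbox/backup
    mv ~/projects/rnaseq-analysis ~/Dropbox/backup/
    ln -s ~/Dropbox/backup/rnaseq-analysis ~/projects/rnaseq-analysis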

One important (but often forgotten) aspect of centralized large-scale backups is privacy: who can see and recover information that you may not want to be accessible to others?

written 8.1 years ago by Istvan Albert ♦♦ (79k)
3

Dropbox is not limited to 2 GB per user; that's just all you get for free. Storage of any quantity tends not to be free.

written 8.1 years ago by Daniel Swan (13k)
1

Dropbox seems to be limited to 2 GB per user. I use it for some stuff already and like it.

The point about privacy (permissions) is very important though, and difficult to solve for heterogeneous computing labs. Makes me nostalgic for the pure Linux days!

written 8.1 years ago by Niallhaslam (2.2k)
6
written 8.1 years ago by Aleksandr Levchuk (3.1k), United States:

Centralized

Having everyone log in to one central Biocluster with a dedicated NAS has many benefits:

  1. One common environment to learn
  2. More software ready to use
  3. One issue = one troubleshooting
  4. Getting a new user started takes minutes
  5. All data is in one place

This last point makes it easy to back up data securely and efficiently. A simple rsync to a heavily firewalled off-site computer with cheap DAS storage shelves will be good enough. Here is an old but well-tested howto: Easy Automated Snapshot-Style Backups with Linux and Rsync.
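
A minimal sketch of that snapshot style using rsync's --link-dest, so that files unchanged since the previous night become hard links rather than fresh copies; the host name "nas" and the paths are placeholders:

    #!/bin/sh
    # Run on the off-site backup machine: pull tonight's copy from
    # the NAS; unchanged files are hard-linked into the previous
    # snapshot, so each snapshot only costs the changed data.
    DEST=/backups/biocluster
    TODAY=$(date +%F)
    rsync -az --delete \
        --link-dest="$DEST/latest" \
        nas:/data/ "$DEST/$TODAY/"
    ln -sfn "$DEST/$TODAY" "$DEST/latest"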

The rsync approach may not be the most efficient, because it needs to walk through your entire dataset every night to figure out the differences, and related operations on snapshots also take a long time. The alternative is a copy-on-write transactional filesystem like ZFS (see ZFS for NGS data analysis). ZFS already knows what changed throughout the day; to make a backup it just needs to replay the log. Things are a little more complicated with ZFS because it does not run on Linux yet. I'm looking into switching to ZFS backups anyway, because the hardware that will send and receive the backups will not be running anything else, so for those parts of the infrastructure I can choose any operating system (OpenSolaris, OpenIndiana, or FreeBSD).
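
In outline, ZFS replication is just snapshot plus incremental send/receive. A sketch assuming a pool called "tank" and a backup host reachable over ssh (both names, and the snapshot labels, are made up):

    # Take tonight's snapshot, then send only the changes since the
    # previous snapshot to the backup host, which must already hold
    # that previous snapshot on the receiving filesystem.
    zfs snapshot tank/data@tonight
    zfs send -i tank/data@lastnight tank/data@tonight | \
        ssh backuphost zfs receive backup/data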

Two pitfalls that I encountered and solved when using a NAS:

  1. Adding workstations to the compute infrastructure is a bad idea. It's OK if the compute nodes freeze for an hour when incidents occur (e.g. while the NFS server is fixed and rebooted); nobody will even notice. On the other hand, if you have workstations on the same NAS, the lab will be completely paralyzed (even all the web browsers will freeze). There are several other reasons to keep the workstations off the NAS. Get everyone used to keeping the data on the central compute server.
  2. Running the NFS server on the head node is also a bad idea. You should have dedicated hardware for NFS (see the sketch below): the performance will go up, the number of freeze-ups will go down, fixing things will be easier, and you will have more flexibility. You can run the backup scripts from this server.
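
As a rough illustration of what the dedicated NFS box might export, restricted to the compute nodes rather than every workstation; the subnet and path are assumptions, not our actual configuration:

    # /etc/exports on the dedicated NFS server: export the data
    # volume read-write to the compute nodes' subnet only.
    /data  10.0.0.0/24(rw,sync,no_subtree_check)

    # Reload the export table after editing:
    exportfs -ra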

Decentralized

If you cannot avoid an environment where the data is spread across many workstations and laptops, then I don't know of any good solutions.

HashBackup would have been a good solution if it were usable. Unfortunately it's in a beta which expires every 4 months.

HashBackup encrypts the data, so it would have gone well with what Istvan said about privacy. Unfortunately, it's not open source, and that openness concern is why I didn't start using it when I first learned about it a year ago; I don't trust single-person projects that are not open source. In the end it was a good choice not to trust it, because the developer then started doing the 4-month expiration trick.

The approach itself is a good one, though. Perhaps there are commercial solutions that do something similar (maybe the one that @Brad uses).

modified 8.1 years ago • written 8.1 years ago by Aleksandr Levchuk (3.1k)
1

+1 For the dissertation :)

written 8.1 years ago by Eric Normandeau (10k)

+1 for dedicated NFS server. Learned the hard way.

written 8.1 years ago by Daniel Swan (13k)
4
written 8.1 years ago by Brad Chapman (9.4k), Boston, MA:

Our department uses CrashPlan:

http://www.crashplan.com/

Setup and installation are simple, and there are clients for Windows, Mac and Linux. It's not free, but pricing is reasonable.

written 8.1 years ago by Brad Chapman (9.4k)

It's not free for the cloud backups, but from looking at the site it seems to suggest that mirroring computers to each other or to a NAS should be free? Is that right? If so, it sounds great!

written 8.1 years ago by Niallhaslam (2.2k)

I think it is free to get started backing up locally. We run CrashPlan Pro, so I don't have a lot of experience with all the functionality of the free version, but if it is anything like Pro, it all runs smoothly and easily. It sounds like you could get started with the free version and see if it works for you; then you'll have a better idea of whether Pro makes sense. Glad this helps.

written 8.1 years ago by Brad Chapman (9.4k)
3
written 8.1 years ago by Daniel Swan (13k), Aberdeen, UK:

After having got a bit cheesed off with flaky RAID systems (in the 3-10 TB range) and frustratingly complex tape backup systems, we went out and bought two 20 TB Viglen RAID systems, with plenty of room for expansion.

RAID is not backup, but effectively the second RAID unit is the disk equivalent of a tape library. We learned our lessons with RAID5 systems, so these are RAID6 with hot spares.

backup2l provides an incredibly simple backup solution between the two machines. Currently working a treat and a weight off my mind.
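
For anyone curious, a sketch of how such a setup might be driven: a nightly cron entry on the first unit invoking backup2l, whose actual behaviour (source directories, archive location on the second unit) lives in /etc/backup2l.conf. The schedule, log path and install path are guesses, not Daniel's configuration:

    # crontab entry: run the backup at 02:30 every night;
    # what gets backed up and where is set in /etc/backup2l.conf.
    # m  h  dom mon dow  command
    30 2 * * * /usr/sbin/backup2l -b >> /var/log/backup2l-cron.log 2>&1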

written 8.1 years ago by Daniel Swan (13k)

+1 for RAID6 with hot spares

written 8.1 years ago by Aleksandr Levchuk (3.1k)

+1 for mentioning that RAID is not necessarily a backup solution. Explanation: if you delete a file, it gets deleted from all mirrored disks.

written 8.1 years ago by Michael Schubert (6.9k)
2
written 8.1 years ago by Casey Bergman (18k), Athens, GA, USA:

For backup of ~10 Linux and Mac laptops and workstations in our group (our central admin backs up our cluster and anything on it to tape), we use a Synology DS1010 base plus a DX510 expansion unit, each with 5 x 2 TB SATA HDDs. We have the base and expansion unit configured as separate volumes, since if they are configured as one single volume, failure of one device crashes the other. When configured as RAID5, each volume has 7.15 TB capacity, so the whole unit has 14.3 TB for about £1800 total from microdirect.co.uk.

In terms of setup and use the Synology is extremely easy. We had it up and running within an hour out of the box, and it can be mounted easily using NFS as well as via AFP from Macs (it has support for Windows too, but I haven't tried this yet). Backups are done via rsync. No trouble with this device since Aug 2010. So in terms of price, stability and ease of use, I can certainly recommend this unit as a cheap NAS backup system at the scale you are interested in.
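
As an illustration only, a per-machine rsync job against an NFS mount of the NAS might look like the following; the share name, mount point and source directory are assumptions rather than our actual setup:

    # Assumes the Synology share is already mounted on the client,
    # e.g.  mount -t nfs synology:/volume1/backup /mnt/nas-backup
    mkdir -p /mnt/nas-backup/$(whoami)
    rsync -a --delete ~/Documents/ /mnt/nas-backup/$(whoami)/Documents/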

written 8.1 years ago by Casey Bergman (18k)
1

+1 for the good price. How do you let the users access the laptop backups? If a laptop gets compromised, would the adversary be able to remove the backups (none, only that one laptop's, or the entire lab's)?

written 8.1 years ago by Aleksandr Levchuk (3.1k)
1
written 8.1 years ago by Giovanni M Dall'Olio (26k), London, UK:

We keep a copy of all the code in Bitbucket repositories; it is good because there is no disk space limit, but I don't recommend it for very big files.
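
For illustration, mirroring a local repository to Bitbucket is a one-off remote setup plus a push; the repository URL below is a placeholder, and if the repository is Mercurial rather than git, the equivalent is hg push to the Bitbucket URL:

    # Register the Bitbucket repository as a remote and push
    # every branch to it as an off-site copy of the code.
    git remote add bitbucket https://bitbucket.org/ourlab/analysis-scripts.git
    git push bitbucket --all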

The data, sad but true, is backed up on external hard drives (I make incremental backups every day) and on a cluster.

Since we do not work with huge data, as long as the scripts are stored in a remote repository and we can access them from anywhere, we can re-download or recreate any result starting from there.

written 8.1 years ago by Giovanni M Dall'Olio (26k)