The appropriate solution depends on how much data you expect to generate and what type of data you need to share. The best advice I have heard from experts in the field is to estimate what you will generate over the next couple of years and multiply that by at least a factor of 5 (we can't predict the future, so even that may be way off).
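The rule of thumb above is simple arithmetic, and a minimal sketch makes it easy to rerun as your estimates change. The function name `projected_storage_tb`, the 2-year horizon, and the example figure of 10 TB per year are assumptions for illustration only; the factor of 5 is the one suggested above.

```python
def projected_storage_tb(tb_per_year: float,
                         years: float = 2.0,
                         safety_factor: float = 5.0) -> float:
    """Estimate how much storage to provision, in terabytes.

    tb_per_year   -- expected data generated per year (your own guess)
    years         -- planning horizon; "a couple of years" is taken as 2
    safety_factor -- multiplier to cover what you cannot predict
    """
    if tb_per_year < 0 or years < 0:
        raise ValueError("tb_per_year and years must be non-negative")
    if safety_factor < 1:
        raise ValueError("safety_factor should be at least 1")
    return tb_per_year * years * safety_factor


if __name__ == "__main__":
    # Hypothetical lab producing 10 TB of raw data and alignments per year.
    print(f"Provision about {projected_storage_tb(10.0):.0f} TB")
```

With these made-up inputs the estimate comes out to 100 TB; the point is the multiplier, not the specific numbers.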
We use a RAID 50 storage array (not for computing against), and we keep just the raw FASTQ and BAM files (and, of course, we back up the analyses). I've tried to follow the development of compression/storage solutions like CRAM, but I'm not convinced any of these formats will become common enough to be easy to share and use in the future, so we simply keep the raw data, alignments, etc. The high-volume, faster disks for computing against are considerably more expensive, so we have less storage on them. You want to minimize moving large amounts of data around, so try to design a system where your storage is connected to where you will analyze the data.
For sharing data, offer to ship it on hard drives by regular mail. Your IT people probably won't want you to tie up the bandwidth with large or numerous downloads, and your collaborators won't want to spend more than 1,000 hours getting the raw data. Of course, if you only need to share assemblies or annotations, those can easily be placed on your server for download.
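To decide between a download and a mailed drive, it helps to estimate the transfer time first. The sketch below is a back-of-the-envelope calculation only: the function name `transfer_hours`, the 80% link efficiency, and the example data set size and link speed are all assumptions, not measurements from our setup.

```python
def transfer_hours(size_tb: float,
                   link_mbps: float,
                   efficiency: float = 0.8) -> float:
    """Estimate hours needed to move `size_tb` terabytes over a network link.

    size_tb    -- data set size in decimal terabytes (1 TB = 1e12 bytes)
    link_mbps  -- nominal link speed in megabits per second
    efficiency -- assumed fraction of the nominal speed actually achieved
    """
    if link_mbps <= 0 or not 0 < efficiency <= 1:
        raise ValueError("link_mbps must be > 0 and efficiency in (0, 1]")
    total_bits = size_tb * 1e12 * 8               # bytes -> bits
    effective_bps = link_mbps * 1e6 * efficiency  # usable bits per second
    return total_bits / effective_bps / 3600      # seconds -> hours


if __name__ == "__main__":
    # Hypothetical case: 36 TB of raw data over a shared 100 Mbit/s link.
    hours = transfer_hours(36.0, 100.0)
    print(f"About {hours:.0f} hours ({hours / 24:.0f} days)")
```

Under these assumptions a 36 TB data set takes on the order of 1,000 hours, which is why a box of hard drives in the mail is often the faster option.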
I'm not aware of any journals that want or require you to deposit the raw data anywhere for large sequencing projects these days, so it's probably up to you to maintain copies of your data. This is a very timely question, and I'd be interested to hear what other labs and institutes are doing to set per-member access privileges for particular data sets, and how they store, analyze, and share the data.
modified 6.7 years ago
6.7 years ago by
SES ♦ 8.2k