Forum: Storage Solutions For Genomic Research Centers
8.7 years ago
guillemch ▴ 140

Hi everybody,

I would like to open a discussion about the storage solutions being used in different genomic research centers. I'll start with our case and why I'd like to know what solutions other people around the world are using.

We have 6 Illumina HiSeq NGS machines and 2 MiSeq. As you well know (see "What Are Numbers Every Bioinformatician Should Know?"), these machines generate large amounts of data per day. The technological challenge is not only the amount of data (around a TB per day) but also its structure: a sequencing run generally produces thousands of small files (images, control files, stats files, etc.). Until a few months ago, we were using a traditional file system to store our sequencing data, and an lsyncd daemon continuously transferred the data to the analysis machines for processing. The problems with this approach are that we need to be continuously transferring the data, we don't have a central location for it, and there is no data reliability.

In order to solve these problems, we started using a distributed file system, MooseFS. Unfortunately, we're experiencing unacceptable transfer rates (around 1 GB/hour) when transferring sequencing data. I've been doing some research, and indeed these kinds of file systems are not optimised for large numbers of small files. By contrast, transferring an 8 GB tarball takes only about 2 minutes.
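To put the two figures side by side: 1 GB/hour is about 0.28 MB/s, while 8 GB in 2 minutes is about 68 MB/s, roughly 240 times faster. One possible workaround, sketched below under assumptions, is to pack each run directory into a single tar archive before it touches the distributed file system. The function names and the choice of an uncompressed archive are illustrative only, and `tar_path` is assumed to live outside `run_dir`.

```python
import tarfile
from pathlib import Path


def bundle_run(run_dir: str, tar_path: str) -> int:
    """Pack every regular file under run_dir into one uncompressed tar archive.

    Returns the number of files archived. Assumes tar_path is outside run_dir.
    """
    run = Path(run_dir)
    count = 0
    # Mode "w" writes a plain tar with no compression (an illustrative choice).
    with tarfile.open(tar_path, "w") as tar:
        for path in sorted(run.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the run name is kept.
                arcname = path.relative_to(run.parent).as_posix()
                tar.add(str(path), arcname=arcname, recursive=False)
                count += 1
    return count


def throughput_mb_per_s(gigabytes: float, seconds: float) -> float:
    """Convert a transfer of `gigabytes` in `seconds` to MB/s (1 GB = 1024 MB)."""
    return gigabytes * 1024 / seconds


if __name__ == "__main__":
    small_files = throughput_mb_per_s(1, 3600)  # the ~1 GB/hour case
    tarball = throughput_mb_per_s(8, 120)       # the 8 GB in ~2 minutes case
    print(f"small files: {small_files:.2f} MB/s, tarball: {tarball:.2f} MB/s")
```

The trade-off is that individual files are no longer directly addressable on the shared storage, so this only fits steps that consume a whole run at once.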

So, I'd like to know what you are using in your centers:

  • What solutions are you using? Especially in terms of file systems.
  • Have you experienced the same problems with any other or the same parallel file system?
  • Are you using parallel file systems at all?
  • How do you transfer your data and keep it in sync between machines?

You're very welcome to add or ask for any other information. Let's discuss!

Thanks everybody in advance!

P.S: Some related posts that didn't clarify much for me:


We have ~15 PB of storage servicing 30+ Illumina HiSeqs (and a variety of other platforms). We use GPFS mostly.


Hi Malachi! 30+ Illumina HiSeqs? That sounds terrifying in data generation terms! How does GPFS perform? Do you have your machines configured to write directly to it, or do you use it as a backup solution? Could you please provide some numbers and a bit of information about your infrastructure? I'd really appreciate that. Thanks!

8.7 years ago

Guy Coates uploaded a lot of documents to SlideShare about his experience at Sanger: http://www.slideshare.net/gcoates/presentations


Thanks!! It's very useful!!


Thank you very much!

8.7 years ago

A few more relevant threads from the past:

  1. Tips to build a data storage for bioinformatics
  2. Big data: storage and analysis

Oh thanks, I had missed these :-)

8.7 years ago
always_learning ★ 1.1k

Hi guillemch,

These are a few file transfer tools that have been used for copying files in NGS analysis:

1) UDT: http://udt.sourceforge.net/software.html
2) FDT: http://monalisa.cern.ch/FDT/
3) Tsunami: http://tsunami-udp.sourceforge.net/

We are looking forward to using these tools in the future. Aspera is also a solution, but it is quite costly! The tools above, however, are open source.


Hi syednajeebashraf, thanks for your answer :-). The tools you propose look really good; however, if I understood correctly, they will not solve our problem. These tools optimise data transfer over the network, i.e. they may be an improvement over our rsync/lsyncd. The problem, though, still lies in the underlying file system. Even if you can transfer files very fast, the I/O performance of our file system will still be the bottleneck, and I'm afraid these tools can't help with that. Thanks again!
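One way to check whether the file system, rather than the network, is the bottleneck is a local micro-benchmark that writes the same number of bytes as many small files and as one large file. The following is only a rough sketch: the function names, file counts, and sizes are made up for illustration, and the temporary directory should be replaced by a directory on the file system under test.

```python
import os
import tempfile
import time


def time_small_files(directory: str, n_files: int, size: int) -> float:
    """Write n_files files of `size` bytes each; return elapsed seconds."""
    payload = os.urandom(size)
    start = time.perf_counter()
    for i in range(n_files):
        with open(os.path.join(directory, f"part_{i:06d}.bin"), "wb") as fh:
            fh.write(payload)
            fh.flush()
            os.fsync(fh.fileno())  # force each file to storage
    return time.perf_counter() - start


def time_one_file(directory: str, n_files: int, size: int) -> float:
    """Write the same total number of bytes into one file; return elapsed seconds."""
    payload = os.urandom(size)
    start = time.perf_counter()
    with open(os.path.join(directory, "single.bin"), "wb") as fh:
        for _ in range(n_files):
            fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # one sync for the whole file
    return time.perf_counter() - start


if __name__ == "__main__":
    # Placeholder target: swap in a directory on the mount being tested.
    with tempfile.TemporaryDirectory() as target:
        small = time_small_files(target, 2000, 4096)
        single = time_one_file(target, 2000, 4096)
        print(f"2000 x 4 KiB files: {small:.2f} s; one 8000 KiB file: {single:.2f} s")
```

If the small-file case is far slower on the distributed mount than on local disk while the single-file case is comparable, that would point to per-file metadata overhead rather than raw bandwidth.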
