Downloading a bulk amount of data from an FTP site
9.2 years ago
bioinfo ▴ 840

We are planning to download a very large number of metagenome datasets from MG-RAST (over 2000 metagenomes, raw data). I need around 5 TB of storage for the sequence data, as I noticed that 80 metagenomes have already consumed 200 GB of space (zipped, downloaded from the API site). I guess that I will also need a few terabytes of space for data analysis and results. We have already run out of disk space. Considering our current storage situation on the server, one of my colleagues suggested: "Store only the FTP addresses etc. of all the files there, so you can go back for them should you need to in the future. Storing all the data is problematic. That is, we never store any DNA sequence data on our server, only the addresses where we got it from."

Until now, I thought FTP sites were just for downloading data, so how does "storing FTP addresses" work as a replacement for downloading the sequence data for downstream analysis, such as running BLAST against the NCBI nr database? Any suggestions?

How do you guys store large datasets on the server?

fastq MG-RAST • 3.5k views
9.2 years ago

I will start by saying that 5TB is a pretty small dataset in the genomics era, so if you are going to be working with genomics datasets, consider procuring some storage.

I think what your colleague is suggesting is that, since the samples are independent of each other, you needn't download all the data at one time. If you have a script that normally starts with a file name, just have your script start with "download the file" instead. At the end of the processing, simply remove the original file, since it is still available on the FTP server.
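
To make that concrete, here is a minimal Python sketch of the download, process, delete loop. It is only an illustration under some assumptions: the function names `process_remote_files` and `count_bytes` are made up for this example, the per-file analysis is a placeholder for whatever your pipeline does, and the list of addresses would come from your own file of stored FTP URLs.

```python
import os
import tempfile
import urllib.request


def process_remote_files(urls, process, fetch=urllib.request.urlretrieve):
    """Download each URL, run process(path) on the local copy, then delete it.

    Only the URL and the processing result are kept; the raw data never
    accumulates on disk. `fetch` is injectable so the loop can be tested
    without network access.
    """
    results = {}
    for url in urls:
        # Reserve a temporary file that holds one dataset at a time.
        fd, local_path = tempfile.mkstemp(suffix=".download")
        os.close(fd)
        try:
            # urlretrieve accepts ftp:// as well as http(s):// URLs.
            fetch(url, local_path)
            results[url] = process(local_path)
        finally:
            # Remove the raw file whether or not processing succeeded.
            if os.path.exists(local_path):
                os.remove(local_path)
    return results


def count_bytes(path):
    """Placeholder analysis step (hypothetical): report the file size."""
    return os.path.getsize(path)


# Example use, assuming a hypothetical text file "urls.txt" with one address per line:
#   with open("urls.txt") as handle:
#       urls = [line.strip() for line in handle if line.strip()]
#   sizes = process_remote_files(urls, count_bytes)
```

With this pattern, peak disk usage is roughly one raw file plus its results, rather than the whole collection.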

An alternative approach is to use the cloud for this type of work since you can easily access large storage and as much compute as you need. When you are done, you simply remove everything except the processed data.


Agree with the comments above. One other alternative is to stream files from the FTP locations into your code, if your code/programs can work with streaming data.
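
As a rough sketch of what streaming could look like in Python: the example below reads a gzip-compressed FASTQ file straight from a URL and counts records without writing anything to disk. The assumptions here are mine, not something MG-RAST guarantees: that the remote file is gzip-compressed, that it uses the standard four-lines-per-record FASTQ layout, and the function names `count_fastq_records` and `stream_count` are invented for illustration.

```python
import gzip
import io
import urllib.request


def count_fastq_records(binary_stream):
    """Count FASTQ records in a gzip-compressed binary stream.

    The stream is decompressed and read line by line, so memory use stays
    small regardless of file size. Assumes four lines per record.
    """
    lines = 0
    with gzip.GzipFile(fileobj=binary_stream) as gz:
        for _ in io.TextIOWrapper(gz, encoding="ascii", errors="replace"):
            lines += 1
    return lines // 4


def stream_count(url):
    """Open a remote file (ftp:// or http(s)://) and count its records on the fly."""
    with urllib.request.urlopen(url) as response:
        return count_fastq_records(response)
```

The same idea applies to any tool that can read from standard input: the data is consumed as it arrives, and nothing but the result is stored.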

Cloud seems most scalable.


