Question: How do I deal with this big data?
4 weeks ago, F (3.4k), Iran, wrote:

Hi,

I have 40 whole-genome sequencing .bam files, each about 1 TB. Our lab manager downloaded them from Cambridge University to her hard drive. My office computer runs Windows, so I have to use the Linux compute cluster at our university to analyze my data. To do that, I drag and drop each .bam file to my scratch space on the cluster. Each file takes about 2 days to transfer, and when it finishes the .bam file turns out to be corrupted. I was given this data about a month ago and I am still struggling to transfer it. If you were in my place, what would you do?

I thought of asking the lab manager to download these .bam files directly to the compute cluster (but she has not done so yet). I also thought of asking IT services to install Linux on my computer, although I guess I would still need the cluster for the computing :(

I really don't know what to do.

Any advice, please?

wgs transfer • 259 views
modified 4 weeks ago by ATpoint (15k) • written 4 weeks ago by F (3.4k)

Why not ask your IT service how they advise uploading to their cluster?

written 4 weeks ago by jrj.healey (11k)

I asked. They say this is a common problem and that I should use MobaXterm to connect to the HPC and drag and drop (I did, and it failed). They also say we can mount our private filestore on the HPC and copy files from there to scratch (the filestore is mounted, but I need permission).

written 4 weeks ago by F (3.4k)

If you really have 40 TB, don't drag/drop. Use rsync or this is gonna be a nightmare. And don't let this manager convince you otherwise. If this is really his/her advice, he/she has little experience with large data handling.

modified 4 weeks ago • written 4 weeks ago by ATpoint (15k)

See if they're willing to just do it for you for a given fee. You have better things to spend your time on than this and they're moving data around all the time anyway.

written 4 weeks ago by Devon Ryan (89k)

Is it not possible to transfer directly from Cambridge University to your local cluster via ssh (rsync would be my choice)? Otherwise, go somewhere with a fast connection (at least 1 Gb/s) to your cluster and launch an rsync to transfer your data.

modified 4 weeks ago • written 4 weeks ago by Nicolas Rosewick (7.5k)

If your computer is on wireless and you are trying to transfer terabytes of data, then this is a fool's errand. At least find a computer with wired ethernet. Most campuses should have gigabit ethernet to the desktop (at least on some ports, if not all).

As has been said already, you should sftp/wget/curl this data directly to the server. Ask the lab manager for credentials if they are needed to download the data.

modified 4 weeks ago • written 4 weeks ago by genomax (65k)

At 40 TB, and guessing each hard drive holds 2 TB, that is 20 hard drives. You could use WinSCP to transfer from the Windows machine to the Linux scratch one at a time; depending on the speed, that will take about 20 days. Best to consult with IT: either download directly to scratch from Cambridge or let IT copy the drives.

Also, do you have access to 40TB space on scratch?

written 4 weeks ago by zx8754 (7.1k)

Sorry, next to each .bam file it says, for example, 105,905,561 KB, and I have 40 .bam files. Sorry if I was stupid in calculating how big they are.

In scratch I have 4 TB of space.

written 4 weeks ago by F (3.4k)

105,905,561 KB is ~0.1 terabyte (about 108 GB if KB here means 1024 bytes), so you have a total of a little over 4 terabytes of data.
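
As a quick sanity check on the sizes, here is the arithmetic as a small shell sketch. It assumes the "KB" shown by Windows means 1024 bytes and that all 40 files are about the same size as the one quoted; both are assumptions, not something confirmed in this thread.

```shell
# Size of one file as shown in Windows, in KB (assumed to mean 1024 bytes)
kb_per_file=105905561
n_files=40   # assumes all files are roughly the same size

bytes_per_file=$(( kb_per_file * 1024 ))
total_bytes=$(( bytes_per_file * n_files ))

# Whole decimal gigabytes (10^9 bytes), enough for a sanity check
echo "per file: $(( bytes_per_file / 1000000000 )) GB"
echo "total:    $(( total_bytes / 1000000000 )) GB"
```

Under these assumptions the total comes out a bit above 4.3 TB in decimal units, so it is worth checking how the cluster counts the 4 TB scratch quota before copying everything.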

modified 4 weeks ago • written 4 weeks ago by genomax (65k)

Sorry, sometimes I am just too stupid.

written 4 weeks ago by F (3.4k)

No worries. Now that the problem has been identified and solutions provided, you need to find and talk with the right people and execute a solution that works.

written 4 weeks ago by genomax (65k)

As I am on a Windows system, I am using MobaXterm to connect to the computing cluster, so I cannot use rsync. Otherwise I would have to ask an office mate to connect my hard drive to his Linux machine so that I can run rsync.

written 4 weeks ago by F (3.4k)

In any case, don't attempt the transfer unless you have a wired ethernet connection, to prevent timeouts etc.
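
To put the wired-connection advice in numbers, this rough sketch compares the throughput implied by the "about 2 days per file" from the question with an idealised 1 Gbit/s link. It assumes each file is 105,905,561 KB of 1024 bytes and ignores protocol and disk overhead, so treat the result as a best case, not a prediction.

```shell
bytes_per_file=$(( 105905561 * 1024 ))   # KB assumed to mean 1024 bytes
n_files=40

# Throughput implied by "about 2 days per file", in bits per second
observed_bps=$(( bytes_per_file * 8 / (2 * 24 * 3600) ))

# Idealised time for all files at 1 Gbit/s, ignoring any overhead
ideal_seconds=$(( bytes_per_file * n_files * 8 / 1000000000 ))

echo "observed: about $(( observed_bps / 1000000 )) Mbit/s"
echo "ideal at 1 Gbit/s: $(( ideal_seconds / 3600 )) h $(( ideal_seconds % 3600 / 60 )) min for all files"
```

With these numbers the drag-and-drop transfer ran at roughly 5 Mbit/s, while a saturated gigabit link would move everything in under 10 hours. Real transfers will be slower than the ideal, but the gap shows how much the connection matters.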

modified 4 weeks ago • written 4 weeks ago by genomax (65k)

What makes you think you can't use rsync in Moba?

written 4 weeks ago by jrj.healey (11k)

Because I cannot figure out how to point to my hard drive in this command. How does the command know where my files are? My .bam files are on a big external hard drive next to me.

written 4 weeks ago by F (3.4k)

You can access the locally mounted external drive under /mnt in MobaXterm. Use the same drive letter you see under Windows, e.g. /mnt/g/your_bam_files.
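
If you are not sure where the drive shows up, a small check like the one below can help. The extra candidate roots /drives and /cygdrive, the DRIVE_ROOTS variable and the drive letter g are assumptions on my side (different terminals on Windows mount drives in different places), so adjust the list to what your terminal actually uses.

```shell
# Print the first existing mount point for a Windows drive letter.
# DRIVE_ROOTS is a hypothetical, space-separated list of places to look.
find_drive() {
    for root in ${DRIVE_ROOTS:-/mnt /drives /cygdrive}; do
        if [ -d "$root/$1" ]; then
            echo "$root/$1"
            return 0
        fi
    done
    return 1
}

drive_letter="g"   # hypothetical: use the letter Windows shows for the external drive
if found=$(find_drive "$drive_letter"); then
    echo "drive found at $found"
    ls "$found" | head
else
    echo "no mount point found for drive $drive_letter, try 'df -h' to list mounts"
fi
```

Whatever path this prints is the one to use as the source in the rsync command.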

modified 4 weeks ago • written 4 weeks ago by genomax (65k)

As I am rubbish at the command line, I installed FileZilla and WinSCP. I am trying them and will wait until tomorrow to see whether any file gets transferred :(

written 4 weeks ago by F (3.4k)

It is no more difficult than opening a local terminal in MobaXterm and typing

rsync -axv --numeric-ids --progress -e "ssh -T -o Compression=no -x" /mnt/g/*.bam your_user_name@your_server_name:/folder_name_where_you_want_to_copy

Copy one file to begin with instead of *.bam, until you become comfortable.
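
For that first single-file test, something like the sketch below may help. The file name sample1.bam is made up, and --partial is my own addition (it keeps a partly transferred file so a rerun can pick up from it instead of starting over), not part of the command above. By default the script only prints the rsync command; set RUN=1 once the printed command looks right.

```shell
# Placeholder values, replace with your own
src_file="/mnt/g/sample1.bam"   # hypothetical file name
dest="your_user_name@your_server_name:/folder_name_where_you_want_to_copy"

# Print the command by default; run it for real only when RUN=1
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        echo "$@"
    fi
}

# Transfer a single file; --partial keeps an interrupted file for a later rerun
transfer_one() {
    run rsync -av --partial --progress "$1" "$dest"
}

transfer_one "$src_file"
```

Once one file has arrived intact, the same function can be called in a loop over the remaining .bam files.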

modified 4 weeks ago • written 4 weeks ago by genomax (65k)
4 weeks ago, ATpoint (15k), Germany, wrote:

That sounds like a large amount of data, so it will take time. I do not think there is any way around transferring the files to the HPC scratch, as a desktop computer is simply not powerful enough to handle this much data. For transferring data from A to B I use rsync, which keeps the file under a hidden temporary name at the destination until the transfer has finished successfully. I prefer the following command:

rsync -axv --numeric-ids --progress -e "ssh -T -o Compression=no -x" *.bam user@path_to_hpc(...):/scratch/your_username/folder...

This will take time, but --progress will give you a rough estimate for each file. Check whether it might be faster to download directly to the HPC. Talk to the people involved. Depending on how fast the drive holding the data is, a new download directly to scratch might save you quite some time.
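
Since the original problem was files arriving damaged, it may be worth comparing checksums once a file has landed. The sketch below uses two temporary local directories as stand-ins for the external drive and scratch, and a made-up file name, so it runs anywhere; on the real setup you would run md5sum once on each side and compare the two values.

```shell
src_dir=$(mktemp -d)   # stands in for the external drive
dst_dir=$(mktemp -d)   # stands in for scratch on the cluster

# Placeholder content, not a real BAM file
printf 'placeholder content\n' > "$src_dir/sample1.bam"
cp "$src_dir/sample1.bam" "$dst_dir/sample1.bam"

# Checksum each copy; reading from stdin keeps the file name out of the output
src_sum=$(md5sum < "$src_dir/sample1.bam" | cut -d' ' -f1)
dst_sum=$(md5sum < "$dst_dir/sample1.bam" | cut -d' ' -f1)

if [ "$src_sum" = "$dst_sum" ]; then
    status="OK"
else
    status="MISMATCH"
fi
echo "sample1.bam: $status"
```

If samtools is available on the cluster, samtools quickcheck on the transferred file is another cheap way to spot a truncated BAM.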

modified 4 weeks ago • written 4 weeks ago by ATpoint (15k)

Sorry for being stupid, but if the .bam files are on a hard drive, can I still use rsync?

written 4 weeks ago by F (3.4k)

Yes. For the end-user rsync is a much more elaborate version of cp as explained here.

modified 4 weeks ago • written 4 weeks ago by ATpoint (15k)

Thank you so much, gentlemen, for helping me. I am finally transferring the files from the command line and the speed seems reasonable. I am looking forward to my next round of posts here, at the data analysis step! :)

written 29 days ago by F (3.4k)
Powered by Biostar version 2.3.0