How to deal with intermediate files
19 days ago

Sometimes during an analysis we create intermediate files that take a long time to generate (results of numerical analyses, bioinformatics pipelines, etc.) and that can subsequently be used for graphs and reports. Where do you save those so that they can be used from a different computer or shared with a collaborator?

For example, I may run long, compute-intensive analyses on our HPC and then produce the report with graphs on my laptop. I can share the analysis and report code between the HPC, my laptop, and collaborators using git, but what about the analysis results? I don't have a good systematic method (I rely on rsync). What do you do?

Any suggestions appreciated.

procedures data operations • 637 views
19 days ago
dthorbur ★ 3.0k

There are a bunch of options, and I've seen a lot of different methods over the years. Here are a few of the common ones I've seen/used:

  1. Most universities I've worked at have a subscription to one of the major file sharing services (Google Drive, OneDrive, Dropbox, etc.).
  2. SFTP/SCP.
  3. Globus.
  4. Sending physical hard drives (the least efficient, but somehow a sequencing company once thought this was appropriate).

It really depends on the number, size, and sensitivity of the files. Email will suffice for most small files and can be encrypted easily. Otherwise, compressing and encrypting are your best bets to make things more manageable. Sometimes it's not feasible to share intermediate files. In that case, sending the scripts used to generate the output from raw data, with appropriate seeds set, should reproduce the same data, but you should probably also send the containers the analysis runs in to ensure consistency. This is useful for things that generate silly amounts of data, such as linkage analyses.
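To illustrate the "scripts plus seeds" route, here is a minimal Python sketch. The `simulate` function is a hypothetical stand-in for an expensive analysis step, not anything from a real pipeline. It draws from a locally seeded generator, and `digest` produces a checksum that both sides can compare to confirm they regenerated the same data. Bit-identical output across different Python versions or platforms is not guaranteed, which is the reason for shipping a container as well.

```python
import hashlib
import random


def simulate(seed: int, n: int = 1000) -> list[float]:
    """Hypothetical stand-in for an expensive, randomised analysis step."""
    rng = random.Random(seed)  # local generator, so no hidden global state
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def digest(values: list[float]) -> str:
    """SHA-256 over the repr() of each value, one value per line."""
    h = hashlib.sha256()
    for v in values:
        h.update(repr(v).encode("ascii"))
        h.update(b"\n")
    return h.hexdigest()


if __name__ == "__main__":
    # Each collaborator runs this and compares the printed checksum.
    print(digest(simulate(seed=42)))
```

If the checksums match, nobody needs to transfer the intermediate file itself; only the script, the seed, and the expected checksum have to travel.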


Great list. Globus may be the perfect option if it is already installed on the HPC.

Something to consider before using any of the above: not all cloud and other services are certified for sharing personally identifiable data (which in some regions can include plain human sequence). Always check local data security policies on both ends, so that you don't get collaborators in trouble, before using any of these methods.


Fair point about the sensitivity of data. I don't work on human or other particularly sensitive data, so I never really thought about it.


Thank you!! I am aware of all of those; however, they are not exactly what I am looking for. I can transfer data from one place to another, but is there a better way to track, save, and share data? One simpler approach that comes closer is to mount the HPC locations on my laptop, so that nothing has to be transferred. Someone also just pointed me to https://dvc.org/, which seems like a more practical way of doing this. Do you have any other suggestions along these lines?
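To make "track, save, and share" concrete, here is a rough Python sketch of the general idea behind tools in this space: keep the large file in a shared store keyed by its checksum, and commit only a small pointer file to git. This is an illustration under my own assumptions, not how DVC is actually implemented. The names `track`, `restore`, and the `.ptr.json` suffix are made up for the example, and the store is assumed to be a directory that both machines can reach (for example a mounted HPC path).

```python
import hashlib
import json
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large files do not have to fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def track(data_file: Path, store: Path) -> Path:
    """Copy data_file into the store under its checksum and write a pointer file."""
    checksum = sha256_of(data_file)
    store.mkdir(parents=True, exist_ok=True)
    target = store / checksum
    if not target.exists():  # identical content is stored only once
        shutil.copy2(data_file, target)
    meta = {
        "path": data_file.name,
        "sha256": checksum,
        "size": data_file.stat().st_size,
    }
    # Hypothetical pointer format: small enough to commit to git.
    pointer = data_file.with_name(data_file.name + ".ptr.json")
    pointer.write_text(json.dumps(meta, indent=2) + "\n")
    return pointer


def restore(pointer: Path, store: Path) -> Path:
    """Recreate the data file next to its pointer and verify the checksum."""
    meta = json.loads(pointer.read_text())
    dest = pointer.with_name(meta["path"])
    shutil.copy2(store / meta["sha256"], dest)
    if sha256_of(dest) != meta["sha256"]:
        raise ValueError(f"checksum mismatch for {dest}")
    return dest
```

On the HPC you would call `track` on each intermediate file and commit the pointer files; on the laptop, after a `git pull`, `restore` fetches the matching content from the store. The store itself still has to live somewhere that both sides can access.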


"One simpler way that is closer is for example to mount the locations on our HPC to my laptop and hence not having to transfer anything"

Are you doing something on your local computer that is not possible to do on the HPC? Local compute can be used as a last resort or for convenience (if you don't want to log in to the HPC remotely). If you don't wish to transfer anything, you can give your collaborators access to the storage (folders) on the HPC where the data sits. This may be easy to do via Unix group permissions (they can be fine-grained, provided the collaborators are local and have access to the same HPC), or you could work via the cloud (if the collaborators are not local).
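As a small illustration of the group-permissions route, the Python sketch below walks a results directory and adds group read access (plus the execute bit on directories, so they can be entered). The function name `share_with_group` and the optional `group` argument are made up for this example. Changing the group owner only works if you belong to that group, and your HPC may have its own policy or tooling (such as ACLs) that should take precedence.

```python
import os
import shutil
import stat
from pathlib import Path
from typing import Optional


def share_with_group(directory: Path, group: Optional[str] = None) -> None:
    """Give the owning group read access to everything under directory.

    If group is given, also hand group ownership to that (existing) Unix group.
    """
    for root, _dirs, files in os.walk(directory):
        root_path = Path(root)
        if group is not None:
            shutil.chown(root_path, group=group)
        # Directories need read + execute to be listed and entered.
        mode = stat.S_IMODE(root_path.stat().st_mode)
        root_path.chmod(mode | stat.S_IRGRP | stat.S_IXGRP)
        for name in files:
            file_path = root_path / name
            if group is not None:
                shutil.chown(file_path, group=group)
            # Files only need the group read bit; other bits stay as they are.
            file_mode = stat.S_IMODE(file_path.stat().st_mode)
            file_path.chmod(file_mode | stat.S_IRGRP)
```

Permissions for "other" users are left untouched, so anyone outside the group gains no access from this.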

"which seems like a more practical way of doing this"

How so? That software seems to allow you to do version control but you still need to provide storage and/or move the data.


Yes, on my laptop I produce the final reports with graphs, etc., and all the code is under source control with git. I am not computing anything on the laptop, just creating reports. I don't necessarily need to discuss my own workflow, which is limited, and that is why I am asking. I am more interested in learning about others' workflows to see if there is anything better. What is yours?


Not sure I fully understand your question. Many of us use the local computer as a dumb terminal to get to the HPC, where much of the work happens, so most of the time there is no need to move any data. In the original post you were asking how to share intermediate data files, so we were trying to address that part.

In case you are interested in the broader question of how to manage data related to individual projects, you will find these past threads useful:

file and directory management best practices
How Do You Manage Your Files & Directories For Your Projects ? (this is a much older thread but still should be useful)
https://genomespot.blogspot.com/2021/02/storing-your-sequence-data.html
