Forum: Tips on how to organise data on servers
Written 4 months ago by flight505:

I am looking to hear from anyone and everyone. You don't have to have built a data warehouse, but if you have worked on servers that were meticulously maintained and organised, and you can explain what you liked, what you disliked, and which features you would have liked, then you are definitely qualified to comment on this post :-)

We are currently building a smaller data warehouse and could use advice from others, particularly on how they organised their data. The warehouse provides dedicated nodes with decent memory and layers of security to comply with numerous regulations. It will store GWAS data, such as raw genotype data and QC'ed data, and function as a workspace.

We previously stored our raw genotype data on servers without any good order or structure in place. Besides ending up with several duplicate copies of raw and QC'ed data, we also had severe problems handing over information, for example when students finished or otherwise left. Usually a folder contained only the data; information about how the data was processed, what was done to it, or where it was obtained was kept in someone's head or written down in a thesis. Some of this might be solvable just by adding a README file to each folder and enforcing that others do the same (it would be a start). But in some cases it might also help to have other structures that cross-reference or organise biomarkers, phenotype data, and cohorts.
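To make the README convention above enforceable rather than aspirational, something like the following could run periodically on the server. This is only a sketch under my own assumptions (folder layout, README filenames, and function names are all hypothetical, not part of the original post): it flags data folders with no README and groups byte-identical files by content hash, which directly targets the duplicate-copies problem described above.

```python
import hashlib
from collections import defaultdict
from pathlib import Path

# Hypothetical README filenames we accept as documentation for a folder.
README_NAMES = ("README", "README.md", "README.txt")

def find_missing_readmes(root):
    """Return all directories under `root` that contain no README file."""
    missing = []
    for folder in (p for p in Path(root).rglob("*") if p.is_dir()):
        if not any((folder / name).exists() for name in README_NAMES):
            missing.append(folder)
    return missing

def find_duplicates(root):
    """Group files under `root` by content hash; keep groups with >1 file.

    Hashing full file contents is slow for large genotype files; in
    practice one might compare sizes first and only hash size-collisions.
    """
    by_hash = defaultdict(list)
    for f in Path(root).rglob("*"):
        if f.is_file():
            digest = hashlib.md5(f.read_bytes()).hexdigest()
            by_hash[digest].append(f)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}
```

A cron job that emails the output of both functions to the group would at least surface undocumented folders and duplicate copies before a handover, instead of after.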


Thanks for your time and input

— modified 4 months ago by RamRS • written 4 months ago by flight505

I've moved it to a Forum post as this is more a discussion than a question with a finite number of "correct" answers.

— written 4 months ago by RamRS

There are some papers on the subject:

Ten Simple Rules for Creating a Good Data Management Plan

Good enough practices in scientific computing

— written 4 months ago by h.mon

Great question! I'm also interested to hear from people.

From my personal experience as an end user, finding the correct data was troublesome because no master file existed outlining where to find each file, what was done to it, when and how it was produced, and other metadata (e.g. which specimen it came from).
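The "master file" described above could start as nothing fancier than a CSV manifest with one row per dataset. A minimal sketch, assuming a flat CSV and field names I've invented for illustration (`register`, `lookup`, and the columns are all hypothetical, not from the thread):

```python
import csv
from pathlib import Path

# Hypothetical manifest columns: where the file is, when/how it was made,
# what it was derived from, and which specimen it describes.
MANIFEST_FIELDS = ["path", "produced_on", "derived_from", "pipeline", "specimen"]

def register(manifest, **record):
    """Append one dataset record to the master manifest CSV."""
    manifest = Path(manifest)
    is_new = not manifest.exists()
    with manifest.open("a", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=MANIFEST_FIELDS)
        if is_new:
            writer.writeheader()
        writer.writerow({f: record.get(f, "") for f in MANIFEST_FIELDS})

def lookup(manifest, **query):
    """Return every record whose fields match all key/value pairs in `query`."""
    with Path(manifest).open(newline="") as fh:
        return [row for row in csv.DictReader(fh)
                if all(row.get(k) == v for k, v in query.items())]
```

The `derived_from` column is what makes handovers tractable: a new student can walk from a QC'ed file back to its raw source without asking anyone.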

— written 4 months ago by Amar

Thanks for the upvote. When I started my PhD it was in a new location; I had zero experience with the group, and everything was more or less handed down by word of mouth. It took me three months just to learn whom I should ask for access and what I could perhaps gain access to. What I wouldn't give for a TL;DR or welcome package...

— modified 4 months ago • written 4 months ago by flight505

We have implemented a simple spec/tool for storing and managing our reference genome datasets, and you could probably reuse or build on our approach. The link to the tool is here. This is how the browsable interface looks. It's pretty simple to implement, mainly using git, GitHub, and a small Python codebase. Hope this is useful for your purposes.
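The reply above doesn't detail the spec itself, so the following is only my guess at the general shape of such an approach, not the actual tool: one small JSON spec file per dataset, committed to git, validated by a script so that incomplete metadata is rejected before it reaches the repository. The required keys and function name here are assumptions for illustration.

```python
import json
from pathlib import Path

# Hypothetical required metadata for each dataset spec committed to git.
REQUIRED_KEYS = {"name", "source_url", "version", "md5", "date_added"}

def validate_spec(spec_path):
    """Load a dataset spec JSON and fail loudly if required keys are missing.

    Running this over every spec file in a pre-commit hook or CI job
    keeps undocumented datasets out of the repository.
    """
    spec = json.loads(Path(spec_path).read_text())
    missing = REQUIRED_KEYS - spec.keys()
    if missing:
        raise ValueError(f"{spec_path}: missing keys {sorted(missing)}")
    return spec
```

Because the specs live in git, `git log` on a spec file doubles as a provenance history for the dataset it describes.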

— written 4 months ago by ashok.ragavendran


Powered by Biostar version 2.3.0