Forum: Tips on how to organise data on servers
flight505 wrote (17 days ago, Denmark/DTU):

I am looking to hear from anyone and everyone. You don't have to have built a data warehouse, but if you have worked on servers that were meticulously maintained and organised, and can explain what you liked, what you disliked, and which features you would have wanted, then you are definitely qualified to comment on this post :-)

Problem:
We are currently building a smaller data warehouse and could use advice from others, particularly on how they organised their data. The warehouse provides dedicated nodes with decent memory and layers of security to comply with numerous regulations. It will store GWAS data, such as raw genotype data and QC'ed data, and function as a workspace.

We have previously stored our raw genotype data on servers without a good order or structure in place. Besides ending up with several duplicate copies of raw data and QC'ed data, we also had severe problems with handovers, for example when students finished or otherwise left. Usually, a folder contained only the data; information about how the data was processed, what was done to it, and where it was obtained was kept in someone's head or written down in a thesis. Some of this might be solvable just by adding, and enforcing others to add, a README file to each folder (it would be a start). But in some cases it might also help to have other structures that cross-reference or organise biomarkers, phenotype data, and cohorts.
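Enforcing the README convention can be automated with a periodic audit script. A minimal sketch in Python (the data root path and folder layout are illustrative, not our actual setup):

```python
from pathlib import Path

def folders_missing_readme(root):
    """Return directories under `root` that contain files but no README."""
    missing = []
    for folder in sorted(p for p in Path(root).rglob("*") if p.is_dir()):
        files = [f for f in folder.iterdir() if f.is_file()]
        # Flag any folder that holds data files but no README* file.
        if files and not any(f.name.lower().startswith("readme") for f in files):
            missing.append(folder)
    return missing
```

Something like this could run from cron and mail the owners of flagged folders, so the convention doesn't depend on everyone remembering it.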


Thanks for your time and input

modified 17 days ago by RamRS • written 17 days ago by flight505

I've moved it to a Forum post as this is more a discussion than a question with a finite number of "correct" answers.

written 17 days ago by RamRS

There are some papers on the subject:

Ten Simple Rules for Creating a Good Data Management Plan

Good enough practices in scientific computing

written 17 days ago by h.mon

Great question! I'm also interested to hear from people.

From my personal experience as an end user, finding the correct data was troublesome because no master file existed outlining where to find each file, what was done to it, when and how it was produced, and other metadata (e.g. what specimen it came from).
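One remedy is a single machine-readable index at the root of the data share. A hedged sketch, assuming a tab-separated `MANIFEST.tsv` whose columns (path, description, produced, processing, specimen) are invented here for illustration:

```python
import csv
from io import StringIO

# Hypothetical manifest: one row per dataset; columns are illustrative.
MANIFEST = """\
path\tdescription\tproduced\tprocessing\tspecimen
raw/cohortA.bed\tRaw genotypes, cohort A\t2019-03-01\tnone\tblood
qc/cohortA_qc.bed\tQC'ed genotypes, cohort A\t2019-04-12\tMAF>0.01, HWE filter\tblood
"""

def load_manifest(text):
    """Parse the tab-separated manifest into a list of dicts keyed by column."""
    return list(csv.DictReader(StringIO(text), delimiter="\t"))

def find(records, keyword):
    """Return entries whose description mentions the keyword (case-insensitive)."""
    return [r for r in records if keyword.lower() in r["description"].lower()]

records = load_manifest(MANIFEST)
for r in find(records, "QC"):
    print(r["path"], "-", r["processing"])
```

The point is less the tooling than the habit: one file, updated with every dataset, answers "where is it, what was done to it, and when" without asking around.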

written 17 days ago by Amar

Thanks for the upvote. When I started my PhD it was in a new location; I had zero experience with the group, and everything was more or less handed down by word of mouth. It took me three months just to learn who I should ask for access and what I could perhaps gain access to. What I wouldn't give for a TL;DR or welcome package...

modified 17 days ago • written 17 days ago by flight505

We have recently implemented a simple spec/tool for storing and managing our reference genome datasets, and you could probably try our approach or build on it. The tool is here: https://compbiocore.github.io/refchef/ and this is what the browsable interface looks like: https://compbiocore.github.io/refchef-view/ It's pretty simple to implement, mainly using git and GitHub plus a small Python codebase. Hope this is useful for your purposes.
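The general idea, keeping dataset metadata as version-controlled text plus a small loader, can be sketched without refchef itself. This JSON layout is invented for illustration and is not refchef's actual spec:

```python
import json

# Illustrative metadata record for one reference dataset, kept as a text
# file in a git repository so provenance changes are reviewed like code.
RECORD = json.loads("""
{
  "name": "GRCh38_primary",
  "source_url": "http://example.org/GRCh38.fa.gz",
  "downloaded": "2019-05-20",
  "md5": "d41d8cd98f00b204e9800998ecf8427e",
  "steps": ["download", "decompress", "samtools faidx"]
}
""")

def describe(record):
    """Render a one-line provenance summary for a dataset record."""
    return (f"{record['name']}: from {record['source_url']} "
            f"on {record['downloaded']} ({len(record['steps'])} steps)")

print(describe(RECORD))
```

Because the records are plain text under git, `git log` on the metadata file doubles as an audit trail of who changed what and when.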

written 17 days ago by ashok.ragavendran
Powered by Biostar version 2.3.0