Question

Forum:Tips on: How to organise data on servers

3

4.7 years ago

flight505 ▴ 110

I would like to hear from anyone and everyone. You don't need to have built a data warehouse: if you have worked on servers that were meticulously maintained and organised, and you can explain what you liked, what you disliked, and which features you would have liked to have, you are definitely qualified to comment on this post :-)

Problem:

We are currently building a smaller data warehouse and could use advice from others, particularly on how they organised their data. The warehouse provides dedicated nodes with decent memory and layers of security to comply with numerous regulations. It will store GWAS data, such as raw genotype data and QC'ed data, and function as a workspace.

We previously stored our raw genotype data on servers without any clear order or structure in place. Besides ending up with several duplicate copies of raw and QC'ed data, we also had severe problems handing over information, for example when students finished or otherwise left. Usually the folder contained only the data, while information about what was done to the data, how it was done, and where the data came from was kept in someone's head or written down in a thesis. Some of this might be solvable just by adding a README file to each folder and requiring others to do the same (it would be a start). But in some cases it might also help to have other structures that cross-reference or help organise biomarkers, phenotype data, and cohorts.
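To make the README idea and the duplicate problem concrete, here is a minimal sketch of an audit script. It assumes one dataset per folder directly under a storage root and treats any file whose name starts with "readme" as the README. The function names and the layout are hypothetical, not an existing tool.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def folders_missing_readme(root):
    """Return names of folders directly under `root` that have no README file."""
    root = Path(root)
    missing = []
    for folder in sorted(p for p in root.iterdir() if p.is_dir()):
        # Assumption: any file starting with "readme" (any case) counts.
        has_readme = any(
            f.is_file() and f.name.lower().startswith("readme")
            for f in folder.iterdir()
        )
        if not has_readme:
            missing.append(folder.name)
    return missing


def find_duplicate_files(root):
    """Group files under `root` by SHA-256 digest and return groups of 2+ files."""
    root = Path(root)
    by_digest = defaultdict(list)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            digest = hashlib.sha256()
            with path.open("rb") as handle:
                # Read in 1 MiB chunks so large genotype files fit in memory.
                for chunk in iter(lambda: handle.read(1 << 20), b""):
                    digest.update(chunk)
            by_digest[digest.hexdigest()].append(str(path.relative_to(root)))
    return [paths for paths in by_digest.values() if len(paths) > 1]
```

Something like this could run on a schedule and report folders without a README and byte-identical copies of the same file. It would not catch duplicates that differ only in compression or file format.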

Thanks for your time and input

genotype warehouse servers • 1.0k views

ADD COMMENT • link updated 13 months ago by Ram 44k • written 4.7 years ago by flight505 ▴ 110

1

I've moved it to a Forum post as this is more a discussion than a question with a finite number of "correct" answers.

ADD REPLY • link 4.7 years ago by Ram 44k

1

There are some papers on the subject:

Ten Simple Rules for Creating a Good Data Management Plan

Good enough practices in scientific computing

ADD REPLY • link 4.7 years ago by h.mon 35k

0

Great question! Also interested to hear from people.

From my personal experience as an end user, finding the correct data was troublesome because no master file existed that outlined where to find each file, what was done to it, when and how it was produced, and other metadata (such as which specimen it came from).
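As a rough sketch of what such a master file could look like, the script below collects one small metadata file per dataset folder into a single CSV catalogue. The file name `metadata.json`, the field names, and the one-folder-per-dataset layout are all assumptions made for illustration, not an established convention.

```python
import csv
import json
from pathlib import Path

# Hypothetical catalogue columns; adjust to whatever the group agrees on.
FIELDS = ["dataset", "description", "source", "produced_on", "processing", "specimen"]


def build_catalog(root):
    """Collect each `<root>/<dataset>/metadata.json` into a list of row dicts."""
    root = Path(root)
    rows = []
    for meta_path in sorted(root.glob("*/metadata.json")):
        meta = json.loads(meta_path.read_text())
        row = {"dataset": meta_path.parent.name}
        # Missing fields are left blank so gaps are visible in the catalogue.
        for field in FIELDS[1:]:
            row[field] = meta.get(field, "")
        rows.append(row)
    return rows


def write_catalog(rows, out_path):
    """Write the collected rows to a CSV file with a fixed header."""
    with open(out_path, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        writer.writeheader()
        writer.writerows(rows)
```

Regenerating the catalogue from the per-folder files keeps the metadata next to the data it describes, so the master file cannot drift far from what is actually on disk.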

ADD REPLY • link 4.7 years ago by Mark ★ 1.5k

1

Thanks for the upvote. When I started my PhD it was in a new location, I had zero experience with the group, and everything was more or less handed down by word of mouth. It took me three months just to find out who I should ask to get access and what I could perhaps gain access to. What I wouldn't give for a TL;DR or a welcome package.

ADD REPLY • link 4.7 years ago by flight505 ▴ 110

0

We have implemented a simple spec/tool for storing and managing our reference genome datasets, and you could probably use our approach or build on it. The tool is here: https://compbiocore.github.io/refchef/ and this is what the browsable interface looks like: https://compbiocore.github.io/refchef-view/ It's pretty simple to implement, relying mainly on git and GitHub plus a small Python codebase. Hope this is useful for your purposes.

ADD REPLY • link 4.7 years ago by ashok.ragavendran ▴ 50