Question: How Do You Manage Your Files & Directories For Your Projects?

Asked by Pierre Lindenbaum (France), 4.0 years ago · 50 votes:

People in a laboratory, working on the same project, generate all kinds of files (fasta, images, raw data, statistics, readme.txt, etc.) that end up scattered across various directories. How do you manage the hierarchy of those directories?

  • there is no standard hierarchy and the files are dropped anywhere; it all relies on common knowledge.
  • there is a clearly defined hierarchy (PROJECT_NAME/DATE/machine_user_image_result_but_this_is_the_second_run_because_1st_failed.txt...)
  • files are uploaded to a wiki (you wouldn't do that for large files)
  • there is a central file/wiki describing what each file is and where it lives
  • there is a Readme.txt/describe.xml in each folder.
  • there is a tool (?) for managing this kind of information?
  • (...) ?

Thanks

Pierre


currently, it's a mess :-)

— Pierre Lindenbaum, 3.4 years ago

Pierre: I am wondering how you are managing your files & directories?

— Khader Shameer, 3.4 years ago

Why do you need files anyway? Files are from the 70's, not going to scale these days.

— michaelhavner, 2.8 years ago

Answered by Giovanni M Dall'Olio (London, UK), 4.0 years ago · 54 votes:

On my local computer, I have:

  • a 'workspace' folder, in which each sub-folder corresponds to a separate project
  • a 'data' folder where I put all the data used by more than one project
  • an 'archive' folder with all finished projects

Within each project folder, I have:

  • planning/ -> a folder containing all the files related to the early phase of the project. Usually this is the first folder I create, and here I store all the miscellaneous files (notes/objectives/initial drafts) that I collect in the first weeks of a project, when I am still not sure which programs to write.
  • bugs/ -> I used to use ditz to keep track of bugs and To-Dos, but now I only use hand-written A7 notes
  • data/
    • folders containing the different data I need to use, soft-linked from ~/data
  • parameters/ -> ideally, I should have configuration files so that if I want to run my analysis on another dataset, I only have to change the parameters here
  • src/ -> with all code
    • a Makefile to re-run all the analysis I wish
    • scripts/ with all the scripts
    • lib/ if I happen to be reusing code from other projects
    • pipelines/ with all .mk (makefile) files
  • results/
    • tables/ -> tabular-like results
    • plots/ -> plots
    • manuscript/ -> draft for the manuscript, final figures and data, etc.
      • figures/
      • tables/
      • references/

I use git for revision control, to get a log of all the changes I make to scripts and results. Lately I have been reading about Sumatra and am planning to give it a try (a slideshow for the curious here).
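
A minimal shell sketch of how such a skeleton could be set up (the project name, the soft-linked dataset and the exact commands are illustrative assumptions, not Giovanni's actual setup):

    # illustrative: create the project layout described above and start version control
    project=~/workspace/my_project                        # hypothetical project name
    mkdir -p "$project"/{planning,bugs,parameters,data}
    mkdir -p "$project"/src/{scripts,lib,pipelines}
    mkdir -p "$project"/results/{tables,plots,manuscript/{figures,tables,references}}
    ln -s ~/data/shared_dataset "$project"/data/          # soft-link shared data (hypothetical name)
    cd "$project" && git init                             # track changes to scripts and results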

I still have to decide where best to put .RData files, as I am still a novice in R.

note: you probably already know the PLoS article "A Quick Guide to Organizing Computational Biology Projects"


I put .RData files with my "code" files (so I load them when I open the relevant code files).

— Tal Galili, 4.0 years ago

I usually keep RData files in a "data" directory and use setwd() in my R script to point to it. Doesn't really matter how you do it, so long as source() finds the R script and the R script finds the data.

— Neilfws, 4.0 years ago

thanks!! That is more or less what I am doing now, but I am not sure whether I should create a separate 'RData' directory.

— Giovanni M Dall'Olio, 4.0 years ago

I only store useful or large objects (like huge matrices from aCGH) in .RData files, which then take less disk space. So sometimes .RData files replace the original text files in my 'data' folder. This was just a remark...

— Tony, 4.0 years ago

I asked this question 8 months ago. It's time to validate the most voted answer :-)

— Pierre Lindenbaum, 3.3 years ago

thank you very much :-)

— Giovanni M Dall'Olio, 3.3 years ago

Answered by Jeremy Leipzig (Philadelphia, PA), 4.0 years ago · 14 votes:

one related tip is a handy bash script I got from:

http://dieter.plaetinck.be/per_directory_bash_history

which produces directory-specific bash histories (instead of one giant global history)

whenever I enter a directory I can easily access everything I ever did there, which is priceless when I am trying to remember what I actually did
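
A minimal sketch of the idea (the function name and the per-directory history file name are just assumptions for illustration; the linked script is more complete):

    # add to ~/.bashrc: keep a separate bash history file in each directory
    hcd() {
        history -a                                    # flush the current history to its file
        builtin cd "$@" || return
        export HISTFILE="$PWD/.dir_bash_history"      # per-directory history file (hypothetical name)
        history -c                                    # drop the old in-memory history
        history -r                                    # load this directory's history
    }
    alias cd='hcd'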


a version that allows multiple users to view each other's histories: http://jermdemo.blogspot.com/2010/12/directory-based-bash-histories.html

— Jeremy Leipzig, 3.3 years ago

Answered by Casbon, 3.5 years ago · 9 votes:

I follow this data plan


Fantastic! Thanks for sharing!

— None, 3.4 years ago

Excellent - made me laugh! Or was that cry.

— Niallhaslam, 3.2 years ago

Answered by Rvosa (Leiden, the Netherlands), 3.3 years ago · 9 votes:

The following article discusses this exact question, and gives useful tips (which I now follow):

William Stafford Noble 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. http://dx.doi.org/10.1371/journal.pcbi.1000424


Answered by Yuri (Bethesda, MD), 4.0 years ago · 8 votes:

Great question, thanks!

In my opinion there are several layers of files, and a different approach should be applied at each level. Here is how it's organized in our lab.

  1. Raw data (microarrays, for example)
    • Files are named and stored in clearly defined schema
    • Regular backup is mandatory
    • Some files which we probably will not use further (like Affymetrix DAT files) are archived.
    • Access to the files is controlled
    • General information on experiments is stored in a LIMS (we are using Labmatrix, but it's commercial)
    • We also store some preprocessed data (normalized data, for example) if the procedure is clearly defined as an SOP.
  2. Temporary data (ongoing analysis)
    • Basically everybody is on their own here. The files are usually stored locally and everyone is responsible for their own backups. I can access the data I need remotely (from home, for example).
    • I do keep some hierarchy based on projects, data type and analysis, but it's not strict and it is project-dependent.
    • I found Total Commander to be very useful for file management. For example, I can write a small comment for every file (Ctrl-Z); it's stored in a text file, and if I copy or move a file, the description goes with it.
    • Files to be shared with the project team are kept on a network shared drive with regular backups.
  3. Results to share (documents, figures, tables, ...)
    • We are using Backpack from 37signals. It's like a wiki, but a little easier for non-tech users. Together with Basecamp for project management it's quite good; however, it's again commercial and may not suit everybody.

Answered by Fred Fleche (Paris, France), 4.0 years ago · 7 votes:
  • For analyzed results, documentation and presentations (pdf, ppt, doc, xls) we are using eRoom, provided by EMC Corporation
  • For experimental results it is a mix of your bullets 1 and 2: there is a clearly defined hierarchy, but after a while it all relies on common knowledge to retrieve information when you need it urgently.
  • Some groups are experimenting with the ELN provided by CambridgeSoft
  • We are also trying to create small social databases (i.e. an antibody database where people can share / retrieve their Western Blot experiments, to avoid different people testing the same antibodies - they are able to "rate" the antibodies tested.)

can EMC be accessed programmatically or via the command line?

— Jeremy Leipzig, 4.0 years ago

I don't know. Up to now I always used a web browser.

— Fred Fleche, 4.0 years ago

Answered by M.Eckart (Germany), 3.4 years ago · 7 votes:

We're trying to set up the bioinformatics tool Epos right now. The tool is free and was made by Thasso Griebel at our university, with some help from us. It describes itself as a modular software framework for phylogenetic analysis and visualization.

The MySQL version isn't finished yet, but the program is cool and exactly what we needed to sort our stuff.

http://bio.informatik.uni-jena.de/epos/


It looks like Epos is a system for managing phylogenetic data and analyses only. I think Pierre is asking a general question about how to manage different types of bioinformatics projects. Is Epos extensible to other use cases besides phylogenetics?

— Casey Bergman, 3.4 years ago

It was initially written for phylogenetic analyses, but for other types of projects there are powerful interfaces to other programs too. And of course you can write your own, though unfortunately that is mostly not what a normal user wants to do. I think it's worth taking a look at because it's new and not yet well known. It also comes with nice features like its own script editor and management of cluster analyses. We were impressed, but as you mentioned, only for our phylogenetic analyses... So thanks for your comment.

— M.Eckart, 3.4 years ago

Answered by Niek De Klein (Netherlands), 3.6 years ago · 6 votes:

Dropbox (http://www.dropbox.com/) is a nice way of keeping all files synchronized. You can link your computers to it, and if you drop a file in it, it will automatically be updated on all linked computers.


...as well as being a great way of sharing files with another person

— Yannick Wurm, 3.6 years ago

Take a look at SparkleShare, http://sparkleshare.org/documentation.html, an open-source alternative that lets you use your space on GitHub or Gitorious to share files and also uses git to version them.

— Giovanni M Dall'Olio, 3.6 years ago

sorry, the correct link is http://sparkleshare.org/

— Giovanni M Dall'Olio, 3.6 years ago

Answered by Khader Shameer (Rochester, MN), 4.0 years ago · 5 votes:

I have the following major directories:

/work - this is where I keep my project directories

/data - raw, unprocessed data

/software - 3rd party software required for the various work flows

/code - general code repo

I lay out each individual work directory as follows:

/work
   | 
   /work/project1 
          |
          sub-directories created per analysis

For example: code, analysis, results, etc. Depending on the repetitive nature of the analysis, I create date-based directories to track files generated at different time points. I also keep a README file in each directory of the project to make it easier to check the contents at a later stage. I am a big fan of "tree" whenever I need to check the contents of a directory. Irrespective of the various data categories I deal with, this format has worked for me.
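
As a sketch of this pattern (the helper name, the dated-directory layout and the README fields are assumptions, not Khader's actual setup):

    # hypothetical helper: start a new dated analysis directory with a README
    new_analysis() {
        local proj="$1" name="$2"
        local dir="/work/$proj/$(date +%F)_$name"     # e.g. /work/project1/2013-06-01_qc
        mkdir -p "$dir"/{code,analysis,results}
        printf 'Project: %s\nAnalysis: %s\nCreated: %s\n' \
            "$proj" "$name" "$(date)" > "$dir/README"
        tree -L 2 "$dir"                              # quick check of what was created
    }
    # usage: new_analysis project1 qc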


Answered by Ketil, 3.5 years ago · 4 votes:

We currently have a slightly ad-hoc hierarchy, split on organism and then data type. This causes conflicts and generally doesn't scale (I guess most people working on this are familiar with http://www.shirky.com/writings/ontology_overrated.html?)

My goal is to have defined datasets (i.e. collections of related files), with each dataset residing in its own subdirectory. In addition to the data files, there will be a metadata file containing the relevant metadata: the list of files, their types, checksums, the program or process that generated them, the person responsible, and the relationships between datasets.

This way, applications like our BLAST server will be able to trawl the data sets, identify those containing Fasta-files, and add them to the database with correct type and metainformation.
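
A minimal sketch of such a per-dataset metadata file (the MANIFEST name and its fields are assumptions, not an agreed format):

    # write a simple manifest with provenance and checksums for one dataset
    cd /data/datasets/my_dataset || exit 1            # hypothetical dataset directory
    {
        echo "dataset: my_dataset"
        echo "created: $(date -u +%Y-%m-%dT%H:%M:%SZ)"
        echo "owner: $USER"
        echo "files:"
        for f in *; do
            [ "$f" = MANIFEST ] && continue           # don't list the manifest itself
            [ -f "$f" ] || continue                   # skip subdirectories
            printf '  - name: %s\n    md5: %s\n' "$f" "$(md5sum "$f" | cut -d" " -f1)"
        done
    } > MANIFEST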


Some more details at http://blog.malde.org/index.php/2011/06/02/presentation-on-the-lack-of-data-management-practices/

— Ketil, 2.9 years ago

Answered by Chris Evelo (Maastricht, The Netherlands), 3.2 years ago · 4 votes:

A lot of the answers given here are very useful, but basically I would advocate another approach. Central to any wet-lab study is the study description itself. We want to capture that design, including the sample and assay descriptions, using a Generic Study Capture Framework (GSCF) that is part of the systems biology database dbNP. In this we follow the same philosophy as the generic study description standard ISA-Tab (reading and writing ISA-Tab will be added to dbNP soon): the study description links to the actual raw, cleaned, statistically evaluated and biologically interpreted data. This way you don't really have to structure where the files are, since you can just find them from GSCF. GSCF is currently under development as part of the open source project dbNP.

Two papers about dbNP were published here and here.

Of course file location storage is just a small aspect of GSCF. It is mainly about ontology-based study capture using NCBO ontologies, and queries based on that. It should also facilitate data submission to the EBI and NCBI repositories.


Answered by Radhouane Aniba, 2.8 years ago · 2 votes:

I am personally using biocoders.net: I created a private group where I can upload my documents, papers, snippets and code, and I use my group calendar to schedule my daily plans, etc.


Answered by Ying W (Los Angeles), 2.8 years ago · 2 votes:

Since someone revived this thread I figure I should add this in.

When organizing your files it is also important to keep reproducibility in mind. For R there is a package called Sweave that is useful for this; alternatives also exist for other languages.

Doing this might be useful for organizing the results/ and src/ directories
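
For example, from the command line a Sweave document can be turned into a report in two steps (the file name report.Rnw is just an assumption):

    # weave R code, its results and the surrounding text into one reproducible report
    R CMD Sweave report.Rnw     # runs the embedded R chunks and writes report.tex
    pdflatex report.tex         # typesets the final PDF with code and output included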


Answered by Faheemmitha, 2.6 years ago · 2 votes:

This project, currently called bixfile, is designed as a web-based file management system and is (I think) at least tangentially related to your question. The application lets the user upload files via a web interface. The locations of all files and folders are stored in a database, and annotation is possible.
