Question: How Do You Manage Your Files & Directories For Your Projects?
7.1 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum94k wrote:

People in a laboratory, working on the same project, generate all kinds of files (fasta, images, raw data, statistics, readme.txt, etc.) that end up moved into various directories. How do you manage the hierarchy of those directories?

  • there is no standard hierarchy and the files are dropped anywhere; it all relies on common knowledge.
  • there is a clearly defined hierarchy (PROJECT_NAME/DATE/machine_user_image_result_but_this_is_the_second_run_because_1st_failed.txt...)
  • files are uploaded to a wiki (you wouldn't do that for large files)
  • there is a central file/wiki recording what a file is and where it lives
  • there is a Readme.txt/describe.xml in each folder.
  • there is a tool (?) for managing this kind of information
  • (...) ?



file • 18k views
ADD COMMENTlink written 7.1 years ago by Pierre Lindenbaum94k

currently, it's a mess :-)

ADD REPLYlink written 6.5 years ago by Pierre Lindenbaum94k

Pierre: I am wondering how you are managing your files & directories?

ADD REPLYlink written 6.5 years ago by Khader Shameer17k

Why do you need files anyway? Files are from the 70's, not going to scale these days.

ADD REPLYlink written 5.8 years ago by michaelhavner0
7.1 years ago by
London, UK
Giovanni M Dall'Olio25k wrote:

In my local computer, I have:

  • a 'workspace' folder, in which each sub-folder corresponds to a separate project
  • a 'data' folder where I put all the data used by more than one project
  • an 'archive' folder with all finished projects

Within each project folder, I have:

  • planning/ -> a folder containing all the files related to the early phase of the project. Usually this is the first folder I create, and here I store all the miscellaneous files (notes/objectives/initial drafts) that I collect in the first weeks of a project, when I am still not sure which programs to write.
  • bugs/ -> I used to use ditz to keep track of bugs and to-dos, but now I just use hand-written A7 paper notes
  • data/
    • folders containing the different data I need to use, soft-linked from ~/data
  • parameters/ -> ideally, I should have configuration files here so that if I want to run my analysis on another dataset, I only have to change the parameters
  • src/ -> with all code
    • a Makefile to re-run all the analysis I wish
    • scripts/ with all the scripts
    • lib/ if I am reusing code from other projects
    • pipelines/ with all .mk (makefile) files
  • results/
    • tables/ -> tabular-like results
    • plots/ -> plots
    • manuscript/ -> draft for the manuscript, final figures and data, etc..
      • figures/
      • tables/
      • references/
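A layout like the one above can be scaffolded in one go with a small shell snippet. The directory names come from the list; the project name and the temporary sandbox are illustrative, not part of the answer:

```shell
# Scaffold the per-project layout described above. "my_project" is an
# illustrative name; the sandbox keeps the example side-effect free.
cd "$(mktemp -d)"
proj=workspace/my_project
mkdir -p "$proj/planning" "$proj/bugs" "$proj/data" "$proj/parameters"
mkdir -p "$proj/src/scripts" "$proj/src/lib" "$proj/src/pipelines"
mkdir -p "$proj/results/tables" "$proj/results/plots"
mkdir -p "$proj/results/manuscript/figures" \
         "$proj/results/manuscript/tables" \
         "$proj/results/manuscript/references"
touch "$proj/src/Makefile"   # top-level Makefile to re-run the analysis
```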

I use git for revision control, to keep a log of all the changes I make to scripts and results. Lately I have been reading about Sumatra and am planning to give it a try (a slideshow for the curious here)

I still have to decide where to put .RData files, as I am still a novice at R.

note: you probably already know the PLoS article "A Quick Guide to Organizing Computational Biology Projects"

ADD COMMENTlink modified 5.7 years ago • written 7.1 years ago by Giovanni M Dall'Olio25k

I put RData files with my "code" files (so I load them when I open the relevant code files).

ADD REPLYlink written 7.1 years ago by Tal Galili120

I usually keep RData files in a "data" directory and use setwd() in my R script to point to it. Doesn't really matter how you do it, so long as source() finds the R script and the R script finds the data.

ADD REPLYlink written 7.1 years ago by Neilfws47k

thanks!! That is more or less what I am doing now, but I am not sure whether I should create a separate 'Rdata' directory.

ADD REPLYlink written 7.1 years ago by Giovanni M Dall'Olio25k

I only store useful or large objects (like huge matrices from aCGH) in .RData files, which then take less disk space. So sometimes .RData files replace the original text files in my 'data' folder. This was just a remark...

ADD REPLYlink written 7.1 years ago by toni2.1k

I asked this question 8 months ago. It's time to validate the most voted answer :-)

ADD REPLYlink written 6.4 years ago by Pierre Lindenbaum94k

thank you very much :-)

ADD REPLYlink written 6.4 years ago by Giovanni M Dall'Olio25k
7.1 years ago by
Philadelphia, PA
Jeremy Leipzig16k wrote:

One related tip is a handy bash script I got from:

which produces directory-specific bash histories (instead of one giant global history).

Whenever I enter a directory I can easily access everything I ever did there, which is priceless when I am trying to remember what I actually did.
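The link to the script itself was lost from the post, but the idea can be sketched in a few lines of bash. This is a sketch only; the original script may differ in detail:

```shell
# Per-directory bash history, to be sourced from ~/.bashrc. On every
# prompt, if the working directory changed, point HISTFILE at a
# .bash_history inside the new directory and reload it.
_dirhist() {
    if [ "$PWD" != "${_DIRHIST_LAST:-}" ]; then
        if [ -n "${_DIRHIST_LAST:-}" ]; then
            history -a                  # flush commands to the old directory's file
        fi
        export HISTFILE="$PWD/.bash_history"
        history -c                      # clear the in-memory history list
        if [ -f "$HISTFILE" ]; then
            history -r                  # load this directory's history
        fi
        _DIRHIST_LAST="$PWD"
    fi
}
PROMPT_COMMAND=_dirhist
```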

ADD COMMENTlink written 7.1 years ago by Jeremy Leipzig16k

a version that allows multiple users to view each other's histories:

ADD REPLYlink modified 3.8 years ago • written 6.4 years ago by Jeremy Leipzig16k
6.6 years ago by
Casbon3.1k wrote:

I follow this data plan

ADD COMMENTlink written 6.6 years ago by Casbon3.1k

Fantastic! Thanks for sharing!

ADD REPLYlink written 6.5 years ago by None90

Excellent - made me laugh! Or was that cry.

ADD REPLYlink written 6.3 years ago by Niallhaslam2.2k
6.4 years ago by
Leiden, the Netherlands
Rvosa560 wrote:

The following article discusses this exact question, and gives useful tips (which I now follow):

William Stafford Noble 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424.

ADD COMMENTlink written 6.4 years ago by Rvosa560
7.1 years ago by
Bethesda, MD
Yuri1.4k wrote:

Great question, thanks!

In my opinion there are several layers of files, and different approaches should be applied at each level. Here's how it's organized in our lab.

  1. Raw data (microarrays, for example)
    • Files are named and stored in clearly defined schema
    • Regular backup is mandatory
    • Some files which we probably will not use again (like Affymetrix DAT files) are archived.
    • Access to the files is controlled
    • General information on experiments is stored in LIMS (we are using Labmatrix, but it's commercial)
    • We also store some preprocessed data (normalization, for example) if the procedure is clearly defined as SOP.
  2. Temporary data (ongoing analysis)
    • Basically everybody is on their own here. The files are usually stored locally and everyone is responsible for their own backups. I can access the data I need remotely (from home, for example).
    • I do keep some hierarchy based on projects, data type and analysis, but it's not strict and project-dependent.
    • I have found Total Commander very useful for file management. For example, I can write a small comment for every file (Ctrl-Z); it is stored in a text file, and if I copy or move a file, the description goes with it.
    • Files to be shared within the project team are kept on a network shared drive with regular backups.
  3. Results to share (documents, figures, tables, ...)
    • We are using Backpack from 37signals. It's like a wiki, but a little easier for non-tech users. Together with Basecamp for project management it's quite good; however, it's again commercial and may not suit everybody.
ADD COMMENTlink written 7.1 years ago by Yuri1.4k
7.1 years ago by
Paris, France
Fred Fleche4.1k wrote:
  • For analyzed results, documentation and presentations (pdf, ppt, doc, xls) we are using eRoom, provided by EMC Corporation
  • For experimental results it is a mix of your bullets 1 and 2: there is a clearly defined hierarchy, but after a while it all relies on common knowledge to retrieve information when you need it urgently.
  • Some groups are experimenting with an ELN provided by CambridgeSoft
  • We are also trying to create small social databases (e.g. an antibody database where people can share/retrieve their Western Blot experiments, in order to avoid different people testing the same antibodies - they are able to "rate" the antibodies tested.)
ADD COMMENTlink written 7.1 years ago by Fred Fleche4.1k

can EMC be accessed programmatically or via the command line?

ADD REPLYlink written 7.1 years ago by Jeremy Leipzig16k

I don't know. Up to now I always used a web browser.

ADD REPLYlink written 7.1 years ago by Fred Fleche4.1k
6.5 years ago by
M.Eckart80 wrote:

We're trying to set up the bioinformatics tool Epos right now. The tool is free and was made by Thasso Griebel at our university, with some help from us. It describes itself as a modular software framework for phylogenetic analysis and visualization.

The MySQL version isn't finished yet, but the program is cool and exactly what we needed to sort our stuff.

ADD COMMENTlink written 6.5 years ago by M.Eckart80

It looks like Epos is a system for managing phylogenetic data and analyses only. I think Pierre is asking a general question about how to manage different types of bioinformatics projects. Is Epos extensible to other use cases besides phylogenetics?

ADD REPLYlink written 6.5 years ago by Casey Bergman17k

It was written primarily for phylogenetic analyses, but there are powerful interfaces to other programs for different types of projects. And of course you can write your own, although that is usually not what a typical user wants to do. I think it's worth a look because it's new and not yet well known, and it comes with nice features like its own script editor and management of cluster analyses. We were impressed, but as you mentioned, only for our phylogenetic analyses... So thanks for your comment.

ADD REPLYlink written 6.5 years ago by M.Eckart80
6.7 years ago by
Niek De Klein2.3k wrote:

Dropbox is a nice way of keeping all files synchronized. You can invite computers to it, and if you drop a file in, it will automatically be updated on all invited computers.

ADD COMMENTlink written 6.7 years ago by Niek De Klein2.3k
as well as being a great way of sharing files with another person

ADD REPLYlink written 6.7 years ago by Yannick Wurm2.2k

Take a look at SparkleShare, an open source alternative that lets you use your space on GitHub or Gitorious to share files, and also uses git to do the versioning of the files.

ADD REPLYlink written 6.7 years ago by Giovanni M Dall'Olio25k

sorry, the correct link is

ADD REPLYlink written 6.7 years ago by Giovanni M Dall'Olio25k
7.1 years ago by
Manhattan, NY
Khader Shameer17k wrote:

I have the following major directories:

/work - this is where I keep my project directories

/data - raw, unprocessed data

/software - 3rd party software required for the various workflows

/code - general code repo

I update the individual work directories as follows: within each project directory I create sub-directories based on the analysis - for example code, analysis, results, etc. Depending on the repetitive nature of the analysis, I create date-based directories to track files generated at different time points. I also keep README files in the directories of a project to make it easier to check the contents at a later stage. I am a big fan of "tree" whenever I need to check the contents of a directory. Irrespective of the various data categories I deal with, this format has worked for me.
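A hypothetical sketch of this scheme; the project and file names are illustrative, only the structure (analysis sub-directories, date-stamped runs, per-directory README) comes from the answer:

```shell
# One project under work/ with analysis sub-directories, a date-stamped
# directory per run, and a README. Names are illustrative.
cd "$(mktemp -d)"                          # sandbox for the example
base=work/variant_calling
mkdir -p "$base/code" "$base/results"
run="$base/analysis/$(date +%Y-%m-%d)"     # one directory per run date
mkdir -p "$run"
echo "variant calling pipeline; runs under analysis/<date>/" > "$base/README"
find work -type d | sort                   # or 'tree work' if installed
```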

ADD COMMENTlink written 7.1 years ago by Khader Shameer17k
gravatar for Ketil
6.6 years ago by
Ketil3.8k wrote:

We used to have a slightly ad-hoc hierarchy, split by organism and then by data type. This causes conflicts and generally doesn't scale (I guess most people working on this are familiar with ?)

My goal is to have defined datasets (i.e. collections of related files), with each dataset residing in its own subdirectory. In addition to the data files, there will be a metadata file containing the relevant metadata: the list of files, their types, checksums, the program or process that generated them, the person responsible, and the relationships between datasets.

This way, applications like our BLAST server will be able to trawl the datasets, identify those containing FASTA files, and add them to the database with the correct type and meta-information.
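A minimal sketch of such a per-dataset metadata file. The answer doesn't fix a format, so the field names here are illustrative; a real implementation might use XML/JSON and richer provenance fields:

```shell
# Generate a simple METADATA file for a dataset directory: free-form
# provenance fields plus the file list with types and checksums.
# Field names and file names are illustrative.
cd "$(mktemp -d)" && mkdir dataset
printf '>seq1\nACGT\n' > dataset/reads.fasta       # toy data file
{
  echo "responsible: jdoe"                         # hypothetical fields
  echo "generated-by: assembly pipeline v1"
  echo "files:"
  for f in dataset/*.fasta; do                     # one line per file
    printf '  %s  type=fasta  cksum=%s\n' "${f##*/}" "$(cksum "$f" | cut -d' ' -f1)"
  done
} > dataset/METADATA
cat dataset/METADATA
```

A BLAST server could then scan dataset directories for METADATA files advertising FASTA entries, as the answer describes.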

ADD COMMENTlink written 6.6 years ago by Ketil3.8k

Some more details at

ADD REPLYlink written 6.0 years ago by Ketil3.8k
6.2 years ago by
Maastricht, The Netherlands
Chris Evelo9.9k wrote:

A lot of the answers given here are very useful, but basically I would advocate another approach. Central to any wet-lab study is the study description itself. We capture the design, including the sample and assay descriptions, using a Generic Study Capture Framework (GSCF) that is part of the systems biology database dbNP. In this we follow the same philosophy as the generic study description standard ISA-Tab (reading and writing ISA-Tab will be added to dbNP soon): the study description links to the actual raw, cleaned, statistically evaluated, and biologically interpreted data. This way you don't really have to structure where the files are, since you can just find them from GSCF. GSCF is currently under development as part of the open source project dbNP.

Two papers about dbNP were published here and here.

Of course, file location storage is just a small aspect of GSCF. It is mainly about ontology-based study capture using NCBO ontologies, and queries based on that. It should also facilitate data submission to the EBI and NCBI repositories.

ADD COMMENTlink written 6.2 years ago by Chris Evelo9.9k
5.9 years ago by
Radhouane Aniba740 wrote:

I am personally using ; I create a private group where I can upload my documents, papers, snippets and code, and I use my group calendar to schedule my daily plans, etc.

ADD COMMENTlink written 5.9 years ago by Radhouane Aniba740
5.9 years ago by
South San Francisco, CA
Ying W3.5k wrote:

Since someone revived this thread I figure I should add this in.

When organizing your files it is also important to keep reproducibility in mind. For R there is a package called Sweave that is useful for this; alternatives also exist for other languages:

Doing this might be useful for organizing the results/ and src/ directories

ADD COMMENTlink modified 5.2 years ago • written 5.9 years ago by Ying W3.5k
5.7 years ago by
Faheemmitha190 wrote:

This project, currently called bixfile, is designed as a web-based file management system, and is (I think) at least tangentially related to your question. The application lets the user upload files via a web interface. The locations of all files and folders are stored in a database, and annotation is possible.

ADD COMMENTlink written 5.7 years ago by Faheemmitha190


Powered by Biostar version 2.3.0