People in a laboratory working on the same project generate all kinds of files (FASTA, images, raw data, statistics, readme.txt, etc.) that end up spread across various directories. How do you manage the hierarchy of those directories?
On my local computer, I have:
Within each project folder, I have:
I use git for revision control, to keep a log of all the changes I make to scripts and results. Lately I have been reading about Sumatra and am planning to give it a try (a slideshow for the curious here).
I still have to decide exactly where to put .RData files, as I am still an R novice.
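For what it's worth, here is a minimal sketch of that kind of git setup. The directory names and the choice to keep .RData files out of the repository are only illustrative, not the poster's actual layout:

```bash
# Minimal sketch of versioning scripts and results with git.
# Directory names (scripts/, results/) are just an example.
cd project1
git init

# Optionally keep bulky binary .RData snapshots out of version control.
echo "*.RData" >> .gitignore

git add scripts/ results/ .gitignore
git commit -m "Initial import of analysis scripts and results"
```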
Note: you probably already know the PLoS article "A Quick Guide to Organizing Computational Biology Projects".
one related tip is a handy bash script I got from:
which produces directory-specific bash histories (instead of one giant global history).
Whenever I enter a directory I can easily access everything I ever did there, which is priceless when I am trying to remember what I actually did.
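For reference, a minimal sketch of how such a script can work (this is a reconstruction of the idea, not the original script): point bash's HISTFILE at a file in the current directory from PROMPT_COMMAND.

```bash
# Sketch: per-directory bash history via PROMPT_COMMAND (add to ~/.bashrc).
per_dir_history() {
    # Flush pending commands to the current history file, then switch
    # HISTFILE to one kept in the working directory and reload it, so
    # "history" shows only what was done in this directory.
    history -a
    export HISTFILE="$PWD/.dir_bash_history"
    history -c
    history -r
}
PROMPT_COMMAND="per_dir_history"
```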
The following article discusses this exact question, and gives useful tips (which I now follow):
William Stafford Noble 2009. A Quick Guide to Organizing Computational Biology Projects. PLoS Comput Biol 5(7): e1000424. http://dx.doi.org/10.1371/journal.pcbi.1000424
Great question, thanks!
In my opinion there are several layers of files, and a different approach should be applied at each level. Here is how it's organized in our lab.
We're setting up the bioinformatics tool epos right now. The tool is free and was made by Thasso Griebel at our university, with some help from us. It describes itself as a modular software framework for phylogenetic analysis and visualization.
The MySQL version isn't finished yet, but the program is neat and exactly what we needed to sort our stuff.
Dropbox (http://www.dropbox.com/) is a nice way of keeping all files synchronized. You can link it to multiple computers, and if you drop a file in it, it is automatically updated on all of them.
I have the following major directories:
/work - this is where I keep my project directories
/data - raw, unprocessed data
/software - 3rd party software required for the various work flows
/code - general code repo
I update the individual project directories under /work as follows:
/work → /work/project1 → sub-directories based on analysis (for example code, analysis, results, etc.)
Depending on how repetitive the analysis is, I create date-based directories to track files generated at different time points. I also keep README files alongside the directories in each project folder to make it easier to check the contents at a later stage. I'm a big fan of "tree" whenever I need to check the contents of a directory. Irrespective of the various data categories I deal with, this format has worked for me.
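To illustrate, setting up a new project under this scheme might look something like the following (the project name and sub-directory names are placeholders, not the poster's exact ones):

```bash
# Hypothetical example of starting a new project under /work.
project=project1
today=$(date +%Y-%m-%d)

# Analysis-based sub-directories, plus a dated directory for today's run.
mkdir -p /work/"$project"/{code,analysis,results}
mkdir -p /work/"$project"/analysis/"$today"

# A short README so the contents are still obvious later.
cat > /work/"$project"/README <<EOF
Project: $project
Created: $today
Layout : code/  analysis/<date>/  results/
EOF

# Quick check of the structure.
tree /work/"$project"
```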
We used to have a slightly ad-hoc hierarchy, split by organism and then by data type. This caused conflicts and generally didn't scale (I guess most people working on this are familiar with http://www.shirky.com/writings/ontology_overrated.html ?)
My goal is to have defined datasets (i.e. collections of related files), with each dataset residing in its own subdirectory. In addition to the data files, there will be a metadata file containing the relevant metadata: the list of files, their types, checksums, the program or process that generated them, the person responsible, and the relationships between datasets.
This way, applications like our BLAST server will be able to trawl the datasets, identify those containing FASTA files, and add them to the database with the correct type and meta-information.
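As a rough illustration of the kind of metadata file meant here (the file name and fields are made up for the example, not the actual schema):

```bash
# Sketch: build a per-dataset metadata table (file name, extension, checksum).
# Fields for provenance and responsible person would be added the same way.
dataset_dir=dataset_001
meta="$dataset_dir/METADATA.tsv"

printf "file\ttype\tsha1\n" > "$meta"
for f in "$dataset_dir"/*; do
    [ -f "$f" ] || continue
    [ "$f" = "$meta" ] && continue   # don't checksum the metadata file itself
    printf "%s\t%s\t%s\n" \
        "$(basename "$f")" "${f##*.}" "$(sha1sum "$f" | awk '{print $1}')" >> "$meta"
done
```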
A lot of the answers given here are very useful, but I would advocate another approach. Central to any wet-lab study is the study description itself. We want to capture the study design itself, including the sample and assay descriptions, using a Generic Study Capture Framework (GSCF) that is part of the systems biology database dbNP. In this we follow the same philosophy as the generic study description standard ISA-Tab (reading and writing ISA-Tab will be added to dbNP soon): the study description links to the actual raw, cleaned, statistically evaluated, and biologically interpreted data. This way you don't really have to structure where the files are, since you can simply find them through GSCF. GSCF is currently under development as part of the open source project dbNP.
Two papers about dbNP have been published here and here.
Of course, file location storage is just a small aspect of GSCF. It is mainly about ontology-based study capture using NCBO ontologies and queries based on them. It should also facilitate data submission to the EBI and NCBI repositories.
Since someone revived this thread I figure I should add this in.
When organizing your files it is also important to keep reproducibility in mind. For R there is a package called Sweave that is useful for this; alternatives exist for other languages as well.
Doing this might be useful for organizing the results/ and src/ directories.
This project, currently called bixfile, is designed as a web-based management system and is (I think) at least tangentially related to your question. The application lets users upload files via a web interface. The locations of all files and folders are stored in a database, and annotation is possible.