My company is developing guidelines for employees to document their work so that in the event of an external audit there is a clear trail of what was done. So, just as bench scientists fill in their lab notebooks, we bioinformatics folks are expected to record our work in a "notebook". The question is: what exactly is worth documenting? Should every minor change in a script be recorded? Or should a written summary of what tasks a given program accomplishes suffice? I would like to hear about your experiences in documenting your programming/data analysis work. Thanks, Anjan
That is a very good question. I also frequently have the feeling that there is no parallel to "lab notebooks" in the field of bioinformatics. Namely, the linear/chronological property of lab notebooks is something that can be difficult to trace in bioinformatics.
There are, of course, version control systems such as git, svn or cvs, which are mainly used to keep track of modifications to code but can also be used to version a paper or other documents. I don't know of anyone using them to trace bigger projects, though.
So for all of my projects, I usually have a README file at the root of the project which states the goals and the main steps for the project. Then, I usually prepare a .bash script where I record all the steps as I develop the project. This script is heavily commented and some results might also be recorded here for further reference. I never run this script as such, but it contains all the steps and parameters I used for reproducibility purposes. However, the chronology of events is lost here, so I do not know exactly when I decided to add 'option x' to a program call.
When a project gets bigger, things get more complicated, and I don't have a very defined system to keep track of the chronology.
So I use:
- version control for code and articles
- README file to keep track of main project objectives
- heavily commented .bash scripts to record analysis steps
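One way to recover the chronology that a hand-maintained script loses is to let a small wrapper append a timestamped line to a log every time a command is run. The following is only a minimal sketch of that idea, not the system described above: the function name `run_logged` and the default log file `commands.log` are hypothetical, and the tab-separated record format is an assumption.

```python
import datetime
import shlex
import subprocess
from pathlib import Path


def run_logged(command, log_path="commands.log"):
    """Run a command (given as a list of arguments) and log when it was run.

    One tab-separated line is appended to log_path per call:
    start time, exit status, and the command as it could be typed in a shell.
    The log file name and record format are assumptions for illustration.
    """
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(command, capture_output=True, text=True)
    record = f"{started}\texit={result.returncode}\t{shlex.join(command)}\n"
    # Append rather than overwrite, so the file keeps the order of events.
    with Path(log_path).open("a", encoding="utf-8") as log:
        log.write(record)
    return result
```

Calling, for example, `run_logged(["sort", "-u", "ids.txt"])` (the file name is made up) would run the command and leave one dated line behind, so the question of when 'option x' was added can be answered from the log. Keeping that log under version control next to the README would tie it to the rest of the project.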
We have multiple ISO certifications at work and these do require that we adhere to certain systems for auditing purposes.
Although I host my own notes in a personal wiki (version controlled) and we have svn or git repositories for our code development work, we also have separate wikis for changes to underlying hardware systems or software configurations.
We also make use of CERF (http://elntech.com/cerf-software/) which is a proper ELN for full accountability and more formalised planning.
Our audits tend to focus on whether we are adhering to the defined processes. The technology underlying them is less important than making sure that whatever you use to do it, you actually do it.
This is a good question and has been asked before: how-do-you-log-details-of-data-processing-pipelines-in-silico-analyses-performed
There is a handy little write-up on the topic in PLoS Computational Biology: Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects
I use a variety of methods and try to document things in the typical scientific way such that anyone stumbling upon my work can know what I did, why I did it, what the results were, and how to repeat it. I use a directory-based approach, such that projects are typically contained in a directory (no web server required). I use conventions such that every directory will always have certain files with certain names (e.g. an analysis document describing the project, or a README text file with various commands).
Having a private wiki accessible to yourself and your collaborators works wonders for a shared understanding of what was done, when and how.
You need to document input data, including database versions, and probably also when a given file was obtained. It will not hurt to have md5 hashes for anything exchanged or used (I got burned a few times by unreliable file transfers).
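As a rough illustration of the md5 suggestion, the sketch below walks a data directory and writes one checksum line per file. The function names and the manifest file name `MANIFEST.md5` are hypothetical; the `hash  path` line layout is chosen to match what the `md5sum` command-line tool prints, on the assumption that you may want to verify the files with that tool later.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to spare memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_name="MANIFEST.md5"):
    """Write 'digest  relative/path' lines for every file under data_dir.

    The manifest name is an assumption; the manifest itself is skipped
    so that re-running the function does not checksum its own output.
    """
    data_dir = Path(data_dir)
    manifest = data_dir / manifest_name
    entries = []
    for path in sorted(data_dir.rglob("*")):
        if path.is_file() and path != manifest:
            relative = path.relative_to(data_dir).as_posix()
            entries.append(f"{md5sum(path)}  {relative}")
    manifest.write_text("\n".join(entries) + "\n", encoding="utf-8")
    return manifest
```

Sending the manifest along with the data lets the recipient recompute the digests and spot a corrupted transfer before any analysis starts. The file modification time and the database version could be recorded next to each entry in the same way.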
All the data processing steps also need to be recorded. First in the wiki, then preferably automated as a script/make file.
As for the scripts, as long as you have some version control system (git, svn) and bother to comment your changes there, there is no need to duplicate that record elsewhere.
In summary: record everything in such detail that a guy across the ocean will be able to repeat it without pestering you with emails or phone calls.
I think the phrases reproducible research and literate programming are critical here, since they go hand-in-hand with good programming practices in bioinformatics.
Reproducible research means that anyone can reproduce your analyses without undue effort. It should be a requirement for publishing.
Literate programming means that your code is literally (no pun intended) mixed in with the prose that describes its output, and both are visible in the report. This differs from commenting and documentation in that you have bound the end result to the code that produced it, which makes it easy for people to understand how you arrived at that result. Some languages are better at enabling literate programming than others.
Our code is hosted and shared using SVN repositories (including ticketing for anything on the production server).
For projects we document what we did, as well as the most important results, in Evernote. In there we paste any results, tables, figures and small scripts as attachments. For patenting issues we make regular prints which we glue (yes, you read that right :-( ) into a classical notebook. Evernote allows the sharing and tagging of notes, and they can be organised quite nicely. In addition, it's cross-platform, so I have access to all my notes even on my phone.
For larger workflows we have decided to go the Galaxy way. It keeps track of most of what we did for NGS, and our own code is usually written such that we can plug it into that system with minimal effort.
I use version control software, usually hg, to record the changes to scripts and datasets. It feels a bit silly at the beginning, but once you get used to it, it helps you to code better, because you are forced to describe what you are doing in the commit message.
Each command executed on the command line is usually automated in a Makefile or a Rakefile. See this tutorial on Software Carpentry: http://software-carpentry.org/4_0/make/intro/
For documentation and to-dos, lately I have been using Trello. It is nice and more intuitive than an issue tracker.
@LeonorPalmeira gave an excellent answer. The only thing I would recommend adding to hers is to keep a blog where you occasionally take time to (more formally) document progress made. I typically use text files (such as READMEs) or my private wiki to keep track of hour-to-hour and day-to-day developments in my projects. But occasionally I come across a problem that requires me to do some research. Investing some time to formulate an elegant solution to a new problem is rewarding, but I never feel complete until I take a moment to write a blog post about it. This forces me to think about the problem in a more general context (how is it applicable to other scientists?) and to describe the problem in clear (and, if possible, layman's) terms. Not only does this provide a great complement to the more mundane documentation I record using READMEs and my wiki, but it also gives me a starting point when I begin preparing manuscripts for publication.
For day-to-day personal work, I have for many years used TiddlyWiki, since it is very fast to edit and administer. To edit, one only has to double-click, and to save is ctrl-enter (cf. a MediaWiki, which requires you to load an entirely new page to edit). Also, the entire wiki and your entries are kept inside a single HTML file - there's no database backend. This means it can itself be kept under version control, transferred between work and home, kept on Dropbox, etc.
Given its ease, I've found that less computationally trained biologists don't have much trouble using it after initial setup.