Question: How Does A Bioinformatics Scientist Document His/Her Work?
Anjan (810), United States, wrote 7.5 years ago:

My company is developing guidelines for employees to document their work, so that in the event of an external audit there is a clear trail of what was done. So, just as bench scientists fill in their lab notebooks, we bioinformatics folks are expected to record our work in a "notebook". The question is: what exactly is worth documenting? Should every minor change in a script be recorded? Or should a written summary of the tasks a given program accomplishes suffice? I would like to hear about your experiences in documenting your programming/data analysis work. Thanks, Anjan

bioinformatics workflow • 12k views
Written 7.5 years ago by Anjan (810)
Leonor Palmeira (3.7k), Liège, Belgium, wrote 7.5 years ago:

That is a very good question. I also frequently have the feeling that there is no parallel to "lab notebooks" in the field of bioinformatics. Namely, the linear/chronological property of lab notebooks is something that can be difficult to trace in bioinformatics.

There are, of course, version control systems such as git, svn or cvs, which are mainly used to keep track of modifications to code but can also be used to version a paper or other documents. I don't know of anyone using them to trace bigger projects, though.

So for all of my projects, I usually have a README file at the root of the project which states the goals and the main steps for the project. Then, I usually prepare a .bash script where I record all the steps as I develop the project. This script is heavily commented and some results might also be recorded here for further reference. I never run this script as such, but it contains all the steps and parameters I used for reproducibility purposes. However, the chronology of events is lost here, so I do not know exactly when I decided to add 'option x' to a program call.

When a project gets bigger, things get more complicated, and I don't have a very defined system to keep track of the chronology.

So I use:

  • version control for code and articles
  • README file to keep track of main project objectives
  • heavily commented .bash scripts to record analysis steps
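
One way to keep the chronology that a commented script loses is to run each step through a small wrapper that appends a timestamped record to a log. The following is only a minimal sketch in Python: the function name `run_logged`, the default log file name `analysis_log.tsv` and the tab-separated layout are assumptions made for illustration, not part of the workflow described above.

```python
import datetime
import shlex
import subprocess
from pathlib import Path


def run_logged(command, log_path="analysis_log.tsv", note=""):
    """Run a command (given as a list of arguments) and log when it was run.

    `run_logged` and the log layout are hypothetical names chosen for this sketch.
    """
    started = datetime.datetime.now().isoformat(timespec="seconds")
    result = subprocess.run(command, capture_output=True, text=True)
    # Keep the free-text note on a single tab-separated line.
    clean_note = note.replace("\t", " ").replace("\n", " ")
    # One line per step: start time, exit code, exact command, note.
    record = "\t".join(
        [started, str(result.returncode), shlex.join(command), clean_note]
    )
    with Path(log_path).open("a", encoding="utf-8") as handle:
        handle.write(record + "\n")
    return result
```

A call such as `run_logged(["sort", "genes.txt"], note="added option x")` (file name hypothetical) would then leave a dated line in the log, so the question of when an option was added can be answered from the log itself.
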
Written 7.5 years ago by Leonor Palmeira (3.7k)

To keep track of the chronology of the bash script / README, the "obvious" solution would be to put them under version control as well. Even if you don't keep good commit messages, it'll help to track down "when did I decide to change parameter X?"

I'm still working out my methodology, but when in doubt I make a fresh repository for every project and check everything I write into it.

Reply written 7.5 years ago by Fwip (490)

+1 you're right. I should start using a version control system for the bash script and README files :-)

Reply written 7.5 years ago by Leonor Palmeira (3.7k)

About the chronology: you can add "map de !!date<cr>" to your .vimrc. Then you can type d e in vi to insert the date.

Reply written 7.5 years ago by Zhilong Jia (1.5k)
Daniel Swan (13k), Aberdeen, UK, wrote 7.5 years ago:

We have multiple ISO certifications at work and these do require that we adhere to certain systems for auditing purposes.

Although I host my own notes in a personal wiki (version controlled) and we have svn or git repositories for our code development work, we also have separate wikis for changes to underlying hardware systems or software configurations.

We also make use of CERF, which is a proper ELN (electronic lab notebook) for full accountability and more formalised planning.

Our audits tend to focus on whether we are adhering to the defined processes. The technology underlying them is less important than making sure that whatever you use to do it, you actually do it.

Written 7.5 years ago by Daniel Swan (13k)

"The technology underlying them is less important than making sure that whatever you use to do it, you actually do it."

+1. I can't say how many times I've had colleagues get excited about how such and such software/app (a new wiki, Google Wave, whatever) is going to make documentation so much easier. No matter how nice the documentation software is, it's useless if it's not used. Most documentation problems are people problems, not technical ones.

Reply written 7.5 years ago by Daniel Standage (3.9k)

+1. That is exactly why I think documentation should be as easy and flexible as possible; anything too restrictive ends up meaning more work and less compliance. We have seen several electronic lab journal / LIMS-based systems fail for microarrays in the past...

Reply modified 7.5 years ago • written 7.5 years ago by ALchEmiXt (1.9k)
seidel (6.8k), United States, wrote 7.5 years ago:

This is a good question and has been asked before: how-do-you-log-details-of-data-processing-pipelines-in-silico-analyses-performed

There is a handy little write-up on the topic in PLoS Computational Biology: Noble WS (2009) A Quick Guide to Organizing Computational Biology Projects

I use a variety of methods and try to document things in the typical scientific way, such that anyone stumbling upon my work can know what I did, why I did it, what the results were, and how to repeat it. I use a directory-based approach, such that each project is typically contained in a directory (no web server required). I use conventions such that every directory will always have certain files with certain names (e.g. an analysis document describing the project, or a README text file with various commands).

Modified 7.5 years ago • written 7.5 years ago by seidel (6.8k)
Pierre Lindenbaum (124k), France/Nantes/Institut du Thorax - INSERM UMR1087, wrote 7.5 years ago:

When I need to write my notes, I use our local wiki or

My code is hosted on Google (SVN) and Github (GIT).


Written 7.5 years ago by Pierre Lindenbaum (124k)
Darked89 (4.2k), Barcelona, Spain, wrote 7.5 years ago:

Having a private wiki accessible to yourself and your collaborators works wonders for a shared understanding of what was done, when, and how.

You need to document input data, including database versions and probably also when a given file was obtained. It will not hurt to have md5 hashes for anything exchanged or used (I got burned a few times by not-so-reliable file transfers).
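
As a rough illustration of the md5 suggestion, the sketch below walks a data directory and writes one checksum line per file. The names `md5sum`, `write_manifest` and the idea of a manifest file are assumptions made for this example; the line layout (hash, two spaces, relative path) is the one the GNU `md5sum -c` tool reads, so the receiving side could check the files by running that tool from inside the data directory.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to spare memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(data_dir, manifest_path):
    """Write '<md5>  <relative path>' for every file under data_dir.

    Hypothetical helper for this sketch; keep the manifest outside data_dir,
    otherwise later runs would also hash the manifest itself.
    """
    data_dir = Path(data_dir)
    lines = []
    for path in sorted(p for p in data_dir.rglob("*") if p.is_file()):
        relative = path.relative_to(data_dir).as_posix()
        lines.append(f"{md5sum(path)}  {relative}")
    Path(manifest_path).write_text("\n".join(lines) + "\n", encoding="utf-8")
    return lines
```

Committing the manifest next to the README gives a dated record of exactly which input files an analysis used.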

All the data processing steps also need to be recorded: first in the wiki, then preferably automated as a script or Makefile.

As for the scripts, as long as you have some version control system (git, svn) and bother to write commit comments for your changes, there is no need to duplicate that record elsewhere.

In summary: record everything in such detail that a guy across the ocean will be able to repeat it without pestering you with emails or phone calls.

Written 7.5 years ago by Darked89 (4.2k)

Yes, I did this for my lab and it is very popular.

Reply written 7.5 years ago by Burlappsack (660)
Jeremy Leipzig (18k), Philadelphia, PA, wrote 7.5 years ago:

I think the phrases reproducible research and literate programming are critical here, since they go hand-in-hand with good programming practices in bioinformatics.

Reproducible research means that anyone can reproduce your analyses without undue effort. It should be a requirement for publishing.

Literate programming means that your code is literally (no pun intended) mixed in with language that describes its output that is visible in the report. This differs from commenting and documentation in that you have bound the end result to the code that produced it, which makes it easy for people to understand how you arrived at that result. Some languages are better at enabling literate programming than others.

Modified 7.5 years ago • written 7.5 years ago by Jeremy Leipzig (18k)
Paulo Nuin (3.7k) wrote 7.5 years ago:

log files, bash history, README files, git (or any other version control system)

Written 7.5 years ago by Paulo Nuin (3.7k)
ALchEmiXt (1.9k), The Netherlands, wrote 7.5 years ago:

Our code is hosted and shared using SVN repositories (including ticketing for anything on the production server).

For projects, we document what we did, as well as the most important results, in Evernote. In there we paste any results, tables, figures and small scripts as attachments. For patenting issues we make regular prints, which we glue (yes, you read that right :-( ) into a classical notebook. Evernote allows sharing and tagging of notes, and they can be organised quite nicely. In addition, it's cross-platform, so I have access to all my notes wherever I can use my phone.

For larger workflows we have decided to go the Galaxy way. It keeps track of most of what we did for NGS, and our own code is usually written such that we can plug it into that system with minimal effort.

Written 7.5 years ago by ALchEmiXt (1.9k)
Giovanni M Dall'Olio (26k), London, UK, wrote 7.5 years ago:

I use version control software, usually hg, to record the changes to scripts and datasets. It feels a bit silly at the beginning, but once you get used to it, it helps you to code better, because you are forced to describe what you are doing in the commit message.

Each command executed on the command line is usually automated in a Makefile or a Rakefile. See this tutorial on Software Carpentry:

For documentation and to-dos, I have lately been using Trello. It is nice and more intuitive than issue-tracking software.

Written 7.5 years ago by Giovanni M Dall'Olio (26k)
Daniel Standage (3.9k), Davis, California, USA, wrote 7.5 years ago:

@LeonorPalmeira gave an excellent answer. The only thing I would recommend adding to hers is to keep a blog where you occasionally take time to (more formally) document progress made. I typically use text files (such as READMEs) or my private wiki to keep track of hour-to-hour and day-to-day developments in my projects. But occasionally I come across a problem that requires me to do some research. Investing some time to formulate an elegant solution to a new problem is rewarding, but I never feel complete until I take a moment to write a blog post about it. This forces me to think about the problem in a more general context (how is it applicable to other scientists?) and to describe the problem in clear (and, if possible, layman's) terms. Not only does this provide a great complement to the more mundane documentation I record using READMEs and my wiki, but it also provides a starting point for me when I begin preparing manuscripts for publication.

Written 7.5 years ago by Daniel Standage (3.9k)
benjwoodcroft (110) wrote 7.5 years ago:

For day-to-day personal work, I have for many years used TiddlyWiki, since it is very fast to edit and administer. To edit, one only has to double-click, and saving is ctrl-enter (cf. a MediaWiki, which requires you to load an entirely new page to edit). Also, the entire wiki and your entries are kept inside a single HTML file; there's no database backend. This means it can itself be kept under version control, transferred between work and home, kept on Dropbox, etc.

Given its ease, I've found that less computationally trained biologists don't have much trouble using it after initial setup.

Written 7.5 years ago by benjwoodcroft (110)

+1 I love tiddlywiki - took me a while to move away from dokuwiki, but it's ideal for my version control/multiple computers mode of working.

Reply written 7.4 years ago by Daniel Swan (13k)