Question

How Do You Log Details Of Data Processing / Pipelines / In Silico Analyses Performed?

12.9 years ago

Pi ▴ 520

Hello

Is there an online log book that a team analysing a data set can use to store details of the pipelines run on it (commands, software versions)? I forever seem to be asking colleagues how they generated the data they hand me, and it seems more than prudent to log this in a centralised resource. There is a wiki, but I think the informality of this resource means people aren't strict about its use.

thank you for your time

Edit: I should point out that the members of this team are geographically very dispersed, so a notebook is out of the question, and even access to a shared machine space is unlikely as we are all from different institutions. I was thinking of an online log more formal than a wiki.

data galaxy • 7.2k views

Ram · Answer 1 · 2011-05-21

If you look around most wet labs, you'll see lab notebooks. These document what was done, and often contain the results if the experiment's result was memorialized as a photo (of a blot, for example). Individual researchers might vary in their diligence, but writing down what you did is part of doing science. Some groups use electronic notebooks, but most academic scientists I know use paper. Bioinformatics protocols should follow the same rules. The notes will probably be electronic, but they still have to be there.

I prefer to write pipeline scripts and keep them with the finished data. Every microarray or genotype dataset I handle has a CREATE.R script that takes raw data and produces the finished pipeline product. I can hand someone the raw data and CREATE.R and they'll get my output. In theory I could bank my scripts somewhere central, but so long as the process is documented somewhere I'm covered. See also: biostar on reproducible research, and sites on the web.

If you don't know how the data were transformed, you will get bitten by it. In my experience, the more systematic and hard-core you are about it, the more dividends it pays over time.
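As a rough illustration of keeping the record next to the finished data, here is a minimal Python sketch that runs one pipeline step and writes a small JSON provenance file (command, start time, exit code, input checksums, tool versions). It is not the CREATE.R script described above; the function names, the record layout, and the `tool_versions` argument are all assumptions made for this example.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def run_and_record(command, inputs, record_path, tool_versions=None):
    """Run `command` (a list of strings) and write a JSON provenance record.

    `tool_versions` is a hypothetical, caller-supplied mapping such as
    {"bwa": "0.5.9"}; the sketch does not try to detect versions itself.
    """
    started = datetime.now(timezone.utc).isoformat()
    result = subprocess.run(command, capture_output=True, text=True)
    record = {
        "command": list(command),
        "started_utc": started,
        "exit_code": result.returncode,
        "tool_versions": dict(tool_versions or {}),
        # Checksums let a colleague confirm they hold the same raw data.
        "inputs": {str(p): sha256_of(p) for p in inputs},
    }
    Path(record_path).write_text(json.dumps(record, indent=2) + "\n")
    return record
```

Because the record is plain JSON, it can be committed, emailed, or pasted into a wiki page alongside the results it describes.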

score 4 · Answer 2 · 2011-05-21

12.9 years ago

Jan Kosinski ★ 1.6k

In my experience, what works best is a GENERAL notebook tool that gives you complete freedom over the syntax and form of your notes.

I use the MoinMoin wiki for notes with code (because it gives syntax highlighting etc.), and Evernote for other notes, TODO lists, and a labbook that lists what I've done every day. Both tools offer plenty of ways to share this data.

In my opinion, imposing any strict conventions is a bad idea, as everybody, especially bioinformaticians, prefers their own style of taking notes and remembering things. I think the success of the typical wet lab notebook lies in giving you a blank space and the tools to do whatever you want with it.

Strict is a loose term. A convention could be as simple as "document what you did" so that a reasonable person doesn't have to spend valuable time doing forensic bioinformatics to reverse engineer a directory full of results just to make sure they understand how they were derived. Most bioinformaticians I work with document nothing. If you're working as part of a group, conventions are important. Exactly how they are spelled out is less important than having them at all; doing nothing just leaves problems behind. If people can't understand what you did, it doesn't count. (BTW, I too use MoinMoin)

ADD REPLY • link 12.9 years ago by seidel 11k

You're right, everyone should document what they did, for their own good, so they can re-use their own protocols. Documenting everything for others would take too much time, though, and you cannot know which of your protocols will be of any use to others. As a solution, other people in your group should know more or less what you do and which tools you use; then, if they need something, they ask, and if you can help, you send them the protocol with explanations added.

ADD REPLY • link 12.9 years ago by Jan Kosinski ★ 1.6k

score 3 · Answer 3 · 2011-05-21

I tend to rely on a mixture of conventions and tools. Coming from a wet lab environment, and having an interest in the history of intellectuals, I have a fondness for notebooks. Simple and robust. But useless in an electronic environment. However, the idea of notebooks is still essential, and can be implemented in the file system with a few conventions to keep it from being too dependent on any one kind of implementation (such as an online log book). Styles will vary from person to person.

For me, every project gets a directory with a short systematic ID. I define projects as a "work unit of science" - a question to be addressed by a group of samples, or an array design project, etc., but granularity can be a little tough to judge sometimes. The directories are organized by Investigator/projectsponsor/projectid. Within a project directory one can expect to find files with specific names and purposes (by convention, like David's CREATE.R script - specific name, specific purpose).

I bring a lot of questions to a data set, so it's hard for me to have single scripts for everything. Rather, I engage in an endless repeating litany of question, method, answer. So I usually have a mixed document of text and pictures, and a companion document of code bits (chunks of R that can be executed for a result). For pipeline details, one-liners, etc., I keep a text file which is like a journal or notebook page documenting and explaining scripts, results, versions.

If multiple analysts work on the same project, they each create their own subdirectory with a defined name and their initials, and all the document conventions apply. Then I usually add a manifest document to explain the purpose of the sub-directories.

With defined conventions, and the filesystem, you have the equivalent of an electronic notebook (which can also be mostly platform independent). You just have to decide on good conventions, and then adhere to them.
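The directory conventions described above could be bootstrapped with something like the following Python sketch. The Investigator/projectsponsor/projectid layout comes from the answer; the file names `JOURNAL.txt` and `MANIFEST.txt`, the `analysis_<initials>` subdirectory naming, and the function itself are hypothetical choices made only for illustration.

```python
from datetime import date
from pathlib import Path

# Hypothetical names: the answer only says that names are fixed by convention.
JOURNAL_NAME = "JOURNAL.txt"
MANIFEST_NAME = "MANIFEST.txt"


def create_project(root, investigator, sponsor, project_id, analysts=()):
    """Create Investigator/projectsponsor/projectid with its conventional files.

    Each analyst (given by initials) gets a subdirectory with its own journal.
    Existing files are never overwritten, so the call is safe to repeat.
    """
    project_dir = Path(root) / investigator / sponsor / project_id
    project_dir.mkdir(parents=True, exist_ok=True)

    # With no analysts listed, the project directory itself holds the journal.
    work_dirs = [project_dir / f"analysis_{initials}" for initials in analysts]
    if not work_dirs:
        work_dirs = [project_dir]

    for work_dir in work_dirs:
        work_dir.mkdir(exist_ok=True)
        journal = work_dir / JOURNAL_NAME
        if not journal.exists():
            journal.write_text(
                f"Journal for {project_id}, started {date.today().isoformat()}\n"
            )

    # The manifest lists each subdirectory so its purpose can be filled in.
    manifest = project_dir / MANIFEST_NAME
    if not manifest.exists():
        lines = [f"{d.name}: (describe purpose)" for d in work_dirs if d != project_dir]
        manifest.write_text("\n".join(lines) + "\n")
    return project_dir
```

A call such as `create_project("/data", "smith", "nih", "P001", analysts=("ab", "cd"))` would set up two analyst subdirectories; all of those argument values are made up.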

score 3 · Answer 4 · 2011-05-21

12.9 years ago

Casey Bergman 18k

In principle, this problem can be solved using the history functionality of the Galaxy framework. For tasks that the main Galaxy site can handle, you can simply perform your analysis pipeline there; the history allows you to record and share your entire process, and to reproduce it by exporting it as a workflow. Alternatively, you can install and customize a local version of Galaxy that includes your own applications but leverages the same history/workflow functionality for reproducibility.

I have edited the question. This does seem like a possibility. It looks like you can do things like BWA on Galaxy. If Galaxy doesn't have a feature you want (VCFTools, plink/seq), I guess you can add them as tools?

ADD REPLY • link 12.9 years ago by Pi ▴ 520

Yes, you can easily add any script or application that conforms to Galaxy data types, and I believe you can extend the supported data types as well, in a local installation. You can also take advantage of user-contributed tools in the Galaxy tool shed: http://community.g2.bx.psu.edu/ Getting new functions into the main Galaxy site is up to the Galaxy Team.

ADD REPLY • link 12.9 years ago by Casey Bergman 18k

Oh, that's very good then. When I did the tutorials, I remember that workflows could be shared between users. Can datasets be shared too? It would be perfect to have workflows with named datasets showing an explicit history of how they were created. They have added a lot more to Galaxy since I last looked at it.

ADD REPLY • link 12.9 years ago by Pi ▴ 520

score 2 · Answer 5 · 2011-07-27

I have found that makefiles (yes, the ones you use for compiling software) are a powerful way to automate and document what you do at the same time. To my mind, the basic syntax of makefiles fits very well with what you would like to capture for documentation purposes:

the_file_you_made: an_input_file another_input_file
        some_command --important-option $< > $@

In other words, you write down which files you make, from which other files, by running which commands. Best of all, you write it down in a computer-readable form, which allows you to easily rerun everything that needs rerunning if you discover a mistake in some step.
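To make "everything that needs rerunning" concrete: make treats a target as out of date when it does not exist or when it is older than any of its prerequisites. The Python sketch below mimics only that timestamp comparison. The function name is made up for illustration, and real make does considerably more (chained dependencies, phony targets, and so on).

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    This mirrors make's basic rule only; it does not follow chains of
    dependencies the way make does.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

Applied to the rule above, `needs_rebuild("the_file_you_made", ["an_input_file", "another_input_file"])` is true right after either input file has been modified.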

Better yet, the wildcard options allow you to automate all the bread-and-butter bioinformatics tasks:

# Convert Genbank file to FASTA format.
%.fasta: %.gbk
        gbk2fasta $< > $@
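In a pattern rule like this one, `%` stands for any non-empty stem shared by the target and its prerequisite, so a request for `genome.fasta` leads make to look for `genome.gbk`. The following Python sketch shows only that stem substitution; the function name is hypothetical, and make's real matching has extra rules (for example, for directory prefixes) that are not modelled here.

```python
def prerequisite_for(target, target_pattern="%.fasta", prerequisite_pattern="%.gbk"):
    """Return the prerequisite name implied by a pattern rule, or None.

    `%` must match a non-empty stem, which is then substituted into the
    prerequisite pattern.
    """
    prefix, suffix = target_pattern.split("%", 1)
    stem_length = len(target) - len(prefix) - len(suffix)
    if stem_length < 1:
        return None
    if not target.startswith(prefix) or not target.endswith(suffix):
        return None
    stem = target[len(prefix):len(prefix) + stem_length]
    return prerequisite_pattern.replace("%", stem, 1)
```

This is also why such rules reward consistent file naming: a file whose name does not end in `.fasta` simply never matches the rule.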

I find that the latter way of using makefiles has several desirable effects. Firstly, it gives people a motivation to actually use makefiles (and, as a nice side effect, document what they did). Secondly, such rules rely heavily on how you name files, which encourages people to standardize how they name files.

I have used this on many projects over the years, starting back when I did my M.Sc. thesis. A large part of the project was to analyze the S. cerevisiae genome (which was still quite new at the time). Imagine downloading a new, improved assembly and annotation of the genome in Genbank format, typing make, and having every analysis redone based on your "lab notebook". That was what I did just one month before handing in my report :-)

score 0 · Answer 6 · 2011-05-22

12.9 years ago

Nir London ▴ 220

Don't you just >& log and email it?
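Reading `>& log` as the shell redirection that sends both standard output and standard error to a file named `log`, roughly the same effect can be had from a script. The Python sketch below additionally writes the command line itself into the log, which a bare redirect does not capture. The function name and the `$ ` prefix are assumptions for this example, and the emailing step is left out.

```python
import shlex
import subprocess


def run_with_log(command, log_path):
    """Run `command` (a list of strings), appending its command line and its
    combined stdout/stderr to `log_path`. Returns the exit code."""
    with open(log_path, "a") as log:
        # Record what was run before recording what it printed.
        log.write("$ " + shlex.join(command) + "\n")
        log.flush()  # make sure the header lands before the child's output
        result = subprocess.run(command, stdout=log, stderr=subprocess.STDOUT)
    return result.returncode
```

Because the file is opened in append mode, repeated calls build up a running log of every step rather than overwriting the previous one.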
