Github Workflow For Bioinformatics Analysis
5
9
11.1 years ago
Nick Loman ▴ 610

Let's say I have a Github repository 'bioinformatics_template' where I put extremely commonly used software, data and scripts. Some of the software is commonly used stuff like BWA and samtools. The scripts will mainly be my own stuff.

When I start a new analysis, I want to make a clone/fork/whatever of my repository. I then would like this to be a new repository (sometimes on GitHub, sometimes just stored locally). From time to time I may add new scripts or software that I would like to make it back to the original bioinformatics_template repository. Ideally I would also like the other projects to be able to easily update their copy of the basic template.

What is the correct workflow using Git? I am specifically confused as to the best way to employ Git features such as branches, tags, fork/pull models or a shared repository model. It feels a bit odd to be sending myself 'pull requests', but I also do not want a single shared repository.

A secondary question: I'd quite like to be able to specify that the repository sources things like BWA from their canonical repositories, ideally specifying the version used to help with reproducibility. What is the best way to achieve this?

• 7.8k views
ADD COMMENT
0

I would add a git tag too because the question is relevant for generic (non-Github) workflows. (I'm not authorized to edit questions.)

ADD REPLY
11
11.1 years ago
Gingi ▴ 330

For the first question, you can maintain a basic template repository with a certain directory structure and all the associated software dependencies. You can then use a GitHub fork to duplicate it as another repository, or you can just set the template repository as an additional upstream remote. If you treat it as a GitHub fork you don't have to communicate with yourself through pull requests (although that can be very helpful for logging); just pull from the upstream repo. You can do this from a new repository or retrofit existing repos with the template:

git remote add template https://github.com/me/basic-template
git pull template master

To make the template repo effectively read-only (to avoid accidentally pushing non-template code), set its push URL to a dummy value:

git remote set-url --push template no_push

There are other ways to do this, and possibly a better way than this to manage updates conveniently.
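As a rough illustration of the remote-based approach above, here is a minimal sketch that runs entirely in a scratch directory. The local `template` repository stands in for one hosted on GitHub, and the directory names, file names and the `no_push` placeholder are assumptions made for the example, not anything prescribed by Git.

```shell
set -eu
work=$(mktemp -d)   # scratch area; every path below is hypothetical
cd "$work"

# Stand-in for the template repository (normally hosted on GitHub).
git init -q template
cd template
git config user.email "you@example.com"
git config user.name "You"
mkdir scripts
echo 'echo "hello from template"' > scripts/hello.sh
git add scripts/hello.sh
git commit -q -m "Add template script"
# The default branch may be master or main depending on local Git config.
template_branch=$(git symbolic-ref --short HEAD)
cd ..

# A new analysis project that inherits from the template.
git init -q project
cd project
git remote add template "$work/template"
git pull -q template "$template_branch"

# Guard against accidental pushes to the template with a dummy push URL.
git remote set-url --push template no_push

# Later, the template gains a new script...
(
  cd "$work/template"
  echo 'echo "run QC"' > scripts/qc.sh
  git add scripts/qc.sh
  git commit -q -m "Add QC script"
)

# ...and the project picks it up with a fast-forward pull.
git pull -q --ff-only template "$template_branch"
ls scripts
```

Once the project has commits of its own, the pull becomes a merge (or rebase) rather than a fast-forward, and retrofitting an existing repository with unrelated history would likely need `--allow-unrelated-histories` on the first pull.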


For the second question, I would include the other dependencies as Git submodules (if a Git repo is available). My project directory would look something like:

/external
/scripts
/bin
/lib
/include
/src

For example, BWA can be sourced as:

git submodule add https://github.com/lh3/bwa external/bwa

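Git submodules are pinned to a specific commit, and can be updated when needed (e.g., git submodule foreach git pull, then commit the updated submodule pointers in the parent repository; git submodule sync is only needed if a submodule's URL has changed).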

In each of these modules you would install the libraries and binaries with --prefix=$PROJ_DIR (where $PROJ_DIR is the root of your Git repo), so that their build artifacts go into the appropriate locations (/bin, /lib, /include). You can use /src and /scripts for homegrown scripts.
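For the reproducibility side, a submodule can be pinned to a tagged release rather than to whatever the branch tip happens to be. The sketch below assumes a local stand-in repository called `tool` with a hypothetical tag `v1.0`; with a real dependency you would pass its URL (for example the BWA URL above) and one of its actual release tags instead.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names below are hypothetical
cd "$work"

# Local stand-in for an upstream tool repository such as BWA.
git init -q tool
cd tool
git config user.email "you@example.com"
git config user.name "You"
echo "v1" > VERSION
git add VERSION
git commit -q -m "Release 1"
git tag v1.0
echo "v2" > VERSION
git commit -q -am "Release 2"
cd ..

# The analysis project that vendors the tool under external/.
git init -q project
cd project
git config user.email "you@example.com"
git config user.name "You"

# protocol.file.allow is only needed because the stand-in is a local path;
# it is not required for an https:// URL.
git -c protocol.file.allow=always submodule --quiet add "$work/tool" external/tool

# Check out the tagged release inside the submodule, then record that
# exact commit in the parent repository.
git -C external/tool checkout -q v1.0
git add .gitmodules external/tool
git commit -q -m "Add tool submodule pinned to v1.0"

git submodule status
```

Anyone cloning the project and running `git submodule update --init` then gets the recorded commit, not the latest upstream state.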

Hope this helps.

ADD COMMENT
0

You can also attach a top-level Makefile which builds each of the dependent applications/submodules, and even include a target for updating their sources using the git submodule command. This way you only need to recompile when sources change rather than have to manually examine each submodule.

ADD REPLY
0

Yes, this helps a lot with the second part of my question. Thanks!

I guess I am still wondering about the best way to make my various projects inherit from the original bioinformatics_template repository, and about the best practice for incorporating improvements back into the template while keeping them separate from the analysis-specific material that will also be in the repository (but shouldn't make it back to bioinformatics_template).

ADD REPLY
0

Okay, I missed the essence of the primary question. I'm modifying my answer to accommodate.

ADD REPLY
0

OK, this is a great start, thanks again, but I suspect it isn't going to fulfil my requirements completely. I think I will have a play with the method you suggest and come back with some further clarifications/questions. One initial sticking point is that you can't fork your own repo within Github, although you can clone and rename the remote. Branches are another option but feel intuitively a bit wrong because then each project won't have its own named repo.

ADD REPLY
0

For duplicating your own repository, you can try this.

ADD REPLY
0

Are there any scripts that automate rebuilding the spawned child analyses every time the template changes, and then commit them so they get a new commit hash?

ADD REPLY
3
11.1 years ago
Shaun Jackman ▴ 420

This post doesn't answer your git workflow question, but I think it's relevant since you mention managing tools like bwa and samtools. I use Homebrew on OS X and Linuxbrew on Linux to manage installing and upgrading bioinformatics tools. A number of bioinformatics tools are packaged up in Homebrew-science. It makes installing tools and multiple versions of tools really easy (e.g. brew install samtools bwa) and doesn't require root access, so it can be used on servers and clusters.

ADD COMMENT
3
11.1 years ago

It should be noted that some dependency tools like pip understand git repositories, branches, and commit hashes, e.g. (in requirements.txt):

git+https://github.com/me/myrepo.git@932964b371efa6ef9bfbca2b2dfcccd6181c7764
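If you want to generate such a line rather than paste the hash by hand, something along these lines might do. It is only a sketch: the scratch repository, the `helper.py` file and the `https://github.com/me/myrepo.git` URL are placeholders, and the output goes to a `requirements.txt` in a temporary directory.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names below are hypothetical
cd "$work"

# Stand-in for the repository you want to depend on.
git init -q myrepo
cd myrepo
git config user.email "you@example.com"
git config user.name "You"
echo "print('analysis helper')" > helper.py
git add helper.py
git commit -q -m "Add helper"

# Placeholder for wherever the repository is actually published.
repo_url="https://github.com/me/myrepo.git"

# Resolve the current commit to its full hash and write the pinned entry.
commit=$(git rev-parse HEAD)
printf 'git+%s@%s\n' "$repo_url" "$commit" > "$work/requirements.txt"
cat "$work/requirements.txt"
```

Pinning to the full hash rather than a branch name means a later reinstall resolves to the same code even if the branch has moved on.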

Also worth a peek is git-annex for tracking binary dependencies.

I have been toying with the idea of writing a white paper on using git for binf data analysis but I think git still has a little ways to go on this front. I think the average analyst would find resolving some of the errors git throws regarding untracked files very frustrating.

ADD COMMENT
0

I think if a system that fulfilled most of my needs could be detailed it would be worth writing up as a blog post or even a journal article. Happy to help in that regard.

ADD REPLY
1
11.1 years ago

Here is a good stackoverflow post on cloning branches into a new repository: http://stackoverflow.com/questions/9527999/how-do-i-create-a-new-github-repo-from-a-branch-in-an-existing-repo

So you can keep one "base" repository containing general scripts. When you start a new project, you can clone a branch off of the base repo into a new repo.
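A minimal sketch of that idea, assuming a local `base` repository in a scratch directory and a hypothetical GitHub URL for the new project. It also renames the cloned remote, along the lines of the "clone and rename the remote" workaround mentioned in the comments above.

```shell
set -eu
work=$(mktemp -d)   # scratch area; all names below are hypothetical
cd "$work"

# Stand-in for the "base" repository of general scripts.
git init -q base
cd base
git config user.email "you@example.com"
git config user.name "You"
echo 'echo "shared helper"' > common.sh
git add common.sh
git commit -q -m "Add shared scripts"
# The default branch may be master or main depending on local Git config.
base_branch=$(git symbolic-ref --short HEAD)
cd ..

# Clone only the chosen branch into a new project repository.
git clone -q --single-branch --branch "$base_branch" "$work/base" new-project
cd new-project

# Keep the base repo reachable under a descriptive remote name...
git remote rename origin template

# ...and point origin at the project's own repository (placeholder URL;
# nothing is contacted until you fetch or push).
git remote add origin https://github.com/me/new-project.git

git remote -v
```

After this, `git pull template <branch>` brings in later changes from the base repo, while ordinary pushes go to the project's own `origin`.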

I am not sure if the standard directory structure for software development (src, ext, bin...) would apply to data analysis workflows though. I imagine I would really just store a bunch of scripts and perhaps links to external software.

ADD COMMENT
0
11.1 years ago
ugly.betty77 ★ 1.1k

This answer is no longer relevant to the updated question.

ADD COMMENT
0

Thanks for your comment. My question is really about how I can have a canonical 'template' repository for my basic workflow, then create clones/forks of that analysis in a separate repository, but then easily send improvements that I make over time back to the template, that again would ideally flow to my other projects that were cloned from the original repository. I mean, I know how to do it in a roundabout way, but am looking for best practices in terms of the Github featureset that will help me achieve that. Presumably then other people could benefit from this system, but that's not my primary goal.

ADD REPLY
0

Thanks. Your updated question makes more sense to me, but I am too new to this Biostar thing to figure out whether I should update my answer too :)

ADD REPLY
