There's a lot of talk in my twittersphere about this piece in MIT Technology Review: Rebooting the Human Genome. It talks about how the current reference genome concept misses so much of the human variation that we need to capture as we sequence more and more people's personal genomes.

But there's also some confusion. I don't think the concept of the graphs was really well described in there. In an earlier thread here we talked about it a little, but I wasn't able to find the talk I'd heard on this, which had been helpful to me. I did find a similar one, though, and maybe it will help people get the idea of the graphs, as opposed to the linear view we currently have of the reference genome.

You can watch the whole thing, of course. But the part about the graph ideas comes in around 52 minutes into the talk.

So the idea is that we have to be able to account for the "bubbles" that don't match a linear reference string. Some bubbles will be alterations, some insertions, some deletions, some inversions. They are all valid sequence, and we need to know and see this variation better. We can capture it with graph representations that go beyond our current tools.
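
To make the bubble idea a bit more concrete, here is a toy sketch in Python. It is my own illustration, not something taken from the talk or the papers, and the node ids and sequences are invented. Each node holds a short piece of sequence, the reference is just one path through the graph, and a bubble is a place where the paths split and later rejoin. It only covers a substitution and an insertion; inversions would need nodes that can be read in either orientation, which this sketch leaves out.

```python
from collections import defaultdict

# Toy sequence graph. All node ids and sequences are made up for illustration.
nodes = {
    1: "ACGT",
    2: "A",    # allele on the reference path
    3: "G",    # alternative to node 2 (a substitution bubble)
    4: "TTC",
    5: "GG",   # extra sequence carried by some genomes (an insertion bubble)
    6: "CA",
}

# Directed edges: which node may follow which.
edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]

adjacency = defaultdict(list)
for src, dst in edges:
    adjacency[src].append(dst)


def walk(node, sink):
    """Yield every path (list of node ids) from `node` to `sink`."""
    if node == sink:
        yield [node]
        return
    for nxt in adjacency[node]:
        for rest in walk(nxt, sink):
            yield [node] + rest


def spell(path):
    """Concatenate the node sequences along a path."""
    return "".join(nodes[n] for n in path)


# The linear reference is one path among several.
reference_path = [1, 2, 4, 6]

# Every source-to-sink path spells a sequence the graph can represent.
haplotypes = sorted(spell(p) for p in walk(1, 6))

# Nodes with more than one outgoing edge are where bubbles open.
branch_points = sorted(n for n, outs in adjacency.items() if len(outs) > 1)

print("reference:", spell(reference_path))
for h in haplotypes:
    print("haplotype:", h)
print("bubbles open at nodes:", branch_points)
```

The point of the sketch is that the reference string is only one walk through the structure, and the other walks are equally first-class sequences rather than deviations tacked on afterwards.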

Anyway, I'm posting because I think this is important to be aware of, and I think that even researchers in the field aren't that familiar with the ideas yet.

This paper also helped me understand the concepts, but unfortunately it is not open access: Building a pan-genome reference for a population. doi: 10.1089/cmb.2014.0146 http://www.ncbi.nlm.nih.gov/pubmed/25565268

If anyone else has good introductions to the representations of these variant graph concepts I'd like to see them.

Edit to add: this paper has some of Haussler's graphs too: http://arxiv.org/abs/1404.5010

The author of the MIT piece provided this slide set: https://docs.google.com/presentation/d/1utWF1_Er6bfAAwYWWRvDL-XI73uFC17WNi45t-TXveM/edit#slide=id.p That's helpful too.