Hello all :)
There are many biology-related graph databases out there, which makes a lot of sense - biological problems are a good fit for graphs. Unfortunately, many of the graph databases I've come across are not particularly up to date (a PhD project long forgotten), suffer from seriously overcomplicated schemas, and/or are simply not downloadable (you can query their graph via the website, but you can't download the database and use it locally to issue complex queries).
I believe this combination of problems has led to graph databases seeing less use than they probably should as a generic format for biological data. It also doesn't help that there are few good in-browser visualisation tools for graphically querying graphs without learning Gremlin or Cypher syntax (SQL for graphs), which is what a lot of biologists and already-overloaded bioinformaticians would probably appreciate. So I see two possible ways forward:
- Rebuild popular SQL databases like UCSC's Table Browser into a MUCH simpler graph database. Allow users to download the full graph, or sub-graphs, with an easy query form.
- Create a generic CSV parser/importer for genomic data. It would roll out columns with multiple values per cell (exons, VCF fields), understand biological formats (chr:start-end coordinates), help you design a schema for the resulting graph, and finally output a graph database.
The former option would be simple but laborious. It would also need constant updating to keep it in sync with UCSC etc., and as soon as their SQL schema changes, everything breaks. :P
The latter would be more challenging to build and more complicated to use, but it provides the most flexibility and compatibility down the road.
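To make the second option concrete, here's a minimal sketch of its two core steps - the function names and the sample table are hypothetical, not an existing tool: splitting a multi-valued cell (like UCSC's comma-separated exonStarts) into one row per value, and parsing chr:start-end strings into proper fields.

```python
import csv
import io

def parse_region(s):
    """Turn a 'chr1:1000-2000' string into (chrom, start, end)."""
    chrom, span = s.split(":")
    start, end = span.split("-")
    return chrom, int(start), int(end)

def unroll(row, column, sep=","):
    """Yield one copy of the row per value in a multi-valued cell.
    UCSC-style lists often have a trailing separator, so strip it."""
    for value in row[column].rstrip(sep).split(sep):
        yield {**row, column: value}

# A made-up two-column example in UCSC's tab-separated style:
data = "name\tregion\texonStarts\ngeneA\tchr1:1000-2000\t1000,1500,\n"
for row in csv.DictReader(io.StringIO(data), delimiter="\t"):
    print(parse_region(row["region"]))       # ('chr1', 1000, 2000)
    for r in unroll(row, "exonStarts"):
        print(r["exonStarts"])               # 1000, then 1500
```

Each unrolled row would then become a node (or relationship) in the output graph, according to whatever schema the user designs.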
Before I set off down the road to building one or both of these, does anyone know if either option already exists? Is there an easier way I have overlooked?
Would you like to help me remake the UCSC database (or parts of it) as a graph database?
Thank you all in advance! :D
Graph databases are great for searching for relations between nodes. What's cool about UCSC is that you can quickly search for features in genomic regions (via the bin index). Think about this before you start writing your database. The table snp142 contains 115,877,267 records. Do you really want to put that in a graph?
100 million records? Easy! :D It used to be the case that Neo4j only handled around 10 million nodes, but the latest version maxes out at 32 billion nodes and relationships, and 64 billion key/value properties. Of course you'd need some serious hardware to make that a reality, but people do frequently go up into the billions with it: http://firstname.lastname@example.org/msg08183.html
But still - the whole thing hinges on a good question, and a good schema to answer that question. I'm not sure a big/slow database is even a bad thing. Looking for a simple pattern in a 115,877,267-node graph would take a long, long time - by Google standards. But I think our problems are more about asking one really good question, after having a very long think about it. For that, producing a bunch of different schemas for different kinds of question is probably a better use of time than optimising one schema to fit all queries. Here's a schema I made recently to handle the issue of overlapping regions:
Three kinds of node (labels in Neo4j terminology):
In this schema, even if it were a billion nodes and 3 billion relationships, you could pull out 'all signal in genes X and Y', modifying the query only slightly to switch between including and excluding overlapping regions. You could even plug it into a genome browser, if the browser supported it.
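The include/exclude-overlaps idea above can be sketched in a few lines. The labels and relationship names here (Gene, Region, signal, CONTAINS) are hypothetical stand-ins, since the actual three labels aren't listed above - the point is only the query shape: genes contain regions, regions carry signal, and a region contained by more than one gene is an "overlap" you can keep or drop.

```python
# Toy in-memory stand-in for the graph (assumed shape, not the real schema):
# Gene -[:CONTAINS]-> Region, with a signal value hanging off each Region.
contains = {
    "X": ["r1", "r2"],
    "Y": ["r2", "r3"],  # r2 overlaps genes X and Y
}
signal = {"r1": 5.0, "r2": 7.5, "r3": 2.0}

def signals_in_gene(gene, include_overlaps=True):
    """Signal values for a gene's regions; optionally drop regions
    that are also contained in another gene (the overlaps)."""
    out = []
    for region in contains[gene]:
        shared = sum(region in rs for rs in contains.values()) > 1
        if shared and not include_overlaps:
            continue
        out.append(signal[region])
    return out

print(signals_in_gene("X"))                          # [5.0, 7.5]
print(signals_in_gene("X", include_overlaps=False))  # [5.0]
```

In Cypher the toggle would be similarly small - roughly one extra WHERE clause filtering out regions with more than one incoming CONTAINS relationship - which is what makes the schema nice to query.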