Question

Couchdb (Or Similar) For Annotations, Alignments, Etc.

11

Entering edit mode

12.3 years ago

Marvin ▴ 900

I'd like to store genomic annotations (think bed or bam files, lots of bam files in fact) in some sort of database, preferably something that is distributed and facilitates parallel processing (something like MapReduce). CouchDB, MongoDB, Cassandra seem to fit the bill more or less.

Such a scheme only makes sense if it's possible to run queries like "give me all annotations overlapping this region", or something like 'samtools view'. Which database supports this kind of interval index or where would it be possible to retrofit one?

To clarify: I'm aware of the UCSC binning scheme, which looks like an ugly hack to me. The BAM format adds another hack to that, apparently because the binning scheme underperforms for bigger data sets. What I'm actually looking for is something like the priority search queue described here: http://www.soi.city.ac.uk/~ross/papers/FingerTree.html (It works for every search tree, not just finger trees.)

database annotation • 4.6k views

ADD COMMENT • link updated 12.3 years ago by eduardo.pareja.tobes ▴ 60 • written 12.3 years ago by Marvin ▴ 900

0

Entering edit mode

Well, IMO the UCSC binning is a very clever hack for a database engine that does not support the R-tree index. You do not have a much better way to do that with MySQL. Also, correct me if I am wrong: finger tree seems an alternative to AVL and red-black tree. If this is the case, it is not efficient for interval queries. K-d tree and R-tree are the appropriate data structures for interval queries. UCSC binning is essentially an R-tree.

ADD REPLY • link 12.3 years ago by lh3 33k

0

Entering edit mode

About MongoDB. I heard that MongoDB is lightening fast when the data fits memory, but otherwise it becomes quite slow. A Google result: http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 12.3 years ago by lh3 33k

0

Entering edit mode

The point isn't really about Finger Trees, any balancing scheme works. The point is about the annotation, which might actually be equivalent to the (1-dimensional) R-tree.

ADD REPLY • link 12.3 years ago by Marvin ▴ 900

0

Entering edit mode

I'll add my +1 to MongoDB.

ADD REPLY • link 12.3 years ago by Caddymob ▴ 1000

0

Entering edit mode

No, it is wrong that "any balancing scheme works". Interval query is a special type of query which makes many tree structures inefficient. You cannot use a B-tree, AVL or red-black tree to achieve the efficiency of the binning index. This is exactly why UCSC came up with this binning index when B-tree is widely used in all RDBMS. If your point is not about finger tree, then remove it to avoid confusion.

ADD REPLY • link 12.0 years ago by lh3 33k

0

Entering edit mode

Hinze's finger tree is equivalent to an annotated (2,3)-B-tree, which is again equivalent to ab annotated red-black-tree. I'm asking if any database can annotate its B-tree index.

ADD REPLY • link 12.3 years ago by Marvin ▴ 900

1

Entering edit mode

You see not understand the problem. You keep mentioning annotation, but what you really want to solve is an interval query problem. B-tree, together with finger, red-black and AVL trees, is a wrong data structure for interval queries. No matter how much you can annotate these kind of trees, you cannot achieve efficient interval query. UCSC binning is the only method I know that works efficiently on B-tree.

ADD REPLY • link 12.0 years ago by lh3 33k

2

Entering edit mode

What the (marvelous) Hinze-Paterson paper says (section 4.8) is that you can implement (1-dim) interval trees via finger trees, just by choosing the right monoid; you get O(log(n)) for finding some overlapping interval, and getting all m of them in O(m log(m/n)). Not an expert in this area, but AFAIK this is as efficient as you can get.

In general, finger trees are best thought of as a "machine" producing data structures, given a monoid, not as a single data structure. See this blog post for example Apfelmus - monoids and finger trees

ADD REPLY • link 12.0 years ago by eduardo.pareja.tobes ▴ 60

0

Entering edit mode

It seems that I was wrong about finger tree. Probably I was just making a fuss over the sentence "[the UCSC binning scheme] looks like an ugly hack". Without the last paragraph, which is certainly inaccurate, this is actually a pretty good question.

ADD REPLY • link 12.0 years ago by lh3 33k

2

Entering edit mode

Well, actually comparing interval trees as implemented with finger trees (let's call this finger interval trees) and traditional interval trees it's kind of an apples to oranges comparison: finger interval trees (as any finger tree data structure) are persistent data structures. From a performance standpoint, query times for finger interval trees are amortized, require a language with support for lazy evaluation, and are a bit worse than that of traditional interval trees; you get however the benefits of a persistent data structure.

ADD REPLY • link 12.0 years ago by eduardo.pareja.tobes ▴ 60

0

Entering edit mode

Thanks for the info. Now I understand why few implemented fingertree - because without lazy evaluation, implementing a fingertree is challenging. I wish someone could compared the performance of fingertree to other traditional data structures (better theoretical complexity may not always guarantee practical efficiency), but I could not find it.

ADD REPLY • link 12.0 years ago by lh3 33k

score 10 · Answer 1 · 2012-01-05

You can also store such files with sqlite using R-Trees.

I was actually doing that for a while -- series of tables that would hold ts-start and ts-end coords for "genes" might look something like this:

CREATE TABLE gene(
  id INTEGER PRIMARY KEY, -- ID to link with the RTree Index
  entrez_id VARCHAR,      -- entrez id
  chromosome_id INTEGER,  -- Fkey to chromosome.id
  strand INTEGER,         -- 1 : plus, 2 : minus
  FOREIGN KEY (chromosome_id) REFERENCES chromosome(id)
);
CREATE UNIQUE INDEX geneId on gene(entrez_id);
CREATE INDEX geneChromosome on gene(chromosome_id);
CREATE INDEX geneStrand on gene(strand);

-- For quick retrievel of genes that lie between pts x and y on a chromosome
-- build a query around this RTree
CREATE VIRTUAL TABLE gene_tree USING rtree_i32(
  id,                     -- primary & foreign key to gene.id
  start,                  -- transcription start
  end,                    -- transcription end
  chr1,                   -- bound on the chromosome (fkey to chromosome.id)
  chr2                    -- same
);

You can do overlap queries on your rtree tables by querying start/stop and chromosome then join them by their id to the appropriate table for more meta information about that "gene". I reckon you can add another "strand" dimension to the rtree if you want do the entire query in one shot.

This being sqlite and all, it doesn't really qualify for your "distributed" requirement, but there are other databases, like PostgreSQL, that support rtree's, too.

Ram · Answer 2 · 2012-05-23

If you want to use a db, I'd use neo4j, in particular neo4j-spatial; you have R-trees and all that, and the graph db model is well-suited for this kind of thing (shameless plug: check bio4j).

If what you want is a data structure for stabbing queries (give me all of the intervals containing this point/interval x), as all your values are from a discrete range, there are data structures with O(k) query times where k is the number of interval containing x. The important thing here is that this is independent of the number of non-matching intervals. This could make a huge difference in a setting like yours. See Interval Stabbing Problems in Small Integer Ranges

score 1 · Answer 3 · 2012-01-05

1

Entering edit mode

12.3 years ago

Pierre Lindenbaum 161k

"give me all annotations overlapping this region"

a RDBMS is enough: the UCSC database uses a "bin indexing system" to quickly retrieve this kind of information: http://genomewiki.ucsc.edu/index.php/Bin_indexing_system . Of course, you could use this strategy to store your data in any NOSQL system.

ADD COMMENT • link 12.0 years ago by Pierre Lindenbaum 161k

score 1 · Answer 4 · 2012-01-05

1

Entering edit mode

12.3 years ago

Istvan Albert 100k

File based interval databases/datastructures are strangely uncommon, two that I heard of are both python based:

the Interval Index class in bx python
and various classes in pygr

ADD COMMENT • link 12.3 years ago by Istvan Albert 100k

3

Entering edit mode

Broad's Tribble library has Java IntervalTree implementations: http://code.google.com/p/tribble/ http://code.google.com/p/tribble/source/browse/trunk/src/org/broad/tribble/index/interval/IntervalTree.java

ADD REPLY • link 12.3 years ago by Brad Chapman 9.7k

0

Entering edit mode

can you explain how this works in conjunction with a database?

ADD REPLY • link 12.0 years ago by Jeremy Leipzig 22k

Istvan Albert · Answer 5 · 2012-05-13

1

Entering edit mode

12.0 years ago

Kim Rutherford ▴ 10

For this sort of problem, if you're willing to use PostgreSQL, the BIOSEG type might help.

From the introduction: "BIOSEG is an implementation of a biological sequence interval/range type for PostgreSQL. A PostgreSQL GiST index is used to improve the speed of overlap, contains and contained-in queries"

Some examples of usage: http://www.bioinformatics.org/bioseg/wiki/Main/BiosegUsage

The code is quite stable and is used in production by the InterMine project.

(Disclaimer: I maintain BIOSEG)

ADD COMMENT • link 12.0 years ago by Kim Rutherford ▴ 10

0

Entering edit mode

It looks like PostgreSQL v9.2 will have a standard range type, which I'm sure will be more stable and better optimised than BIOSEG: http://www.postgresql.org/docs/devel/static/rangetypes.html

ADD REPLY • link updated 12.0 years ago by Istvan Albert 100k • written 12.0 years ago by Kim Rutherford ▴ 10