Question: MongoDB: What's the most efficient way to store a genomic position
 
 
 

I want to store some genomic positions using MongoDB.

Something like:

{
chrom:"chr2",
position:100,
name:"rs25"
}

I want to be able to quickly find all the records in a given segment. What would be the best key/_id to use?

A {chrom, position} object?

db.snps.save({_id:{chrom:"chr2",position:100},name:"rs25"})

A padded string?

db.snps.save({_id:"chr02:00000000100",chrom:"chr2",position:100,name:"rs25"})

An auto-generated _id with an index on chrom and position?

db.snps.save({chrom:"chr2",position:100,name:"rs25"})

Something else?

???

Thanks for your suggestions.

Pierre

PS: I cross-posted this question on Stack Overflow: http://stackoverflow.com/questions/3740112


I posted a benchmark on my blog: http://plindenbaum.blogspot.com/2010/09/indexing-some-genomic-positions-with.html

written 2.7 years ago by Pierre Lindenbaum

1 answer

 
 
 
 

If you're going to use MongoDB for "spatial" queries, have a look at its geospatial indexing. It uses a geohash for 2d indexes, but you can likely shoehorn your 1-d data into it. You'd then be able to take advantage of its spatial queries, such as nearest-neighbor and within-bounds.
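A minimal sketch of that shoehorning, assuming the position is used as the x coordinate and a chromosome number as the y coordinate. The helper names (chromToY, snpDocument, segmentBoxQuery), the X/Y/M numbering and the index bounds are assumptions made for illustration, and this has not been benchmarked. The shell calls in the comments use the legacy 2d index and $within/$box syntax.

```javascript
// Hypothetical mapping from a chromosome name to a numeric y coordinate.
function chromToY(chrom) {
  var name = chrom.replace(/^chr/, "");
  if (/^\d+$/.test(name)) return parseInt(name, 10);
  var special = { X: 23, Y: 24, M: 25 }; // assumed numbering
  if (!Object.prototype.hasOwnProperty.call(special, name)) {
    throw new Error("unknown chromosome: " + chrom);
  }
  return special[name];
}

// Build a record with an extra "loc" field holding the [x, y] point.
function snpDocument(chrom, position, name) {
  return {
    chrom: chrom,
    position: position,
    name: name,
    loc: [position, chromToY(chrom)]
  };
}

// Build a box query covering [start, end] on a single chromosome.
// The box is half a unit tall on each side so it only spans one y value.
function segmentBoxQuery(chrom, start, end) {
  var y = chromToY(chrom);
  return { loc: { $within: { $box: [[start, y - 0.5], [end, y + 0.5]] } } };
}

// In the mongo shell (not run here). A 2d index defaults to longitude/latitude
// style bounds, so wider bounds are passed; 300000000 is an assumed upper limit.
//   db.snps.ensureIndex({loc: "2d"}, {min: 0, max: 300000000})
//   db.snps.save(snpDocument("chr2", 100, "rs25"))
//   db.snps.find(segmentBoxQuery("chr2", 50, 150))
```

The original chrom and position fields are kept alongside loc, so the same collection can still be queried the ordinary way.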

Another option is to hash your 1-d intervals yourself, as you do with the padded string. Intuitively, that should have the best locality in the B-tree. With the options you list above, I suspect you'd have to run a benchmark to see whether there are any noticeable differences.
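For reference, here is a rough sketch of how the segment query might look for two of the options in the question: the padded-string _id and the plain chrom/position fields with a compound index. The function names and pad widths are assumptions chosen to match the "chr02:00000000100" layout; non-numeric chromosome names are left unpadded.

```javascript
// Left-pad a value with zeros to a fixed width.
function pad(value, width) {
  var s = String(value);
  while (s.length < width) s = "0" + s;
  return s;
}

// "chr2", 100 -> "chr02:00000000100". Fixed-width digits make string order
// agree with numeric order for positions on the same chromosome.
function paddedKey(chrom, position) {
  var name = chrom.replace(/^chr/, "");
  var label = /^\d+$/.test(name) ? pad(name, 2) : name;
  return "chr" + label + ":" + pad(position, 11);
}

// Range query on the padded _id.
function segmentQueryById(chrom, start, end) {
  return { _id: { $gte: paddedKey(chrom, start), $lte: paddedKey(chrom, end) } };
}

// Range query on the separate fields, meant to be served by a compound index.
function segmentQueryByFields(chrom, start, end) {
  return { chrom: chrom, position: { $gte: start, $lte: end } };
}

// In the mongo shell (not run here):
//   db.snps.find(segmentQueryById("chr2", 50, 150))
//   db.snps.ensureIndex({chrom: 1, position: 1})
//   db.snps.find(segmentQueryByFields("chr2", 50, 150))
```

Both queries are plain B-tree range scans, so timing them on real data is the only way to tell them apart.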

Some time ago I wrote biohash/interval-hash, which is meant to work on 1-d intervals the way geohash does on 2-d points. It's not fully thought out, but it could be a decent starting point.
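To illustrate the general idea of a geohash-like key for 1-d intervals, here is a hypothetical sketch. It is not the biohash/interval-hash code itself, and the function name and the half-open [start, end) convention are assumptions. It repeatedly halves the coordinate space and emits one bit for each halving in which the whole interval still fits on one side.

```javascript
// Hypothetical 1-d analogue of a geohash for the half-open interval [start, end).
// The coordinate space is [0, 2^maxBits). Returns a string of "0"/"1" bits.
function intervalPrefix(start, end, maxBits) {
  var lo = 0;
  var hi = Math.pow(2, maxBits);
  if (start < lo || end > hi || start >= end) {
    throw new Error("interval out of range or empty");
  }
  var bits = "";
  for (var i = 0; i < maxBits; i++) {
    var mid = lo + (hi - lo) / 2;
    if (end <= mid) {          // interval lies entirely in the lower half
      bits += "0";
      hi = mid;
    } else if (start >= mid) { // interval lies entirely in the upper half
      bits += "1";
      lo = mid;
    } else {
      break;                   // interval straddles the midpoint: stop here
    }
  }
  return bits;
}

// Example with a small space [0, 8):
//   intervalPrefix(4, 6, 3)  fits in the cell [4, 6)
//   intervalPrefix(4, 5, 3)  fits in the smaller cell [4, 5)
//   intervalPrefix(3, 5, 3)  straddles the first midpoint
```

Because the cells are nested, two intervals can only overlap if the key of one is a prefix of the other. A lookup would therefore fetch keys that are prefixes of the query's key, plus keys that start with it, and then filter by exact coordinates.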

 

Thanks, I'm going to validate this interesting answer.

written 2.7 years ago by Pierre Lindenbaum
 