Thanks, I'm going to accept this interesting answer.
I want to store some genomic positions using MongoDB.
Something like:
{
chrom:"chr2",
position:100,
name:"rs25"
}
I want to be able to quickly find all the records in a given segment. What would be the best key/_id to use?
A {chrom, position} object?
db.snps.save({_id:{chrom:"chr2",position:100},name:"rs25"})
A padded string?
db.snps.save({_id:"chr02:00000000100",chrom:"chr2",position:100,name:"rs25"})
An auto-generated _id with an index on chrom and position?
db.snps.save({chrom:"chr2",position:100,name:"rs25"})
Something else?
???
Thanks for your suggestion(s),
Pierre
PS: I cross-posted this question on stackoverflow http://stackoverflow.com/questions/3740112
If you're going to use MongoDB for "spatial" queries, have a look here. It uses a geohash for 2d indexes, but you can likely shoehorn your 1d data into it. You'd then be able to take advantage of its spatial queries, such as nearest-neighbor and within-bounds.
Another option is to hash your 1d intervals yourself, as you do with the padded string. Intuitively, that should have the best locality in the B-tree. With the options above, I suspect you'd have to run a benchmark to see whether there are any noticeable differences.
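To make the comparison concrete, here is a minimal sketch of how a segment query could be built for the padded-string _id and for the indexed chrom/position fields. The helper names (padKey, paddedIdRange, indexedRange) are hypothetical, and the pad widths (2 digits for the chromosome, 11 for the position) are assumptions taken from the example key "chr02:00000000100" in the question. It only builds the query documents; it is plain JavaScript and does not talk to a server.

```javascript
// Hypothetical helper: build a zero-padded key such as "chr02:00000000100".
// Pad widths (2 and 11) are assumptions based on the example in the question.
function padKey(chrom, position) {
  const name = chrom.replace(/^chr/, "");
  // Only numeric chromosome names are padded; "chrX", "chrM" are kept as-is.
  const chromPart = /^\d+$/.test(name) ? name.padStart(2, "0") : name;
  return "chr" + chromPart + ":" + String(position).padStart(11, "0");
}

// Query document for the padded-string _id: because the keys are fixed-width,
// string order matches numeric order within one chromosome.
function paddedIdRange(chrom, start, end) {
  return { _id: { $gte: padKey(chrom, start), $lte: padKey(chrom, end) } };
}

// Query document for the auto-generated _id plus an index on chrom and position.
function indexedRange(chrom, start, end) {
  return { chrom: chrom, position: { $gte: start, $lte: end } };
}
```

In the shell these would be used as db.snps.find(paddedIdRange("chr2", 50, 200)) or, after creating a compound index with db.snps.ensureIndex({chrom: 1, position: 1}) (createIndex in current shells), as db.snps.find(indexedRange("chr2", 50, 200)). Which one is faster on real data is exactly what a benchmark would have to show.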
Some time ago, I wrote biohash/interval-hash, which is meant to work on 1d intervals the way geohash does on 2d points. It's not fully thought out, but it could be a decent starting point.
I posted a benchmark on my blog: http://plindenbaum.blogspot.com/2010/09/indexing-some-genomic-positions-with.html