I am currently using a relational database to store snps and genomic sequence features. I am trying to find out whether snps reside in a sequence feature and I am finding it very slow even with judicious application of indexes and batch updates in a transaction. For example this query (which is pseudocode and just an abstraction of what i am doing for convenience)
update snp, exon set snp.isexonic = true where snp pos >= exon.start and snp.pos <=exon.end
is extremely slow even with indexes on all of the start and end positions and the updates performed in a single commit so that indexes aren't updated until the end (although there aren't indexes on the updated field). I found it much faster to loop around every entry in the exon table and get the start and end position for each exon and then do an update
for each exon update snp set isexonic = true where snp.pos >= ? and snp.pos <=?
where ? are the placeholders for the positions of each exon in turn. But even this is painfully slow due to the volume of data. I was wondering if there were other techniques that I am not aware of for working with this type of data. I asked a perl programmer whether it would be quicker to extract the data into perl structures, perform the comparisons and then update the database but they suggested that was overkill and I should stick with the database.
Are there any database techniques that I could use to speed this up? Is there a column type that might help. I presume this must be a common problem in bioinformatics so I thought I would ask the esteemed members of this forum for guidance.
I've seen the binning scheme used at UCSC but do not know enough about it to know whether this is an option I could employ on my own local data or whether that too is overkill