Fast Method To Detect Whether Snp Lies In Sequence Feature
3
4
Entering edit mode
10.1 years ago
Pi ▴ 520

Greetings

I am currently using a relational database to store snps and genomic sequence features. I am trying to find out whether snps reside in a sequence feature and I am finding it very slow even with judicious application of indexes and batch updates in a transaction. For example this query (which is pseudocode and just an abstraction of what i am doing for convenience)

update snp, exon set snp.isexonic = true where snp pos >= exon.start and snp.pos <=exon.end


is extremely slow even with indexes on all of the start and end positions and the updates performed in a single commit so that indexes aren't updated until the end (although there aren't indexes on the updated field). I found it much faster to loop around every entry in the exon table and get the start and end position for each exon and then do an update

  for each exon
update snp set isexonic = true where snp.pos >= ? and snp.pos <=?


where ? are the placeholders for the positions of each exon in turn. But even this is painfully slow due to the volume of data. I was wondering if there were other techniques that I am not aware of for working with this type of data. I asked a perl programmer whether it would be quicker to extract the data into perl structures, perform the comparisons and then update the database but they suggested that was overkill and I should stick with the database.

Are there any database techniques that I could use to speed this up? Is there a column type that might help. I presume this must be a common problem in bioinformatics so I thought I would ask the esteemed members of this forum for guidance.

I've seen the binning scheme used at UCSC but do not know enough about it to know whether this is an option I could employ on my own local data or whether that too is overkill

thank you

database • 2.7k views
1
Entering edit mode

See page 1003 and Figure 7 of: Kent et al. The human genome browser at UCSC. Genome Res (2002) vol. 12 (6) pp. 996-1006 There's even an example query. In short, you would have to loop through all of the SNPs in each 128kb bin to detect overlaps. The idea is that while there may be many false positives in that bin, it will be far less than for the whole genome. All in all, databases are not well-suited to this problem.

0
Entering edit mode

hi. it was that figure that confused me because the smallest bin in the diagram is 1bp :) If the paper is just saying that once you have found your bin your have to loop through all the features in that bin (rather than the whole genome) then I understand the process. I thought I was missing something more profound than that.

0
Entering edit mode

i've removed the edit as it came out a bit big!

7
Entering edit mode
10.0 years ago

I think your best approach is to compute the necessary flags outside of the database with bedtools or some other tool/algorithm (like tabix) designed to deal with intervals (but not Perl or Python unless you have the interval specific code written in C). It is very easy to dump the data into GFF or BED, compute the overlaps and generate the sql necessary to update your database.

I think this will save you a lot of time and aggravation.

2
Entering edit mode

This is correct, a RDBMS is not the right tool to support inteval-overlapping. In addition we should mention (again) IRanges in Bioconductor, which is imho the most advanced and easy to use tool for this task.

0
Entering edit mode

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level (smallest) bin?

0
Entering edit mode

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say that the smallest bin/interval level would have to be 1bp. I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level. Are there guides for chosing the interval size because the ucsc paper says their smallest bin was 128kb. This seems quite large

0
Entering edit mode

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say that the smallest bin/interval level would have to be 1bp. I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level. Are there guides for chosing the interval size because the ucsc paper says their smallest bin was 128kb. This seems quite large and wouldn't let you find colocated snps would it?

0
Entering edit mode

Also, how do you recommend calculating the bins/intervals needed to query/retrieve data from the databse? Do i need to use tabix to calculate the intervals for retrieval? I'm also interested as to why you say don't use perl or python to deal with the intervals. Is it because it would be too slow? Would perl/python be ok to calculate 'one-off' intervals needed for a retrieval query or should they be avoided altogether

0
Entering edit mode

I think i might have confused the issue with the 'isexonic' flag in the question. That was an abstraction for this question. Most of the time I will not be setting flags like this (though sometimes i do) but querying my database to retrieve overlapping features such as colocated snps or snps in exons. In that case i need to give each feature a bin or interval when it is added to the database and then calcuate bins for queries. I think tabix and bedtools will calcuate the interval/bin for a feature but not work out the bin needed to query a database? If that is correct what would you recommend

0
Entering edit mode

I should also add I've never used R or Bioconductor so that woudl not be the first choice for me given my current skill set

3
Entering edit mode
10.1 years ago

Yes you can use a 'binning' algorithm to speed up this kind of query. This kind of algorithm is used in most tables of the ucsc. See:

Other suggestions:

• have you created some indexes in your database ?
• Can you first create a temporary table (no transaction) containing the new data and then later, update your whole database (transaction).
• (...)
0
Entering edit mode

how would i tell a mysql engine that the rows in the table are sorted. does that mean it can scan the index from the point it last looked rather than going from the beginning? have you ever used the binning scheme yourself? Thank you

0
Entering edit mode

how would i tell a mysql engine that the rows in the table are sorted. does that mean it can scan the index from the point it last looked rather than going from the beginning? Thank you

0
Entering edit mode

i've read your blog and seen the function that you create to find the appropriate bins to search. I've also read the ucsc paper but i dont understand how i would create the bins on my own database tables and assign a feature to a bin. I Don't supposed you have a blog about that :o)

0
Entering edit mode

do i just use the java function to assign the genomic feature to a bin and then store the bin in a database field bin?

0
Entering edit mode

i've seen your blog. great job! do i just use the java function to assign the genomic feature to a bin and then store the bin in a database field bin?

0
Entering edit mode

it is not standard, but I think there are some SQL engines optimizing their queries if they 'know' that the data are sorted.

0
Entering edit mode

Yes, you can use the java function or the original Kent's code.

0
Entering edit mode

have you tried in on a database in practice yourself. It seems straightforward (now you have written the function!!!)

0
Entering edit mode

looking at your blog again, did you do what the commentator suggested and compared the bin to a concatenated index on the start and end fields?

0
Entering edit mode

no, I didn't .

0
Entering edit mode

Kent has compared that in the 2002 paper. PostgreSQL can build R-tree index as I remember.

2
Entering edit mode
10.0 years ago

The question for fast interval overlapping has been asked several times here,e.g. : What Is The Quickest Algorithm For Range Overlap? A SNP is an interval of width 1. Why does it take so long with an RDBMS? Because you are using the naive algorithm which requires 2 * # SNPs * # of Genes comparisons, while with a more efficient algorithm, a lot of comparisons turn out to be redundant. I recommend to not try to attempt this in SQL but use an external implementation as mentioned in the previous question.