Question: Fast Method To Detect Whether Snp Lies In Sequence Feature
4
gravatar for Pi
9.8 years ago by
Pi520
Pi520 wrote:

Greetings

I am currently using a relational database to store snps and genomic sequence features. I am trying to find out whether snps reside in a sequence feature and I am finding it very slow even with judicious application of indexes and batch updates in a transaction. For example this query (which is pseudocode and just an abstraction of what i am doing for convenience)

update snp, exon set snp.isexonic = true where snp pos >= exon.start and snp.pos <=exon.end

is extremely slow even with indexes on all of the start and end positions and the updates performed in a single commit so that indexes aren't updated until the end (although there aren't indexes on the updated field). I found it much faster to loop around every entry in the exon table and get the start and end position for each exon and then do an update

  for each exon
          update snp set isexonic = true where snp.pos >= ? and snp.pos <=?

where ? are the placeholders for the positions of each exon in turn. But even this is painfully slow due to the volume of data. I was wondering if there were other techniques that I am not aware of for working with this type of data. I asked a perl programmer whether it would be quicker to extract the data into perl structures, perform the comparisons and then update the database but they suggested that was overkill and I should stick with the database.

Are there any database techniques that I could use to speed this up? Is there a column type that might help. I presume this must be a common problem in bioinformatics so I thought I would ask the esteemed members of this forum for guidance.

I've seen the binning scheme used at UCSC but do not know enough about it to know whether this is an option I could employ on my own local data or whether that too is overkill

thank you

database • 2.7k views
ADD COMMENTlink modified 9.8 years ago by Michael Dondrup48k • written 9.8 years ago by Pi520
1

See page 1003 and Figure 7 of: Kent et al. The human genome browser at UCSC. Genome Res (2002) vol. 12 (6) pp. 996-1006 There's even an example query. In short, you would have to loop through all of the SNPs in each 128kb bin to detect overlaps. The idea is that while there may be many false positives in that bin, it will be far less than for the whole genome. All in all, databases are not well-suited to this problem.

ADD REPLYlink written 9.8 years ago by Aaronquinlan11k

hi. it was that figure that confused me because the smallest bin in the diagram is 1bp :) If the paper is just saying that once you have found your bin your have to loop through all the features in that bin (rather than the whole genome) then I understand the process. I thought I was missing something more profound than that.

ADD REPLYlink written 9.8 years ago by Pi520

i've removed the edit as it came out a bit big!

ADD REPLYlink written 9.8 years ago by Pi520
7
gravatar for Istvan Albert
9.8 years ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

I think your best approach is to compute the necessary flags outside of the database with bedtools or some other tool/algorithm (like tabix) designed to deal with intervals (but not Perl or Python unless you have the interval specific code written in C). It is very easy to dump the data into GFF or BED, compute the overlaps and generate the sql necessary to update your database.

I think this will save you a lot of time and aggravation.

ADD COMMENTlink written 9.8 years ago by Istvan Albert ♦♦ 86k
2

This is correct, a RDBMS is not the right tool to support inteval-overlapping. In addition we should mention (again) IRanges in Bioconductor, which is imho the most advanced and easy to use tool for this task.

ADD REPLYlink written 9.8 years ago by Michael Dondrup48k

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level (smallest) bin?

ADD REPLYlink written 9.8 years ago by Pi520

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say that the smallest bin/interval level would have to be 1bp. I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level. Are there guides for chosing the interval size because the ucsc paper says their smallest bin was 128kb. This seems quite large

ADD REPLYlink written 9.8 years ago by Pi520

thanks a lot. I am just looking into this. If i wanted to look to see if I had any co-located snps, is it correct to say that the smallest bin/interval level would have to be 1bp. I could look to see if they had the same bin because a snp is only a single base pair and colocated snps would reside in the same bin which would be the lowest level. Are there guides for chosing the interval size because the ucsc paper says their smallest bin was 128kb. This seems quite large and wouldn't let you find colocated snps would it?

ADD REPLYlink written 9.8 years ago by Pi520

Also, how do you recommend calculating the bins/intervals needed to query/retrieve data from the databse? Do i need to use tabix to calculate the intervals for retrieval? I'm also interested as to why you say don't use perl or python to deal with the intervals. Is it because it would be too slow? Would perl/python be ok to calculate 'one-off' intervals needed for a retrieval query or should they be avoided altogether

ADD REPLYlink written 9.8 years ago by Pi520

I think i might have confused the issue with the 'isexonic' flag in the question. That was an abstraction for this question. Most of the time I will not be setting flags like this (though sometimes i do) but querying my database to retrieve overlapping features such as colocated snps or snps in exons. In that case i need to give each feature a bin or interval when it is added to the database and then calcuate bins for queries. I think tabix and bedtools will calcuate the interval/bin for a feature but not work out the bin needed to query a database? If that is correct what would you recommend

ADD REPLYlink written 9.8 years ago by Pi520

I should also add I've never used R or Bioconductor so that woudl not be the first choice for me given my current skill set

ADD REPLYlink written 9.8 years ago by Pi520
3
gravatar for Pierre Lindenbaum
9.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum134k wrote:

Yes you can use a 'binning' algorithm to speed up this kind of query. This kind of algorithm is used in most tables of the ucsc. See:

Other suggestions:

  • have you created some indexes in your database ?
  • Can you tell your SQL engine that your data are already sorted ?
  • Can you first create a temporary table (no transaction) containing the new data and then later, update your whole database (transaction).
  • (...)
ADD COMMENTlink modified 17 months ago by Ram32k • written 9.8 years ago by Pierre Lindenbaum134k

how would i tell a mysql engine that the rows in the table are sorted. does that mean it can scan the index from the point it last looked rather than going from the beginning? have you ever used the binning scheme yourself? Thank you

ADD REPLYlink written 9.8 years ago by Pi520

how would i tell a mysql engine that the rows in the table are sorted. does that mean it can scan the index from the point it last looked rather than going from the beginning? Thank you

ADD REPLYlink written 9.8 years ago by Pi520

i've read your blog and seen the function that you create to find the appropriate bins to search. I've also read the ucsc paper but i dont understand how i would create the bins on my own database tables and assign a feature to a bin. I Don't supposed you have a blog about that :o)

ADD REPLYlink written 9.8 years ago by Pi520

do i just use the java function to assign the genomic feature to a bin and then store the bin in a database field bin?

ADD REPLYlink written 9.8 years ago by Pi520

i've seen your blog. great job! do i just use the java function to assign the genomic feature to a bin and then store the bin in a database field bin?

ADD REPLYlink written 9.8 years ago by Pi520

it is not standard, but I think there are some SQL engines optimizing their queries if they 'know' that the data are sorted.

ADD REPLYlink written 9.8 years ago by Pierre Lindenbaum134k

Yes, you can use the java function or the original Kent's code.

ADD REPLYlink written 9.8 years ago by Pierre Lindenbaum134k

have you tried in on a database in practice yourself. It seems straightforward (now you have written the function!!!)

ADD REPLYlink written 9.8 years ago by Pi520

looking at your blog again, did you do what the commentator suggested and compared the bin to a concatenated index on the start and end fields?

ADD REPLYlink written 9.8 years ago by Pi520

no, I didn't .

ADD REPLYlink written 9.8 years ago by Pierre Lindenbaum134k

Kent has compared that in the 2002 paper. PostgreSQL can build R-tree index as I remember.

ADD REPLYlink written 9.8 years ago by lh332k
2
gravatar for Michael Dondrup
9.8 years ago by
Bergen, Norway
Michael Dondrup48k wrote:

The question for fast interval overlapping has been asked several times here,e.g. : What Is The Quickest Algorithm For Range Overlap? A SNP is an interval of width 1. Why does it take so long with an RDBMS? Because you are using the naive algorithm which requires 2 * # SNPs * # of Genes comparisons, while with a more efficient algorithm, a lot of comparisons turn out to be redundant. I recommend to not try to attempt this in SQL but use an external implementation as mentioned in the previous question.

ADD COMMENTlink modified 17 months ago by Ram32k • written 9.8 years ago by Michael Dondrup48k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 977 users visited in the last hour
_