Question: Find Nearest Gene Upstream Using Mysql And Perl Program
3
gravatar for anon111
5.9 years ago by
anon11140
anon11140 wrote:

There is a table of genes in mysql. I have a list of reg. elements, their start position on a chromosome, and their end position on a chromosome. How do you write a mysql command for perl program to search for the gene (from a table of genes downloaded in mysql) nearest each element. How do you make the perl program run faster since there are many elements for which you have to search for the nearest gene?

EDIT*

I am new to mysql and perl and have little programming background. Could you explain a little more? I must use a table of genes (that has over a 150K genes) that I downloaded into a mysql database. So i have a regulatory element and its start position, end position, and chromosome number. I have a table of genes. how do i match the gene closest upstream or downstream to this reg. element most easily? i will need to include a subroutine for finding the closest gene too. The thing I'm stuck on is how to find the closest gene.

• 3.4k views
ADD COMMENTlink modified 5.9 years ago by Istvan Albert ♦♦ 78k • written 5.9 years ago by anon11140

to the OP please don't delete this post as it contains a few outstanding contributions

ADD REPLYlink written 5.9 years ago by Istvan Albert ♦♦ 78k
6
gravatar for Alex Reynolds
5.9 years ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

If you want to consider an alternative approach:

  1. Export the gene table from MySQL to a UCSC-formatted BED file (you'll want at least four tab-delimited columns: chromosome, start, stop and gene name)
  2. Export your regulatory elements to another UCSC-formatted BED file (you'll want at least three columns: chromosome, start and stop)
  3. Both BED files should first be sorted, _e.g._ with BEDOPS sort-bed or alternatives:

    $ sort-bed rawRegulatoryElements.bed > regulatoryElements.bed

    $ sort-bed rawGenes.bed > genes.bed

    Sorting allows other BEDOPS tools (like the one I'm about to mention shortly) to operate and return results very quickly.

  4. Run BEDOPS closest-features on the regulatory elements and gene table BED files, _e.g._:

    $ closest-features --delim '\t' --shortest regulatoryElements.bed genes.bed > answer.bed

This command reports the nearest gene upstream or downstream to each regulatory element. Results are stored in a file called answer.bed, which is also BED-formatted.

ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Alex Reynolds27k

this is a nice one. thanks!

ADD REPLYlink written 5.9 years ago by Pavel Senin1.9k
4
gravatar for brentp
5.9 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

I spent a lot of time digging down that rabbit hole. If you want to follow, you can use something like this:

SELECT * from (
 (SELECT * FROM ((SELECT * FROM feature WHERE start<= ? ORDER BY start DESC  LIMIT 1) UNION (SELECT * FROM feature where start>= ?
ORDER BY start LIMIT 1)) as u)
 UNION
 (SELECT * FROM ((SELECT * FROM feature where stop<= ? ORDER BY stop   DESC  LIMIT 1) UNION (SELECT * FROM feature where stop>= ?
ORDER BY stop LIMIT 1)) as v)
) as w
ORDER BY ABS((start + stop)/2 - ?) LIMIT 1

To get the nearest neighbor using indexes (if your table has btree indexes on start and end). But, unless you have only a few intervals, it's simpler and faster to read the database features into some kind of an interval tree and query from there.

ADD COMMENTlink written 5.9 years ago by brentp22k
2
gravatar for Pavel Senin
5.9 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

You would need to "nail down" in the SQL query the coordinate and say that you want to get all the entities with a position above or below the specified one, then you need to LIMIT the set which SQL returns to 1 entity, something like:

select * from `gene` where `start` >10000 order by `start` limit 1;

If you need to see which gene is closer, upstream or downstream, - do it in Perl. Don't worry about performance in Perl, MySQL is pretty smart to figure out that stuff for you. But, I would recommend to index your table by the gene start/end positions.

p.s editing - forgot to put order by....

ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Pavel Senin1.9k
2
gravatar for Sean Davis
5.9 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

If you want to use mysql, consider using either spatial indexing or hack indexes using code such as that used by UCSC and tabix. See the last section of the SAM format document for details on creating and using a simple indexing scheme. In UCSC tables, you will often see a column with the name "bin"; this is another implementation of the same scheme.

ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Sean Davis25k

well, could you tell us, why not to use mysql?

ADD REPLYlink written 5.9 years ago by Pavel Senin1.9k
1

I should have been less directive. I'd suggest benchmarking the actual application using other approaches. Edited to reflect that change.

ADD REPLYlink written 5.9 years ago by Sean Davis25k

because, there are a ton of edge-cases. in your query below, what if you have 10 genes with start == 10000? did you remember to select by chromosome? what if you need to consider strand?

ADD REPLYlink written 5.9 years ago by brentp22k

I agree that it takes some time to go through hoops, like the one you say about chromosome ;), however SQL was specifically designed to retrieve subsets of data entities in meaningful way in a very fast manner, with a great degree of flexibility, moreover, it integrates with any modern language. I would not exclude SQL from workflow just because query gets long - it stays Structured.

ADD REPLYlink written 5.9 years ago by Pavel Senin1.9k

To answer your question, performance is likely to be an issue compared to the other toolsets I mentioned. However, performance is always a tricky thing to pin down which is why I suggest benchmarking.

ADD REPLYlink written 5.9 years ago by Sean Davis25k
1
gravatar for Sean Davis
5.9 years ago by
Sean Davis25k
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

Take a look at the ChiPpeakAnno Bioconductor package here: http://www.bioconductor.org/packages/release/bioc/html/ChIPpeakAnno.html. The second use case in the vignette is on finding the nearest gene to a feature.

ADD COMMENTlink written 5.9 years ago by Sean Davis25k
1
gravatar for brentp
5.9 years ago by
brentp22k
Salt Lake City, UT
brentp22k wrote:

Since Sean is breaking out the R package :). There's also a python package that can do this: https://pypi.python.org/pypi/cruzdb

Syntax would be something like:

hg19 = Genome('hg19')
for chrom, start, end in list_of_intervals:
    print hg19.knearest('refGene', chrom, start, end, k=2)

or if you want upstream only, replace "knearest" with "upstream".

ADD COMMENTlink written 5.9 years ago by brentp22k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 773 users visited in the last hour