Question: Get upstream and downstream intergenic regions in R
1
3.6 years ago by
greentailedmouse50 wrote:

Hi all,

I would like to get the upstream and downstream intergenic distance between two genes in R (taking into account that some genes may overlap). For example - in the following case I would expect to get a the upstream distance of 5 for gene A, a distance 0 or negative for downstream of A, and a distance of 10 for downstream of B/upstream of C.

``````-----|____gene A____|-------------------------|____gene_C___|---     + strand
---------------|______gene_B_______|----------------------------     - strand
``````

I tried to use precede() and follow() in GRanges but these two functions completely exclude overlapping genes from the analysis. This sounds like a task everyone is doing and it should be part of a package but I can't find anything appropriate.

Thank you for any hints!

intergenic granges R • 1.5k views
modified 3.6 years ago by ddiez1.8k • written 3.6 years ago by greentailedmouse50

I think your problem might be similar to that of finding the distance between motifs asked in this question. My answer there could be a starting point.

OK on second thought I see you want to obtain a negative distance for overlapping genes (to tell the amount of overlapping I guess). I am not sure how can you get this... (might think about it later).

2
3.6 years ago by
ddiez1.8k
Japan
ddiez1.8k wrote:

Here is some attempt. I create an example ranges (IRanges for simplicity). You can used this modified version of `plotRanges()` function (from the IRangesOverview vignette in the IRanges package) to visualize the ranges and how they overlap.

The problem using follow() and precede() is that they ignore overlapping ranges. Therefore here instead compute the sequence-wide comparison of distances. Maybe this needs to be modified to be more efficient and only do this in the closest sequence. However, with overlapping sequences it is not so clear to me what closest means.

``````library(IRanges)
foo <- IRanges(start = c(1, 7, 10, 5, 13), end = c(4, 12, 15, 8, 14))

plotRanges <- function(x,
xlim = x,
main = deparse(substitute(x)),
col = "black",
sep = 0.5,
...)

{
height <- 1
if (is(xlim, "Ranges"))
xlim <-
c(min(start(xlim)), max(end(xlim)))
bins <-
disjointBins(IRanges(start(x), end(x) + 1))
plot.new()
plot.window(xlim, c(0, max(bins) * (height + sep)))
ybottom <-
bins * (sep + height) - height
rect(start(x), ybottom, end(x), ybottom + height, col = col, ...)
text(c(start(x) + width(x)/2) - .5, y = ybottom + height /2)
title(main)
r <- range(x)
axis(1, at = seq(start(r), end(r)))
}

plotRanges(foo, col = rainbow(length(foo)))
foo
IRanges object with 5 ranges and 0 metadata columns:
start       end     width
<integer> <integer> <integer>
[1]         1         4         4
[2]         7        12         6
[3]        10        15         6
[4]         5         8         4
[5]        13        14         2
``````

I compute the sequence-wise comparison as this:

``````# comparison matrix.
combmatrix <- t(combn(length(foo), 2))
combmatrix
colnames(combmatrix) <- c("i", "j")
i j
[1,] 1 2
[2,] 1 3
[3,] 1 4
[4,] 1 5
[5,] 2 3
[6,] 2 4
[7,] 2 5
[8,] 3 4
[9,] 3 5
``````

Then iterate over the comparison matrix and compute the distances:

``````hoo <- apply(combmatrix, 1, function(x) {
s1 <- foo[x[1], ]
s2 <- foo[x[2], ]
if (start(s1) < start(s2))
start(s2) - end(s1)
else
start(s1) - end(s2)
})

cbind(combmatrix, distance = hoo)
i j distance
[1,] 1 2        3
[2,] 1 3        6
[3,] 1 4        1
[4,] 1 5        9
[5,] 2 3       -2
[6,] 2 4       -1
[7,] 2 5        1
[8,] 3 4        2
[9,] 3 5       -2
[10,] 4 5        5
``````

Note that for the comparison between sequences 3 and 5, it is not clear to me what the right answer would be in this case. I guess this is the reason distances to overlapping ranges are zero in IRanges (and GRanges).