Question

How to check if a gene is comprised in a certain interval? with R

1

Entering edit mode

9.6 years ago

Pietro ▴ 240

Hi all,

In R, I have a data frame with physical position of some genes in bp on the left and intervals (again in bp) on the right. I want to check for each gene, whether it falls in the the interval of the same chromosome.

GeneID       Chr       Position                Interval
x            1         18697251                1:3099740-8804450
y            1         19546617                1:3422930-8804450
z            1         3332236                 2:2751757-4502486
a            2         3993537                 2:6187995-8804450
b            2         9983384                 2:4864334-18271005
c            2         11479025                2:8222941-18271005

For example: Is gene x on chr 1 comprised in one of the 2 intervals displayed for chr 1? I expect TRUE or FALSE.

I managed to do this giving the gene positions one by one, but the list is very big, I'd like this to be automated.

Any help much appreciated!

gene R genome sequence • 3.1k views

ADD COMMENT • link updated 2.9 years ago by Ram 45k • written 9.6 years ago by Pietro ▴ 240

0

Entering edit mode

Thanks to both of you!

ADD REPLY • link 9.6 years ago by Pietro ▴ 240

Ram · Accepted Answer · 2015-11-18

2

Entering edit mode

9.6 years ago

Alex Reynolds 36k

If you use strsplit(), you could test this directly. For instance:

> v <- '1:3099740-8804450'
> unlist(strsplit(v, '[:-]'))
[1] "1"       "3099740" "8804450"

Then you can make a function out of it and print out the result of the desired tests via apply():

> t <- read.table("some.table", header=T)
> interval <- function(str) unlist(strsplit(str, '[:-]'))
> apply(t, 1, function(x) (x[2] == interval(x[4])[1]) && (as.numeric(x[3]) > as.numeric(interval(x[4])[2])) && (as.numeric(x[3]) < as.numeric(interval(x[4])[3])))
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE

We leave the chromosome name as a character vector to deal with cases like X and Y, and convert the position and interval position elements to numerics via as.numeric() to apply numerical relation operators.

The apply call can also be turned into a function:

> interval_test <- function(t) apply(t, 1, function(x) (x[2] == interval(x[4])[1]) && (as.numeric(x[3]) > as.numeric(interval(x[4])[2])) && (as.numeric(x[3]) < as.numeric(interval(x[4])[3])))

This function can be used to filter your table for rows that fall within the interval:

> t[interval_test(t),]
  GeneID Chr Position           Interval
5      b   2  9983384 2:4864334-18271005
6      c   2 11479025 2:8222941-18271005

Or, perhaps, used to filter for rows which do not fall within the interval, by using the ! operator:

> t[!interval_test(t),]
  GeneID Chr Position          Interval
1      x   1 18697251 1:3099740-8804450
2      y   1 19546617 1:3422930-8804450
3      z   1  3332236 2:2751757-4502486
4      a   2  3993537 2:6187995-8804450

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Alex Reynolds 36k

0

Entering edit mode

Much more elegant than my answer !

ADD REPLY • link 9.6 years ago by Carlo Yague 9.0k

1

Entering edit mode

In R, replace for loops with calls to apply and similar functions in the apply family, whenever possible. The code is cleaner and often runs faster.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Alex Reynolds 36k

0

Entering edit mode

I now that in theory, but in practice, the loops comes to me more naturally than the apply synthax. Shame on me ^^

PS : in your code, you assume that interval(x[4])[2]) < interval(x[4])[3]) which is not always the case in the OP example, perhaps for genes on the - strand (see the first gene x for instance). It depends of what you want to select of course, but in your case, every time that interval(x[4])[2]) > interval(x[4])[3]), the function will return false.

ADD REPLY • link 9.6 years ago by Carlo Yague 9.0k

0

Entering edit mode

Thanks Alex, but if i got your answer right, in this way I only test the corresponding interval on the right. Instead, in my data frame I have 3 genes on chr 1 each to be tested on 2 intervals, and 3 genes on chr 2 each to be tested on 4 intervals.

Please tell me if it's not clear.

Thanks again.

ADD REPLY • link 9.6 years ago by Pietro ▴ 240

0

Entering edit mode

Ah, I misunderstand your question. You could perhaps collate intervals in all rows to per-chromosome interval lists, then write another function to test each row's chromosome and position against all the intervals in the associated chromosome's list. I think there's enough detail in my answer to make a start on pre-processing the table into interval lists, but it might be easier to just export both your intervals and positions to separate BED files, and then use BEDOPS bedops --element-of to see where there are overlaps.

ADD REPLY • link 9.6 years ago by Alex Reynolds 36k

0

Entering edit mode

To summarize the data by gene and building on Alex answer, you could try:

t$interval_test = interval_test(t) # simply attach the T/F array to the dataframe
by( t$GeneID, t$interval_test, sum) # return the number of TRUE by gene.
by( t$GeneID, t$interval_test, function(x) sum(x)/length(x)) # return number of TRUE by gene divided by the number of entry in you dataframe by gene.

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 9.6 years ago by Carlo Yague 9.0k

Ram · Accepted Answer · 2015-11-18

I have a solution with a loop, although it is not very elegant (code not tested, you may need to debug it). I assume that all your data is as factors (this is why I use as.numeric(as.character()) for instance).

n=length(myData$GeneID)
answer=rep(NA,n)   # this will be an array of TRUE/FALSE
for (i in 1:n){
  temp = strsplit(as.character(myData$Interval[i]),":")[[1]]  # that gives you c("1","309974-880445")
  answer[i] = temp[1]==as.character(myData$Chr[i])  #first test chromosome
  if (answer[i]){    # no need to go further if the chromosome is wrong
    temp = strsplit(temp[2],"-")[[1]]   # that gives you c("309974","880445")
    pos = as.numeric(as.character(myData$Position[i])) # in case your data is as factor
   # Now lets test if position is between intervals.
   # I assume here that the interval can be in the "wrong" order and that its ok for you.
    answer[i] = (pos < temp[1] & pos > temp2) | (pos > temp[1] & pos < temp2)
  }
}

round(sum(answer)/length(answer)*100,2) # give you percentage of rightly positioned genes.
myData$GeneID[!answer]  # give you the names of the wrongly positionned genes.