Question: How to check if a gene is comprised in a certain interval? with R
1
gravatar for Pietro
3.7 years ago by
Pietro70
Italy
Pietro70 wrote:

Hi all,

In R, I have a data frame with physical position of some genes in bp on the left and intervals (again in bp) on the right. I want to check for each gene, whether it falls in the the interval of the same chromosome.

    GeneID     Chr     Position            Interval
    x     1    18697251           1:3099740-8804450
    y     1    19546617           1:3422930-8804450
    z     1    3332236           2:2751757-4502486
    a     2    3993537           2:6187995-8804450
    b     2    9983384           2:4864334-18271005
    c     2    11479025           2:8222941-18271005

For example: Is gene x on chr 1 comprised in one of the 2 intervals displayed for chr 1? I expect TRUE or FALSE.

I managed to do this giving the gene positions one by one, but the list is very big, I'd like this to be automated.

Any help much appreciated!

gene sequence R genome • 1.2k views
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Pietro70
2
gravatar for Alex Reynolds
3.7 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

If you use strsplit(), you could test this directly. For instance:

> v <- '1:3099740-8804450'
> unlist(strsplit(v, '[:-]'))
[1] "1"       "3099740" "8804450"

Then you can make a function out of it and print out the result of the desired tests via apply():

> t <- read.table("some.table", header=T)
> interval <- function(str) unlist(strsplit(str, '[:-]'))
> apply(t, 1, function(x) (x[2] == interval(x[4])[1]) && (as.numeric(x[3]) > as.numeric(interval(x[4])[2])) && (as.numeric(x[3]) < as.numeric(interval(x[4])[3])))
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE

We leave the chromosome name as a character vector to deal with cases like X and Y, and convert the position and interval position elements to numerics via as.numeric() to apply numerical relation operators.

The apply call can also be turned into a function:

> interval_test <- function(t) apply(t, 1, function(x) (x[2] == interval(x[4])[1]) && (as.numeric(x[3]) > as.numeric(interval(x[4])[2])) && (as.numeric(x[3]) < as.numeric(interval(x[4])[3])))

This function can be used to filter your table for rows that fall within the interval:

> t[interval_test(t),]
  GeneID Chr Position           Interval
5      b   2  9983384 2:4864334-18271005
6      c   2 11479025 2:8222941-18271005

Or, perhaps, used to filter for rows which do not fall within the interval, by using the ! operator:

> t[!interval_test(t),]
  GeneID Chr Position          Interval
1      x   1 18697251 1:3099740-8804450
2      y   1 19546617 1:3422930-8804450
3      z   1  3332236 2:2751757-4502486
4      a   2  3993537 2:6187995-8804450
ADD COMMENTlink modified 3.7 years ago • written 3.7 years ago by Alex Reynolds28k

Much more elegant than my answer !

ADD REPLYlink written 3.7 years ago by Carlo Yague4.6k
1

In R, replace for loops with calls to apply and similar functions in the apply family, whenever possible. The code is cleaner and often runs faster.

ADD REPLYlink written 3.7 years ago by Alex Reynolds28k

I now that in theory, but in practice, the loops comes to me more naturally than the apply synthax. Shame on me ^^

PS : in your code, you assume that interval(x[4])[2]) < interval(x[4])[3]) which is not always the case in the OP example, perhaps for genes on the - strand (see the first gene x for instance). It depends of what you want to select of course, but in your case, every time that interval(x[4])[2]) > interval(x[4])[3]), the function will return false.

ADD REPLYlink written 3.7 years ago by Carlo Yague4.6k

Thanks Alex, but if i got your answer right, in this way I only test the corresponding interval on the right. Instead, in my data frame I have 3 genes on chr 1 each to be tested on 2 intervals, and 3 genes on chr 2 each to be tested on 4 intervals.

Please tell me if it's not clear.

Thanks again.

ADD REPLYlink written 3.7 years ago by Pietro70

Ah, I misunderstand your question. You could perhaps collate intervals in all rows to per-chromosome interval lists, then write another function to test each row's chromosome and position against all the intervals in the associated chromosome's list. I think there's enough detail in my answer to make a start on pre-processing the table into interval lists, but it might be easier to just export both your intervals and positions to separate BED files, and then use BEDOPS bedops --element-of to see where there are overlaps.

ADD REPLYlink written 3.7 years ago by Alex Reynolds28k

To summarize the data by gene and buidling on Alex answer, you could try :

t$interval_test = interval_test(t) # simply attach the T/F array to the dataframe
by( t$GeneID, t$interval_test, sum) # return the number of TRUE by gene.
by( t$GeneID, t$interval_test, function(x) sum(x)/length(x)) # return number of TRUE by gene divided by the number of entry in you dataframe by gene.

 

ADD REPLYlink modified 3.7 years ago • written 3.7 years ago by Carlo Yague4.6k
1
gravatar for Carlo Yague
3.7 years ago by
Carlo Yague4.6k
Belgium
Carlo Yague4.6k wrote:

I have a solution with a loop, although it is not very elegant (code not tested, you may need to debug it). I assume that all your data is as factors (this is why I use as.numeric(as.character()) for instance).

 

n=length(myData$GeneID)
answer=rep(NA,n)   # this will be an array of TRUE/FALSE
for (i in 1:n){
  temp = strsplit(as.character(myData$Interval[i]),":")[[1]]  # that gives you c("1","309974-880445")
  answer[i] = temp[1]==as.character(myData$Chr[i])  #first test chromosome
  if (answer[i]){    # no need to go further if the chromosome is wrong
    temp = strsplit(temp[2],"-")[[1]]   # that gives you c("309974","880445")
    pos = as.numeric(as.character(myData$Position[i])) # in case your data is as factor
   # Now lets test if position is between intervals.
   # I assume here that the interval can be in the "wrong" order and that its ok for you.
    answer[i] = (pos < temp[1] & pos > temp2) | (pos > temp[1] & pos < temp2)
  }
}
round(sum(answer)/length(answer)*100,2) # give you percentage of rightly positionned genes.
myData$GeneID[!answer]  # give you the names of the wrongly positionned genes.
   
ADD COMMENTlink written 3.7 years ago by Carlo Yague4.6k
0
gravatar for Pietro
3.7 years ago by
Pietro70
Italy
Pietro70 wrote:

Thanks to both of you!

ADD COMMENTlink written 3.7 years ago by Pietro70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1561 users visited in the last hour