Question: How to check if a gene is comprised in a certain interval? with R
1
3.7 years ago by
Pietro70
Italy
Pietro70 wrote:

Hi all,

In R, I have a data frame with physical position of some genes in bp on the left and intervals (again in bp) on the right. I want to check for each gene, whether it falls in the the interval of the same chromosome.

 GeneID Chr Position Interval x 1 18697251 1:3099740-8804450 y 1 19546617 1:3422930-8804450 z 1 3332236 2:2751757-4502486 a 2 3993537 2:6187995-8804450 b 2 9983384 2:4864334-18271005 c 2 11479025 2:8222941-18271005

For example: Is gene x on chr 1 comprised in one of the 2 intervals displayed for chr 1? I expect TRUE or FALSE.

I managed to do this giving the gene positions one by one, but the list is very big, I'd like this to be automated.

Any help much appreciated!

gene sequence R genome • 1.2k views
modified 3.7 years ago • written 3.7 years ago by Pietro70
2
3.7 years ago by
Alex Reynolds28k
Seattle, WA USA
Alex Reynolds28k wrote:

If you use `strsplit()`, you could test this directly. For instance:

```> v <- '1:3099740-8804450'
> unlist(strsplit(v, '[:-]'))
[1] "1"       "3099740" "8804450"```

Then you can make a function out of it and print out the result of the desired tests via `apply()`:

```> t <- read.table("some.table", header=T)
> interval <- function(str) unlist(strsplit(str, '[:-]'))
> apply(t, 1, function(x) (x[2] == interval(x[4])[1]) && (as.numeric(x[3]) > as.numeric(interval(x[4])[2])) && (as.numeric(x[3]) < as.numeric(interval(x[4])[3])))
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE```

We leave the chromosome name as a character vector to deal with cases like `X` and `Y`, and convert the position and interval position elements to numerics via `as.numeric()` to apply numerical relation operators.

The `apply` call can also be turned into a function:

```> interval_test <- function(t) apply(t, 1, function(x) (x[2] == interval(x[4])[1]) && (as.numeric(x[3]) > as.numeric(interval(x[4])[2])) && (as.numeric(x[3]) < as.numeric(interval(x[4])[3])))
```

This function can be used to filter your table for rows that fall within the interval:

```> t[interval_test(t),]
GeneID Chr Position           Interval
5      b   2  9983384 2:4864334-18271005
6      c   2 11479025 2:8222941-18271005```

Or, perhaps, used to filter for rows which do not fall within the interval, by using the `!` operator:

```> t[!interval_test(t),]
GeneID Chr Position          Interval
1      x   1 18697251 1:3099740-8804450
2      y   1 19546617 1:3422930-8804450
3      z   1  3332236 2:2751757-4502486
4      a   2  3993537 2:6187995-8804450```

Much more elegant than my answer !

1

In R, replace `for` loops with calls to `apply` and similar functions in the `apply` family, whenever possible. The code is cleaner and often runs faster.

I now that in theory, but in practice, the loops comes to me more naturally than the apply synthax. Shame on me ^^

PS : in your code, you assume that` interval(x[4])[2]) < interval(x[4])[3]) `which is not always the case in the OP example, perhaps for genes on the - strand (see the first gene x for instance). It depends of what you want to select of course, but in your case, every time that `interval(x[4])[2]) > interval(x[4])[3])`, the function will return false.

Thanks Alex, but if i got your answer right, in this way I only test the corresponding interval on the right. Instead, in my data frame I have 3 genes on chr 1 each to be tested on 2 intervals, and 3 genes on chr 2 each to be tested on 4 intervals.

Please tell me if it's not clear.

Thanks again.

Ah, I misunderstand your question. You could perhaps collate intervals in all rows to per-chromosome interval lists, then write another function to test each row's chromosome and position against all the intervals in the associated chromosome's list. I think there's enough detail in my answer to make a start on pre-processing the table into interval lists, but it might be easier to just export both your intervals and positions to separate BED files, and then use BEDOPS bedops --element-of to see where there are overlaps.

To summarize the data by gene and buidling on Alex answer, you could try :

```t\$interval_test = interval_test(t) # simply attach the T/F array to the dataframe
by( t\$GeneID, t\$interval_test, sum) # return the number of TRUE by gene.
by( t\$GeneID, t\$interval_test, function(x) sum(x)/length(x)) # return number of TRUE by gene divided by the number of entry in you dataframe by gene.```

1
3.7 years ago by
Carlo Yague4.6k
Belgium
Carlo Yague4.6k wrote:

I have a solution with a loop, although it is not very elegant (code not tested, you may need to debug it). I assume that all your data is as factors (this is why I use `as.numeric(as.character())` for instance).

```n=length(myData\$GeneID)
answer=rep(NA,n)   # this will be an array of TRUE/FALSE
for (i in 1:n){
temp = strsplit(as.character(myData\$Interval[i]),":")[[1]]  # that gives you c("1","309974-880445")
answer[i] = temp[1]==as.character(myData\$Chr[i])  #first test chromosome
if (answer[i]){    # no need to go further if the chromosome is wrong
temp = strsplit(temp[2],"-")[[1]]   # that gives you c("309974","880445")
pos = as.numeric(as.character(myData\$Position[i])) # in case your data is as factor
# Now lets test if position is between intervals.
# I assume here that the interval can be in the "wrong" order and that its ok for you.
answer[i] = (pos < temp[1] & pos > temp2) | (pos > temp[1] & pos < temp2)
}
}```
```round(sum(answer)/length(answer)*100,2) # give you percentage of rightly positionned genes.
myData\$GeneID[!answer]  # give you the names of the wrongly positionned genes.
```
0
3.7 years ago by
Pietro70
Italy
Pietro70 wrote:

Thanks to both of you!