Question: Filter for nonoverlapping genomic hits and keep the longest one
0
gravatar for lessismore
5 weeks ago by
lessismore990
Mexico
lessismore990 wrote:

Dear all,

i'm dealing with genomic coordinates of TF binding site mapping and i would like to filter the huge FIMO output for the non-overlapping sites. For each overlapping site i would like to keep the longest hit. Is there a way or a package to do this in R ? here i attach an example of my data:

family  start   end
ERF 1891    1911
ERF 1896    1915
ERF 1896    1914
ERF 1897    1911
ERF 1690    1704
ERF 937 957
ERF 1680    1700
ERF 1891    1905
ERF 2789    2809
ERF 642 661
ERF 2788    2806
ERF 1890    1908
ERF 1001    1015
ERF 2789    2803

to import it in R:

starting_data <- structure(list(family = c("ERF", "ERF", "ERF", "ERF", "ERF", 
"ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF", "ERF"
), start = c(1891L, 1896L, 1896L, 1897L, 1690L, 937L, 1680L, 
1891L, 2789L, 642L, 2788L, 1890L, 1001L, 2789L), end = c(1911L, 
1915L, 1914L, 1911L, 1704L, 957L, 1700L, 1905L, 2809L, 661L, 
2806L, 1908L, 1015L, 2803L)), class = "data.frame", row.names = c(NA, 
-14L))

What i would like to have:

family  start   end
ERF 1891    1911
ERF 1680    1700
ERF 937 957
ERF 2789    2809
ERF 642 661
ERF 1001    1015

to import it in R

desired_output <- structure(list(family = c("ERF", "ERF", "ERF", "ERF", "ERF", 
"ERF"), start = c(1891L, 1680L, 937L, 2789L, 642L, 1001L), end = c(1911L, 
1700L, 957L, 2809L, 661L, 1015L)), class = "data.frame", row.names = c(NA, 
-6L))

To apply this solution to all my output, including the other TF families, i'll use the group_by(family) option on tidyverse.
Thanks in advance for any tip

ADD COMMENTlink modified 5 weeks ago by rpolicastro3.9k • written 5 weeks ago by lessismore990
1
gravatar for rpolicastro
5 weeks ago by
rpolicastro3.9k
Bloomington, IN
rpolicastro3.9k wrote:

Convert into IRanges.

library("IRanges")
library("tidyverse")

df <- starting_data %>%
  split(.$family) %>%
  map(~IRanges(start=.x$start, end=.x$end))

Merge reduced set of ranges into the original ranges, and then select the longest range of overlapping ranges.

longest <- df %>%
  map(~findOverlapPairs(.x, reduce(.x)) %>% as_tibble) %>%
  bind_rows(.id="family") %>%
  group_by(across(c(starts_with("second"), "family"))) %>%
  filter(first.width == max(first.width)) %>%
  ungroup %>%
  rename_with(.cols=starts_with("first"), .fn=~str_replace(.x, "first.", ""))

> longest
# A tibble: 6 x 7
  family start   end width second.start second.end second.width
  <chr>  <int> <int> <int>        <int>      <int>        <int>
1 ERF     1891  1911    21         1890       1915           26
2 ERF      937   957    21          937        957           21
3 ERF     1680  1700    21         1680       1704           25
4 ERF     2789  2809    21         2788       2809           22
5 ERF      642   661    20          642        661           20
6 ERF     1001  1015    15         1001       1015           15
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by rpolicastro3.9k

Congrats <3 That's a great solution!

ADD REPLYlink written 5 weeks ago by lessismore990
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1933 users visited in the last hour
_