Question

R Merging By Either Of Two Alternate Columns

0

Entering edit mode

3.3 years ago

canto12project • 0

I'm running into a bit of an issue as a new R user, as I have two genome annotation tables that I need to merge by the start and stop positions of each gene. The gene names themselves and the gene functions are to different to merge on. While many of my genes have the same start/stop in both tables, many only have the same start or the same end.

Is there a way to easily use some kind of "or" command to fix this?

An example of the two dataframes is below:

Data Table 1

Start  Stop  Gene_Name                 
1        6       A
8        15      B
21       27      C

Data Table 2

Start  Stop  Gene_Name
1       6    Gene1
8       14   Gene2
21      25   Gene3

R • 666 views

ADD COMMENT • link updated 3.3 years ago by rpolicastro 13k • written 3.3 years ago by canto12project • 0

0

Entering edit mode

Is there always a matching start or stop, or can you have overlaps of ranges without having the same start and/or stop?

ADD REPLY • link 3.3 years ago by rpolicastro 13k

0

Entering edit mode

There should always be a matching start or stop

ADD REPLY • link 3.3 years ago by canto12project • 0

score 1 · Answer 1 · 2021-01-08

If you are fine with just joining any overlapping ranges despite having an exact match to the start or stop, data.table would be the best solution. If not go to the tidyverse solution at the end.

I'll assume you're starting from two data.frames we will call DT1 and DT2 respectively.

DT1 <- structure(list(Start = c(1L, 8L, 21L), Stop = c(6L, 15L, 27L), 
    Gene_Name = c("A", "B", "C")), class = "data.frame", row.names = c(NA, 
-3L))

DT2 <- structure(list(Start = c(1L, 8L, 21L), Stop = c(6L, 14L, 25L), 
    Gene_Name = c("Gene1", "Gene2", "Gene3")), class = "data.frame", row.names = c(NA, 
-3L))

Using foverlaps from data.table to join based on any overlaps.

library("data.table")

setDT(DT1, key=c("Start", "Stop"))
setDT(DT2, key=c("Start", "Stop"))

overlaps <- foverlaps(DT2, DT1)

> overlaps
   Start Stop Gene_Name i.Start i.Stop i.Gene_Name
1:     1    6         A       1      6       Gene1
2:     8   15         B       8     14       Gene2
3:    21   27         C      21     25       Gene3

If you really need to join by either a matching start or stop the easy/ugly way would be using tidyverse. Someone might come around with a quicker and more clever solution. If they do listen to them.

library("tidyverse")

overlaps <- DT1 %>%
  rowwise %>%
  mutate(matches=list(DT2[DT2$Start == Start | DT2$Stop == Stop, ])) %>%
  unnest(matches, names_repair="unique")

> overlaps
# A tibble: 3 x 6
  Start...1 Stop...2 Gene_Name...3 Start...4 Stop...5 Gene_Name...6
      <int>    <int> <chr>             <int>    <int> <chr>        
1         1        6 A                     1        6 Gene1        
2         8       15 B                     8       14 Gene2        
3        21       27 C                    21       25 Gene3