R Merging By Either Of Two Alternate Columns
1
0
Entering edit mode
3.3 years ago

I'm running into a bit of an issue as a new R user, as I have two genome annotation tables that I need to merge by the start and stop positions of each gene. The gene names themselves and the gene functions are to different to merge on. While many of my genes have the same start/stop in both tables, many only have the same start or the same end.

Is there a way to easily use some kind of "or" command to fix this?

An example of the two dataframes is below:

Data Table 1

Start  Stop  Gene_Name                 
1        6       A
8        15      B
21       27      C

Data Table 2

Start  Stop  Gene_Name
1       6    Gene1
8       14   Gene2
21      25   Gene3
R • 666 views
ADD COMMENT
0
Entering edit mode

Is there always a matching start or stop, or can you have overlaps of ranges without having the same start and/or stop?

ADD REPLY
0
Entering edit mode

There should always be a matching start or stop

ADD REPLY
1
Entering edit mode
3.3 years ago

If you are fine with just joining any overlapping ranges despite having an exact match to the start or stop, data.table would be the best solution. If not go to the tidyverse solution at the end.

I'll assume you're starting from two data.frames we will call DT1 and DT2 respectively.

DT1 <- structure(list(Start = c(1L, 8L, 21L), Stop = c(6L, 15L, 27L), 
    Gene_Name = c("A", "B", "C")), class = "data.frame", row.names = c(NA, 
-3L))

DT2 <- structure(list(Start = c(1L, 8L, 21L), Stop = c(6L, 14L, 25L), 
    Gene_Name = c("Gene1", "Gene2", "Gene3")), class = "data.frame", row.names = c(NA, 
-3L))

Using foverlaps from data.table to join based on any overlaps.

library("data.table")

setDT(DT1, key=c("Start", "Stop"))
setDT(DT2, key=c("Start", "Stop"))

overlaps <- foverlaps(DT2, DT1)

> overlaps
   Start Stop Gene_Name i.Start i.Stop i.Gene_Name
1:     1    6         A       1      6       Gene1
2:     8   15         B       8     14       Gene2
3:    21   27         C      21     25       Gene3

If you really need to join by either a matching start or stop the easy/ugly way would be using tidyverse. Someone might come around with a quicker and more clever solution. If they do listen to them.

library("tidyverse")

overlaps <- DT1 %>%
  rowwise %>%
  mutate(matches=list(DT2[DT2$Start == Start | DT2$Stop == Stop, ])) %>%
  unnest(matches, names_repair="unique")

> overlaps
# A tibble: 3 x 6
  Start...1 Stop...2 Gene_Name...3 Start...4 Stop...5 Gene_Name...6
      <int>    <int> <chr>             <int>    <int> <chr>        
1         1        6 A                     1        6 Gene1        
2         8       15 B                     8       14 Gene2        
3        21       27 C                    21       25 Gene3
ADD COMMENT

Login before adding your answer.

Traffic: 2044 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6