Question: R Merging By Either Of Two Alternate Columns
0
gravatar for canto12project
15 days ago by
canto12project0 wrote:

I'm running into a bit of an issue as a new R user, as I have two genome annotation tables that I need to merge by the start and stop positions of each gene. The gene names themselves and the gene functions are to different to merge on. While many of my genes have the same start/stop in both tables, many only have the same start or the same end.

Is there a way to easily use some kind of "or" command to fix this?

An example of the two dataframes is below:

Data Table 1

Start  Stop  Gene_Name                 
1        6       A
8        15      B
21       27      C

Data Table 2

Start  Stop  Gene_Name
1       6    Gene1
8       14   Gene2
21      25   Gene3
R • 133 views
ADD COMMENTlink modified 15 days ago by rpolicastro3.2k • written 15 days ago by canto12project0

Is there always a matching start or stop, or can you have overlaps of ranges without having the same start and/or stop?

ADD REPLYlink written 15 days ago by rpolicastro3.2k

There should always be a matching start or stop

ADD REPLYlink written 15 days ago by canto12project0
1
gravatar for rpolicastro
15 days ago by
rpolicastro3.2k
Bloomington, IN
rpolicastro3.2k wrote:

If you are fine with just joining any overlapping ranges despite having an exact match to the start or stop, data.table would be the best solution. If not go to the tidyverse solution at the end.

I'll assume you're starting from two data.frames we will call DT1 and DT2 respectively.

DT1 <- structure(list(Start = c(1L, 8L, 21L), Stop = c(6L, 15L, 27L), 
    Gene_Name = c("A", "B", "C")), class = "data.frame", row.names = c(NA, 
-3L))

DT2 <- structure(list(Start = c(1L, 8L, 21L), Stop = c(6L, 14L, 25L), 
    Gene_Name = c("Gene1", "Gene2", "Gene3")), class = "data.frame", row.names = c(NA, 
-3L))

Using foverlaps from data.table to join based on any overlaps.

library("data.table")

setDT(DT1, key=c("Start", "Stop"))
setDT(DT2, key=c("Start", "Stop"))

overlaps <- foverlaps(DT2, DT1)

> overlaps
   Start Stop Gene_Name i.Start i.Stop i.Gene_Name
1:     1    6         A       1      6       Gene1
2:     8   15         B       8     14       Gene2
3:    21   27         C      21     25       Gene3

If you really need to join by either a matching start or stop the easy/ugly way would be using tidyverse. Someone might come around with a quicker and more clever solution. If they do listen to them.

library("tidyverse")

overlaps <- DT1 %>%
  rowwise %>%
  mutate(matches=list(DT2[DT2$Start == Start | DT2$Stop == Stop, ])) %>%
  unnest(matches, names_repair="unique")

> overlaps
# A tibble: 3 x 6
  Start...1 Stop...2 Gene_Name...3 Start...4 Stop...5 Gene_Name...6
      <int>    <int> <chr>             <int>    <int> <chr>        
1         1        6 A                     1        6 Gene1        
2         8       15 B                     8       14 Gene2        
3        21       27 C                    21       25 Gene3
ADD COMMENTlink modified 15 days ago • written 15 days ago by rpolicastro3.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 989 users visited in the last hour
_