How to combine a data frame with another data frame containing comma-separated values?
7 weeks ago
sasa ▴ 10

Does anyone have a good way to combine a data frame with one containing comma-separated values?

Data frame A has 2 variables (mouse gene name and Ensembl ID):

head(dfA)
#     gene         EnsembleID    
# 1  Hoxa9 ENSMUSG00000038227               
# 2 Zfp128 ENSMUSG00000060397 
# 3 Zfp853 ENSMUSG00000093910 
# 4  Nr1h2 ENSMUSG00000060601               
# 5  Nr1h3 ENSMUSG00000002108               
# 6  Nr1h4 ENSMUSG00000047638               

Data frame B has 3 variables (orthogroup, mouse gene Ensembl ID, and zebrafish gene Ensembl ID). Note that in the picture below the last two columns are named Mus_musculus.GRCm39.pep.all and Danio_rerio.GRCz11.pep.all, respectively.

[Image: data frame B]

I want to add a new "zebrafish Ensembl ID" column to A, looked up from B. The final data I need looks like this:
[Image: desired result, data frame A with an added zebrafish Ensembl ID column]

The catch is that in data frame B, both Ensembl ID columns contain comma-separated values. The file comes from the output of OrthoFinder.

If there were only a single value in each row, I understand we could easily combine the data frames with the merge function in R. Data frame A has 1700 rows (= 1700 genes), so I would like to do this with code rather than searching and copy-pasting manually. Any suggestions or ideas would be really appreciated!

frame python R orthofinder comma data • 384 views

You can use tidyr::separate_rows to split the DF so it has one ID per entry. See: https://tidyr.tidyverse.org/reference/separate_rows.html
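To illustrate the suggestion, here is a minimal sketch on a made-up data frame (the column names mus_id and zeb_id and all IDs are placeholders, not taken from the real file). It splits each comma-separated column in turn so that every row ends up with a single ID per column:

```r
library(tidyr)

# Toy stand-in for data frame B; all IDs are invented for illustration
toy <- data.frame(mus_id = c("mus001", "mus002,mus003"),
                  zeb_id = c("dar001,dar002", "dar003"),
                  stringsAsFactors = FALSE)

# Split one column at a time; each call duplicates the other columns as needed
long <- tidyr::separate_rows(toy, mus_id, sep = ",")
long <- tidyr::separate_rows(long, zeb_id, sep = ",")

print(long)
```

Once the table is in this long form, an ordinary merge or join on the mouse ID should work. In tidyr 1.3.0 and later, separate_rows is marked as superseded in favour of separate_longer_delim, though it is still available.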


Thank you very much. I didn't think of it that way and need to learn more. I was able to solve this issue using the answer below.

7 weeks ago
rioualen ▴ 570

For data frame manipulation in general, you should look into the dplyr and tidyr packages. They offer endless possibilities once you learn to use them (lots of practice will help). A good and concise cheatsheet is available here.

Regarding this problem in particular, something like this should work:

library(dplyr)
library(tidyr)

dfA <- data.frame(gene = c("a", "b", "c", "d", "e"),
                  mus_id = c("mus001", "mus002", "mus003", "mus004", "mus005"))

dfB <- data.frame(mus_id = c("mus001", "mus002", "mus003,mus002", "mus004"),
                  zeb_id = c("dar001", "dar002,dar004", "dar003,dar002", "dar004"))

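# Build a lookup: one row per mouse ID, with its zebrafish IDs collapsed
dfC <- dfB %>%
  tidyr::separate_rows(zeb_id, sep = ",") %>%   # one zebrafish ID per row
  tidyr::separate_rows(mus_id, sep = ",") %>%   # one mouse ID per row
  dplyr::group_by(mus_id) %>%
  dplyr::summarise(zeb_id = paste0(sort(unique(na.omit(zeb_id))), collapse = ","))

# Keep every row of dfA; genes without a match in dfC get NA
dfD <- dfA %>%
  dplyr::left_join(dfC, by = "mus_id")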
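For the real data, the same idea might look like the sketch below. It uses the column names mentioned in the question (EnsembleID in A; Mus_musculus.GRCm39.pep.all and Danio_rerio.GRCz11.pep.all in B). Everything else is an assumption: the Orthogroup column name, the zebrafish IDs (invented placeholders), the grouping of mouse IDs into orthogroups, and the idea that IDs may be separated by a comma followed by a space, which the regex ",\\s*" handles either way. It also assumes the IDs in B are the same kind of ID as in A.

```r
library(dplyr)
library(tidyr)

# Small stand-ins for the real tables; zebrafish IDs are placeholders
dfA <- data.frame(gene = c("Hoxa9", "Zfp128", "Nr1h2"),
                  EnsembleID = c("ENSMUSG00000038227", "ENSMUSG00000060397",
                                 "ENSMUSG00000060601"),
                  stringsAsFactors = FALSE)

dfB <- data.frame(
  Orthogroup = c("OG1", "OG2"),  # column name assumed
  Mus_musculus.GRCm39.pep.all = c("ENSMUSG00000038227, ENSMUSG00000060397",
                                  "ENSMUSG00000093910"),
  Danio_rerio.GRCz11.pep.all = c("zeb001, zeb002", "zeb003"),
  stringsAsFactors = FALSE
)

# Rename to match dfA, split both ID columns, then collapse per mouse ID
lookup <- dfB %>%
  dplyr::select(EnsembleID = Mus_musculus.GRCm39.pep.all,
                zebrafish_id = Danio_rerio.GRCz11.pep.all) %>%
  tidyr::separate_rows(EnsembleID, sep = ",\\s*") %>%
  tidyr::separate_rows(zebrafish_id, sep = ",\\s*") %>%
  dplyr::filter(!is.na(zebrafish_id), zebrafish_id != "") %>%
  dplyr::group_by(EnsembleID) %>%
  dplyr::summarise(zebrafish_id = paste(sort(unique(zebrafish_id)), collapse = ","))

# Left join keeps all rows of dfA; unmatched genes get NA
result <- dplyr::left_join(dfA, lookup, by = "EnsembleID")
print(result)
```

Naming the join column explicitly with by avoids relying on dplyr guessing the shared column, which matters here because the two tables use different names for the mouse ID.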

That works perfectly. Thank you so much!
