Question

Replace Drugbank IDs with Drug name [R]


3.0 years ago

arissaoulidis • 0

Hello friends,

I have a dataset of genes associated with drugs, derived from DrugBank. I simply want to translate all the DrugBank IDs into human-readable drug names.

The drugbank vocabulary (.csv) looks like this: [DBvocabulary.csv]

DrugBank.ID,Common.name
DB00001,Lepirudin
DB00002,Cetuximab
DB00003,Dornase alfa
DB00004,Denileukin diftitox
DB00005,Etanercept
DB00006,Bivalirudin

My dataset (.csv) has 15 columns but the important ones are:

[all_ph_active.csv]

Gene.Name,DrugBank.ID   
F8,DB09130
TCN2,DB00200
LDLR,DB09270; DB11251; DB14003
ALB,DB00070; DB00137; DB00159; DB00162; DB00214

As you can see, some genes are linked to multiple or even hundreds of drugs. The multiple drug IDs are separated by semicolons within the same comma-delimited column. The R match and merge functions only work for the first identifier in each cell, effectively dropping the remaining IDs in that cell.

Any advice is welcome, thanks in advance!

r dictionary drugbank • 1.0k views

updated 3.0 years ago by rpolicastro 13k • written 3.0 years ago by arissaoulidis • 0


There are several ways to do this (a join, for example), but I would suggest the sqldf package in R. This can also be done outside R.

3.0 years ago by cpad0112 21k
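
For reference, one way the "join" route from the comment above could look in base R, without extra packages. This is only a sketch on toy data: the object names vocab, genes, long, joined and collapsed are made up for illustration, and the "; " output separator is an arbitrary choice.

```r
# Toy data mirroring the question (names vocab/genes are hypothetical)
vocab <- data.frame(
  DrugBank.ID = c("DB00001", "DB00002", "DB00003"),
  Common.name = c("Lepirudin", "Cetuximab", "Dornase alfa"),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  Gene.Name   = c("F8", "LDLR"),
  DrugBank.ID = c("DB00001", "DB00002; DB00003; DB14003"),
  stringsAsFactors = FALSE
)

# Split each cell on ";" plus optional whitespace into a list of ID vectors
id_lists <- strsplit(genes$DrugBank.ID, ";\\s*")

# Expand to one row per gene-drug pair
long <- data.frame(
  Gene.Name   = rep(genes$Gene.Name, lengths(id_lists)),
  DrugBank.ID = unlist(id_lists),
  stringsAsFactors = FALSE
)

# Left join: all.x = TRUE keeps IDs that are missing from the vocabulary (name is NA)
joined <- merge(long, vocab, by = "DrugBank.ID", all.x = TRUE)

# Optionally collapse back to one row per gene; paste() writes NA as "NA"
collapsed <- aggregate(
  joined[c("DrugBank.ID", "Common.name")],
  by = list(Gene.Name = joined$Gene.Name),
  FUN = paste, collapse = "; "
)
```

Note that merge sorts by the join key by default, so the order of IDs inside a collapsed cell may differ from the order in the input file.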

score 0 · Answer 1 · 2021-04-03

Your data. I slightly modified df2 to have some matches.

df1 <- structure(list(DrugBank.ID = c("DB00001", "DB00002", "DB00003", 
"DB00004", "DB00005", "DB00006"), Common.name = c("Lepirudin", 
"Cetuximab", "Dornase alfa", "Denileukin diftitox", "Etanercept", 
"Bivalirudin")), class = "data.frame", row.names = c(NA, -6L))

df2 <- structure(list(Gene.Name = c("F8", "TCN2", "LDLR", "ALB"), DrugBank.ID = c("DB00001", 
"DB00002", "DB00003; DB00004; DB14003", "DB00005; DB00006; DB00159; DB00162; DB00214"
)), class = "data.frame", row.names = c(NA, -4L))

Tidyverse answer

library("tidyverse")

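df <- df2 %>%
  # one row per gene-drug pair
  separate_rows(DrugBank.ID, sep="; ") %>%
  # look up the name for each ID; unmatched IDs get NA
  left_join(df1, by="DrugBank.ID") %>%
  group_by(Gene.Name) %>%
  # collapse back to one row per gene, writing NA as the string "NA"
  summarize(across(c(DrugBank.ID, Common.name), ~str_c(str_replace_na(.x), collapse=";")))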

> df
# A tibble: 4 x 3
  Gene.Name DrugBank.ID                          Common.name                    
  <chr>     <chr>                                <chr>                          
1 ALB       DB00005;DB00006;DB00159;DB00162;DB0… Etanercept;Bivalirudin;NA;NA;NA
2 F8        DB00001                              Lepirudin                      
3 LDLR      DB00003;DB00004;DB14003              Dornase alfa;Denileukin diftit…
4 TCN2      DB00002                              Cetuximab

You can leave out the group_by and summarize steps if you want to keep the data in proper long format, with one row per gene-drug pair.
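
A minimal sketch of that long-format variant, on a cut-down copy of the data. The names vocab, genes and long are hypothetical, and the regular expression ";\\s*" is my own choice so that IDs separated by ";" with or without a trailing space are both handled.

```r
library(dplyr)
library(tidyr)

# Cut-down toy data (names vocab/genes are hypothetical)
vocab <- data.frame(
  DrugBank.ID = c("DB00001", "DB00002"),
  Common.name = c("Lepirudin", "Cetuximab"),
  stringsAsFactors = FALSE
)
genes <- data.frame(
  Gene.Name   = c("F8", "LDLR"),
  DrugBank.ID = c("DB00001", "DB00002; DB14003"),
  stringsAsFactors = FALSE
)

long <- genes %>%
  # split multi-ID cells into one row per gene-drug pair
  separate_rows(DrugBank.ID, sep = ";\\s*") %>%
  # attach drug names; IDs absent from the vocabulary keep NA
  left_join(vocab, by = "DrugBank.ID")
```

Rows whose Common.name is NA are the IDs that were not found in the vocabulary, which makes them easy to inspect with filter(long, is.na(Common.name)).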