How can I join multiple columns from different datasets together in R?
2
1
Entering edit mode
2.9 years ago
valentinavan ▴ 50

I have many different datasets from different samples. Each of these has two columns: A - species list and B - counts. Not all of these species are present in each of my samples therefore I would like to merge these columns together and when one of these species is not present it will be NA/zero.

In summary my new merged dataset needs to have: column A with all the species list and the following columns with the species counts for each of my samples.

I have done this with "full.join" but when I have many datasets to merge, this is not convenient anymore. I guess I would need to build a loop but I am not really good at it yet and I'd need some help.

Any help would be more than appreciated, I am stuck.

Thanks in advance

r • 1.9k views
ADD COMMENT
1
Entering edit mode

df code taken from Arup Ghosh 's post:

with dplyr:

> x %>% 
+     left_join(y) %>% 
+     left_join(z) %>% 
+     replace(is.na(.), 0)
Joining, by = "gene_symbol"
Joining, by = "gene_symbol"
  gene_symbol sample1 sample2 sample3
1           a       1       0       0
2           b       2       4       0
3           c       3       5       0
4           f       4       0       7
5           g       5       0       8

with sqldf:

> library(sqldf)

> sqldf('select x.*, y.sample2, z.sample3 from x 
        left join y on x.gene_symbol=y.gene_symbol 
        left join z on x.gene_symbol=z.gene_symbol')
ADD REPLY
0
2
Entering edit mode
2.9 years ago

Create a list of dataframes and use Reduce. PS. Make sure the name of the first column is the same in all the data frames and the second column name is unique(sample name).

x <- data.frame(gene_symbol = c("a","b","c","f","g"), sample1 = 1:5, stringsAsFactors=FALSE)
y <- data.frame(gene_symbol = c("b","c","d"), sample2 = 4:6, stringsAsFactors=FALSE)
z <- data.frame(gene_symbol = c("f","g","k","l"), sample3 = 7:10, stringsAsFactors=FALSE)

df_list <- list(x,y,z)

merged.df<-Reduce(function(...) merge(..., all=T), df_list)

merged.df[is.na(merged.df)]<-0

merged.df

  gene_symbol sample1 sample2 sample3
1           a       1       0       0
2           b       2       4       0
3           c       3       5       0
4           d       0       6       0
5           f       4       0       7
6           g       5       0       8
7           k       0       0       9
8           l       0       0      10

Source: https://stackoverflow.com/a/34393416/3548799

ADD COMMENT
0
Entering edit mode

Thank you!

ADD REPLY
1
Entering edit mode
2.9 years ago

Arup's solution is good but I find Reduce a bit cryptic. The solution below is pretty much a translation of "merge these datasets one after the other":

# Some dummy data:
x <- data.frame(gene_symbol = c("a","b","c","f","g"), sample1 = 1:5, stringsAsFactors=FALSE)
y <- data.frame(gene_symbol = c("b","c","d"), sample2 = 4:6, stringsAsFactors=FALSE)
z <- data.frame(gene_symbol = c("f","g","k","l"), sample3 = 7:10, stringsAsFactors=FALSE)


df_list <- list(x, y ,z)

mrg <- df_list[[1]]
for(i in 2:length(df_list)) {
    mrg <- merge(mrg, df_list[[i]], all= TRUE, by= 'gene_symbol')
}

mrg[is.na(mrg)] <- 0
ADD COMMENT
0
Entering edit mode

Thank you!

ADD REPLY

Login before adding your answer.

Traffic: 2503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6