Question

Merging multiple single cell gene expression matrices (in .txt format) into one .txt/.csv/df

1

Entering edit mode

3.3 years ago

DrAcula ▴ 140

Dear All,

First of all my apologies if this has been answered somewhere else and I haven't understood it properly, but I am having a somewhat irritating problem with merging single-cell RNA seq datasets.

I know that across donors it's probably best to perform integration, but I wanted to see what happens when one performs merging.

So, the dataset I am trying to merge contains 64 .txt files (dataset is the 'supplementary file' data: GSE120180_RAW.tar, which can be found here: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE120180).

as an example, running:

dataset1 <- read.delim("~/GSM3395418_CE01_I1_UMI_count.txt", sep="", header = TRUE)
dataset1[c(1:10), 1:3]

yields:

      Gene YF1_I_sc.1 YF1_I_sc.2
1      A2M          0          0
2    A2ML1          0          0
3  A3GALT2          0          0
4   A4GALT          0          0
5     AAAS          2          0
6     AACS         15         12
7    AADAC          0          0
8  AADACL2          0          0
9  AADACL3          1          0
10 AADACL4          0          0

Now, how can I merge all of these datasets, based on the "Gene" column, assuming:

Columns across multiple files may be named similarly, for example there may be a "YF1_I_sc.1" in another file.
Some files may have different "numbers" of total genes, for example the length of the gene column may vary across samples.
If the code you suggest generates NAs how to remove the NAs and replace with 0s.

It would be helpful to create a list I feel, so I ran this to create a list containing 64 dataframes:

file_list <- list.files() #where list.files points towards a wkdir containing the 64 .txt files
list_of_files <- lapply(file_list, read.delim)

however, when I run merged_nhp<- merge_all(list_of_files, by = "Gene" I get only one file containing 97 cells, wheras there are ~5k cells, across the 64 experiments.

Thanks for the help!

Best,

Fahd

RNA-Seq R sequencing • 2.3k views

ADD COMMENT • link updated 3.3 years ago by rpolicastro 13k • written 3.3 years ago by DrAcula ▴ 140

score 1 · Answer 1 · 2021-01-04

1

Entering edit mode

3.3 years ago

rpolicastro 13k

Your goal is a little unclear, but it seems like you just want to analyze each cell from every sample like they were all from the same sample. In that case you should just merge the list of seurat objects.

library("Seurat")
library("tidyverse")

file_list <- list.files(...)
names(file_list) <- basename(file_list)

seu <- imap(file_list, function(x, y) {
  cts <- x %>%
    read_tsv %>%
    column_to_rownames("Gene") %>%
    as.matrix %>%
    CreateSeuratObject(project=y)
  return(cts)
})

seu <- merge(seu[[1]], seu[2:length(seu)])

ADD COMMENT • link 3.3 years ago by rpolicastro 13k

0

Entering edit mode

Thanks for the comment @rpolicastro ! When I run your code, I get the following error:

 Error in column_to_rownames(., "Gene") : is.data.frame(.data) is not TRUE

Traceback:

13.
stop(simpleError(msg, call = if (p <- sys.parent(1L)) sys.call(p))) 
12.
stopifnot(is.data.frame(.data)) 
11.
column_to_rownames(., "Gene") 
10.
function_list[[i]](value) 
9.
freduce(value, `_function_list`) 
8.
`_fseq`(`_lhs`) 
7.
eval(quote(`_fseq`(`_lhs`)), env, env) 
6.
eval(quote(`_fseq`(`_lhs`)), env, env) 
5.
withVisible(eval(quote(`_fseq`(`_lhs`)), env, env)) 
4.
read_tsv %>% column_to_rownames("Gene") %>% as.matrix %>% CreateSeuratObject(project = y) 
3.
.f(.x[[i]], .y[[i]], ...) 
2.
map2(.x, vec_index(.x), .f, ...) 
1.
imap(list_of_files, function(x, y) {
    cts <- x
    read_tsv %>% column_to_rownames("Gene") %>% as.matrix %>% 
        CreateSeuratObject(project = y) ...

My current working directory has the 64 files in it. Any Idea why the function requires the data to be a dataframe?

ADD REPLY • link 3.3 years ago by DrAcula ▴ 140

0

Entering edit mode

Try loading just one file in to see if it returns a proper data.frame. cts <- read_tsv(list_of_files[1]). If the files are in your working directory also make sure you don't have the 3 dots in list.files(). I just had that as a reminder to put that path where the files are stored.

ADD REPLY • link 3.3 years ago by rpolicastro 13k

0

Entering edit mode

Ok, so sorry I didn't read the first two lines of your first comment, just to elaborate, my goal is to perform non-linear multidimensional reduction, on this dataset GSE120180, via either merging or integrating each of the donors. This dataset consists of 16 donors that were single-cell sequenced in what appears to be batches of 4 each time (it's unclear whether these are technical replicates or not, but based on the total cell number in the paper, I am assuming that the sequencing was performed in batches of 4).

My first goal is to perform merging, where my assumption is that donor heterogeneity will be compensated due to strong cellular phenotypes as it is unclear whether the authors used integration or mergeing. In any case, this will be useful, because each of the donor 'batches' will have to be merged forming 16 files/dfs prior to generating seurat objects and integrating.

Now coming back to your solution, I ran cts <- list_of_files[[1]] because when you run list_of_files <- lapply(file_list, read.delim) you get a list of 64 dfs. But just to check anyway, I ran cts <- list_of_files[[1]] and then cts[c(1:10), 1:3] to get the following:

      Gene YF1_I_sc.1 YF1_I_sc.2
1      A2M          0          0
2    A2ML1          0          0
3  A3GALT2          0          0
4   A4GALT          0          0
5     AAAS          2          0
6     AACS         15         12
7    AADAC          0          0
8  AADACL2          0          0
9  AADACL3          1          0
10 AADACL4          0          0

Also is.data.frame(cts) yields a TRUE I would like to point out, it's my understanding of the list.files here is a character vector of file names, while list_of_files is a list of dfs (please correct me if I am wrong). Thanks alot, im sure you have better things to do, but in the past I have been manually creating individual seurat objects and then merging them all using merge in seurat, and quite honestly I think now that thats just dumb. Im hoping your solutions will work and allow me to modify it to merge large datasets, organized as lists of dataframes (dfs) with ease. Any suggestions? Thanks again for the help.

ADD REPLY • link 3.3 years ago by DrAcula ▴ 140

1

Entering edit mode

Alright, so I think I see what happened. According to what you said you used list_of_files, which is a list of data.frames as input to imap. I was starting with a named vector of file names. If you want to start with list_of_files you can run this slightly modified code.

seu <- map(list_of_files, function(x) {
  cts <- x %>%
    column_to_rownames("Gene") %>%
    as.matrix %>%
    CreateSeuratObject(project=x)
  return(cts)
})

seu <- merge(seu[[1]], seu[2:length(seu)])

Also don't worry about asking questions! As long as you are putting thought into them no one minds.

ADD REPLY • link 3.3 years ago by rpolicastro 13k

0

Entering edit mode

WOW. That was dumb of me. I just realized what was going on... just to clarify the sequence of events:

file_list <- list.files("~/Rawdata") #where list.files points towards a dir containing 'x number of' .txt files
names(file_list) <- basename(file_list)

seu <- imap(file_list, function(x, y) {  #function to create a list of Seurat objects called seu of all items in file_list
  cts <- x %>%
    read_tsv %>%
    column_to_rownames("Gene") %>%
    as.matrix %>%
    CreateSeuratObject(project=y)
  return(cts)
})
seu <- merge(seu[[1]], seu[2:length(seu)]) #merging all seurat objects in list seu

Thanks alot @rpolicastro ! Your solution works like a charm. It Also allows you to define which column is the gene list. This helps when working with Ensembl Ids etc. Cheers have a great day!

ADD REPLY • link 3.3 years ago by DrAcula ▴ 140

0

Entering edit mode

Your comments in the code look good. Glad I could help!

ADD REPLY • link 3.3 years ago by rpolicastro 13k

0

Entering edit mode

Ok, so this may be incorrect but I think it may be a solution.

Using the Reduce base R function:

merged.nhp = Reduce(function(...) merge(..., all=T), list_of_files)
tail(merged.nhp)

However I know that Reduce() combines from the left, and since the first column is a gene list, its easy to combine I guess. But what if the gene list is within the dataset? More control would be favorable I guess....

ADD REPLY • link 3.3 years ago by DrAcula ▴ 140