Read in UMI count table from GEO for use in Seurat pipeline
3.8 years ago by pv6077

I'm an R beginner learning how to analyze single-cell sequencing data with Seurat. I wanted to work with a published Drop-seq UMI count dataset available via GEO:

https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2781nnn/GSM2781556/suppl/GSM2781556_Dropseq_120000_umi.txt.gz

I've encountered two errors trying to load this data into R:

1. Using data <- read.table(file = "/path/to/file", sep = '\t') results in a memory error: cannot allocate vector of size 125 Kb. I've tried raising memory.limit() to work around this (I have 8 GB of RAM), but R always crashes.

2. To get around the memory issue another way, I've tried read.table.ffdf with various combinations of parameters, e.g. row.names = 1, header = TRUE, but each attempt ends in an error such as attempt to set 'rownames' on an object with no dimensions or more columns than column names.

I think the issue comes down to the fact that I don't know what this data file looks like, and because it is a very large file (~4 GB) I haven't been able to open and view it myself, even with LTFviewer. Does anybody have tips on how to load large single-cell UMI count files for use in the Seurat pipeline? Would read.table.ffdf work if I found the correct parameters, or is there a better way to go about this altogether?

Thank you!

seurat R

Maybe this helps:

> library(data.table)
> tt <- fread("https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2781nnn/GSM2781556/suppl/GSM2781556_Dropseq_120000_umi.txt.gz")
 [100%] Downloaded 40425751 bytes...

Warning message:
In fread("https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2781nnn/GSM2781556/suppl/GSM2781556_Dropseq_120000_umi.txt.gz") :
  Detected 120000 column names but the data has 120001 columns (i.e. invalid file). Added 1 extra default column name for the first column which is guessed to be row names or an index. Use setnames() afterwards if this guess is not correct, or fix the file write command that created the file to create a valid file.
> 
> dim(tt)
[1]  16272 120001
> head(names(tt))
[1] "V1"     "Cell_1" "Cell_2" "Cell_3" "Cell_4" "Cell_5"
> head(tt[, 1:10])
                    V1 Cell_1 Cell_2 Cell_3 Cell_4 Cell_5 Cell_6 Cell_7 Cell_8 Cell_9
1:               128up      0      0      0      0      0      0      2      0      0
2:       14-3-3epsilon     11      0      7      7      4      6     16      3      5
3:          14-3-3zeta     20      1     10     13     10      6     30      0      5
4:               140up      0      0      0      0      0      0      0      0      0
5: 18SrRNA-Psi:CR41602      0      0      0      0      0      0      0      0      0
6:                 18w      0      0      0      0      0      0      0      0      0

data.tables tend to be a bit more manageable when memory is tight.
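
If even the full fread() call strains your 8 GB of RAM, you can peek at the file's layout first without loading everything: fread's nrows and select arguments parse only a slice. A minimal sketch on the same GEO URL (the whole file is still downloaded and decompressed, but only this slice is held in the resulting table):

library(data.table)
peek <- fread(
  "https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM2781nnn/GSM2781556/suppl/GSM2781556_Dropseq_120000_umi.txt.gz",
  nrows  = 5,    # first 5 data rows only
  select = 1:10  # first 10 columns only
)
peek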


I've put the results of the following lines of code here; let me know if that works for you.

library(Matrix)  # for Matrix(); DropletUtils must also be installed
sparse.mat <- Matrix(as.matrix(tt[, -1, with = FALSE]), sparse = TRUE)  # drop gene names, make sparse
DropletUtils::write10xCounts("GSM2781556", x = sparse.mat, gene.id = tt$V1)  # write 10X-format files
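
One caveat: with only 8 GB of RAM, the as.matrix() call above may itself run out of memory, since it builds the full dense matrix (about 16,000 x 120,000 values) before sparsifying. A sketch of a chunked conversion that avoids one huge dense intermediate (the 10,000-column chunk size is an arbitrary choice):

library(Matrix)
cols   <- setdiff(names(tt), "V1")
chunks <- split(cols, ceiling(seq_along(cols) / 10000))  # groups of up to 10,000 column names
parts  <- lapply(chunks, function(cc) Matrix(as.matrix(tt[, cc, with = FALSE]), sparse = TRUE))
sparse.mat <- do.call(cbind, unname(parts))              # cbind of sparse matrices stays sparse
rownames(sparse.mat) <- tt$V1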

To be precise: the link above lets you download a tar archive. You will probably need to untar it, then open R and read the three files from that archive with any function meant for reading 10X CellRanger output, e.g. DropletUtils::read10xCounts():

sce <- DropletUtils::read10xCounts("GSM2781556", col.names = TRUE)  # returns a SingleCellExperiment
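
And if the end goal is a Seurat object rather than a SingleCellExperiment, Seurat's own 10X reader works on the same directory. A sketch assuming Seurat v3+ (the project name is arbitrary):

library(Seurat)
counts <- Read10X(data.dir = "GSM2781556")  # reads matrix.mtx, barcodes.tsv, genes.tsv
seu <- CreateSeuratObject(counts = counts, project = "GSM2781556")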

This worked perfectly and avoided the memory issues I was having! Thank you for the help!
