Question: DESeq2: unable to allocate memory error
2.4 years ago, elizabeth.h wrote:

I'm running DESeq2 on small RNA sequencing data. I constructed a .csv containing the raw count of each unique small RNA sequence across all my datasets. In the past I've successfully used DESeq2 on similar data, but this time my file is bigger: the CSV is >2 GB. I'm running into memory errors using 64-bit RStudio on a Windows machine with 64 GB of RAM.

This is all I'm trying to do right now:

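library(DESeq2)
# Raw counts: one row per unique small RNA sequence, one column per sample
sRNA <- read.csv("deseq_input_all.txt", header=TRUE, row.names=1)
# Sample annotation, including the 'group' column used in the design
coldata <- read.csv("deseq2/coldata.csv", header=TRUE, row.names=1)
dds <- DESeqDataSetFromMatrix(countData = sRNA, colData = coldata, design = ~ group)
dds <- DESeq(dds)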

However, at the DESeq() stage RStudio maxes out the memory and stops. Am I doing something silly with the code above, or is my data too big for this analysis? Would it be worthwhile to run it on a Linux machine, or to run R outside of RStudio?

Any tips or advice is appreciated.

EDIT: There are 34 million rows of data.

dim(sRNA)
> [1] 34760467       21
rna-seq R software error • 1.0k views
modified 2.4 years ago by andrew.j.skelton73 • written 2.4 years ago by elizabeth.h

I'd tend to agree with @Carlo Yague's first point: 2 GB of raw counts seems odd to me. Can you show the output of dim(sRNA)?

written 2.4 years ago by andrew.j.skelton73

Unless you have thousands of samples, the counts table should not be 2 GB.

written 2.4 years ago by igor

There are 34 million rows (unique sequences) in the count table.
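For a sense of scale, here is a rough back-of-the-envelope estimate of what a dense matrix with the dimensions from the question (34,760,467 rows by 21 columns) costs in memory. It assumes 4 bytes per integer and 8 bytes per double, which is how R stores them. As far as I understand, DESeq() also keeps several extra per-row, per-sample matrices alongside the counts, so the number of double-precision copies that fit in RAM is the figure to watch. The 64 GiB budget comes from the question; everything else is illustrative.

```r
# Dimensions reported by dim(sRNA) in the question.
n_rows <- 34760467
n_cols <- 21

# R stores integers in 4 bytes and doubles in 8 bytes per cell.
cells <- n_rows * n_cols
gib <- function(bytes) bytes / 1024^3

integer_gib <- gib(cells * 4)
double_gib  <- gib(cells * 8)

# How many dense double matrices of this shape fit in 64 GiB, ignoring
# row names, temporary copies and everything else in the R session.
max_double_copies <- floor(64 / double_gib)

cat(sprintf("integer matrix: %.2f GiB\n", integer_gib))
cat(sprintf("double matrix:  %.2f GiB\n", double_gib))
cat(sprintf("double copies that fit in 64 GiB: %d\n", as.integer(max_double_copies)))
```

Only a handful of working copies fit, which is consistent with the session running out of memory partway through DESeq().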

written 2.4 years ago by elizabeth.h

I have a feeling that you "counted" unique reads in a fastq file. That's not going to be useful for you. Align those to a genome, generate counts with featureCounts or htseq-count on the resulting alignments and then use the counts from that. You'll suddenly find that you only have a few tens of thousands of rows, which makes rather more biological sense.

written 2.4 years ago by Devon Ryan

Exactly. Just to clarify, the counts table should have samples as columns and genes as rows, so 20k-50k rows depending on annotation (for human/mouse).

written 2.4 years ago by igor

Correct. For small RNAs the number of "genes" might be a bit different, but that's the gist. BTW, if you're mostly interested in a single type of small RNA, there are dedicated programs for most of them (e.g., miRDeep).

written 2.4 years ago by Devon Ryan

2.4 years ago, Carlo Yague (Belgium) wrote:

I have two suggestions:

1) >2 GB is really big. Are you sure your data is what you think it is?

2) Before calling DESeq(), filter out rows with low expression to reduce the size of your dataset. For instance:

dds <- dds[ rowSums(counts(dds)) > 1, ]

See this tutorial for more information.
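With a table this large it may be cheaper to apply the same kind of filter to the raw count matrix before building the DESeqDataSet at all. The sketch below uses only base R on a small simulated matrix standing in for sRNA; the thresholds min_count and min_samples are made-up values for illustration, not recommendations.

```r
# Toy count matrix standing in for sRNA: rows are sequences, columns are samples.
set.seed(1)
counts <- matrix(rnbinom(6000, mu = 2, size = 0.5), ncol = 6,
                 dimnames = list(paste0("seq", 1:1000), paste0("s", 1:6)))

# Hypothetical thresholds: keep a sequence only if it has at least
# min_count reads in at least min_samples samples.
min_count   <- 10
min_samples <- 3

keep <- rowSums(counts >= min_count) >= min_samples
filtered <- counts[keep, , drop = FALSE]

cat(sprintf("kept %d of %d rows\n", nrow(filtered), nrow(counts)))
```

The filtered matrix can then be passed as countData to DESeqDataSetFromMatrix() in place of the full table.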

written 2.4 years ago by Carlo Yague

Thanks for that! I'll try it out.

This data represents all 34 million unique sequences present in at least one dataset - and it hasn't been filtered based on number of reads/count yet.

I've run this analysis in the past using the criterion that a sequence had to be present in at least 10 reads (count >= 10) to be included in the count table. Now I want to compare the size factors between the >=10-read and >=1-read count tables to make sure that they're roughly equivalent.
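One way to make that comparison without running the full pipeline is to compute median-of-ratios size factors directly on the two count tables. The sketch below is a base R re-implementation of that idea on simulated data, not a call into DESeq2 itself; the function name size_factors, the depth multipliers and the use of row totals for the thresholds are all assumptions for illustration. Rows containing a zero in any sample have a geometric mean of zero and drop out of the calculation, which, as far as I know, is also how DESeq2's default estimator behaves.

```r
# Median-of-ratios size factors, written out in base R for illustration.
size_factors <- function(counts) {
  log_counts <- log(counts)
  log_geo_means <- rowMeans(log_counts)
  # Rows with any zero count have a log geometric mean of -Inf and are skipped.
  usable <- is.finite(log_geo_means)
  stopifnot(any(usable))
  apply(log_counts[usable, , drop = FALSE], 2, function(col) {
    exp(median(col - log_geo_means[usable]))
  })
}

# Simulated data: a few abundant sequences plus many very sparse ones.
set.seed(42)
depth <- c(0.5, 0.8, 1, 1.2, 1.5, 2)  # hypothetical per-sample depth multipliers
abundant <- sapply(depth, function(d) rpois(200, lambda = 100 * d))
sparse   <- sapply(depth, function(d) rpois(5000, lambda = 0.2 * d))
counts <- rbind(abundant, sparse)

# Thresholds applied to the row total, mirroring the >=1 and >=10 tables.
sf_1  <- size_factors(counts[rowSums(counts) >= 1, , drop = FALSE])
sf_10 <- size_factors(counts[rowSums(counts) >= 10, , drop = FALSE])

print(round(rbind(sf_1, sf_10), 3))
```

Because the sparse rows almost never have a non-zero count in every sample, they contribute little or nothing to the estimate, so the two sets of size factors should come out close to each other in this kind of data.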

modified 2.4 years ago • written 2.4 years ago by elizabeth.h

2.4 years ago, andrew.j.skelton73 (London) wrote:

If you truly have 2 GB of integer count data, firstly I'd make sure that what you're counting is gene-like, and secondly, I'd use limma-voom, as DESeq2 won't scale up to large sample sizes very well.

edit: 21 samples... What exactly are the features you're counting?

modified 2.4 years ago • written 2.4 years ago by andrew.j.skelton73

I'm attempting to use DESeq2 to identify DE small non-coding RNAs (e.g. miRNAs) between samples. The samples are 3 biological replicates of 7 tissues/conditions.

From what I understand based on questions I've found on Biostars about DESeq2, it's possible to use miRNA count data (for example) to identify DE miRNAs. In this case, I'm running a count table with all small RNA sequences in the dataset, and then identifying DE miRNAs from that. The counts represent the number of reads for each unique sequence in the dataset.

The process works when I use a smaller dataset (where I filter by count size >=10) and produces reasonable-looking results. I wanted to test now whether I get similar results without filtering by count size - which is why my count table is so large in this case.

written 2.4 years ago by elizabeth.h

You first need to assign all reads to specific miRNAs.
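To show what that assignment does to the shape of the table, here is a minimal base R sketch that collapses per-sequence counts into per-miRNA counts with rowsum(). The assignment vector, the sequence IDs and the miRNA labels are all hypothetical; in practice the mapping would come from an alignment or annotation step.

```r
# Toy per-sequence count table: rows are unique sequences, columns are samples.
counts <- matrix(c(5, 0, 2,
                   3, 1, 0,
                   0, 4, 4,
                   7, 7, 1),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(c("seqA", "seqB", "seqC", "seqD"),
                                 c("s1", "s2", "s3")))

# Hypothetical sequence-to-miRNA assignment; NA marks an unassigned sequence.
assignment <- c(seqA = "miR-1", seqB = "miR-1", seqC = "miR-2", seqD = NA)

# Drop unassigned sequences, then sum the counts of sequences sharing a label.
labels <- assignment[rownames(counts)]
assigned <- !is.na(labels)
per_mirna <- rowsum(counts[assigned, , drop = FALSE], group = labels[assigned])

print(per_mirna)
```

The result has one row per miRNA rather than one row per unique sequence, which is the form the earlier replies describe for a count table.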

There are a few tools you can use to automate the process, such as:

modified 2.4 years ago • written 2.4 years ago by igor