Question: Single cell RNAseq - how many cells enough to determine DE?
5.9 years ago by
Genosa100 wrote:


I am new to this forum. I am interested in single-cell RNA-seq and would like to know what statistical method should be used to calculate the minimum number of cells that need to be sequenced per condition.

In a hypothetical experiment, a study needs to compare differential expression (DE) between a population of epithelial cells infected with papilloma virus and uninfected epithelial cells isolated from the same individual. Single cells are collected for analysis.

How many cells are needed? And how can that be calculated?

It seems that the published literature on single-cell RNA-seq does not give much detail on how sample sizes were derived, and the numbers used range from as few as 25 cells to a few hundred.

Would edgeR RNASeqPower be a suitable tool for this purpose?

Thank you 

ADD COMMENTlink modified 5.9 years ago by Israel Barrantes790 • written 5.9 years ago by Genosa100
5.9 years ago by
Devon Ryan97k
Freiburg, Germany
Devon Ryan97k wrote:

Your only options are things like RNASeqPower or Scotty. There's otherwise no a priori method to accurately do a power calculation (and even those tools are limited by how similar your experiment is to published ones).

Edit: BTW, published studies likely chose their sample sizes based on (1) budgetary constraints, (2) logistical constraints, and (3) a pilot study they never mentioned.

ADD COMMENTlink modified 5.9 years ago • written 5.9 years ago by Devon Ryan97k


I wrote Scotty and I think the part of it that estimates how deep to sequence will break if you try to put single cell data into it. (I think it's broken generally because the server did not make it onto the moving van to Utah when the lab moved, but that's a different story.)

The reason it will break is that Scotty assumes that sampling noise can be modeled with a Poisson distribution. Single cell data is usually low complexity and thus over-sequenced. The variance (dispersion) due to counting noise is usually higher than its mean, rather than equal to it as would be expected in a Poisson model (i.e. an x-y plot of two cells' expression is really wide at low counts). This is because a lot of molecules are lost in the selection of what to sequence.
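As a rough illustration of the Poisson point above: under a Poisson model the variance of a gene's counts equals its mean, so the variance-to-mean ratio should sit near 1. The sketch below computes that ratio for one gene across cells. The function name `variance_to_mean_ratio` and both count vectors are made up for illustration; they are not from Scotty or from any real dataset.

```python
from statistics import mean, variance


def variance_to_mean_ratio(counts):
    """Sample variance divided by mean for one gene's counts across cells.

    A value near 1 is what a Poisson model predicts; values well above 1
    indicate overdispersion.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; the ratio is undefined")
    return variance(counts) / m


# Hypothetical counts for a single gene in eight cells (illustration only).
near_poisson = [5, 14, 10, 7, 14, 9, 12, 9]
overdispersed = [0, 25, 3, 0, 41, 7, 0, 4]

print("near-Poisson ratio:", round(variance_to_mean_ratio(near_poisson), 2))
print("overdispersed ratio:", round(variance_to_mean_ratio(overdispersed), 2))
```

Both vectors have the same mean, so the difference in the ratio comes entirely from the spread. Running this on technical replicates or on many genes in a pilot dataset would give a feel for how far the data sit from the Poisson assumption.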

I haven't used edgeR RNASeqPower specifically, but edgeR uses a model similar to the one in Scotty, where Poisson and biological variance are broken apart, so I suspect it will break too.

What you can do is just do a traditional power calculation by hand. It is dead easy if you have some single cell data that you can use as a model. Normalize the data. Pick a gene you are interested in that you think is typical and has roughly 20 or more reads aligned to it. That gives you a set of expression values, one per cell. Calculate the mean and the variance of those numbers. Then plug them into a calculator like this one and see what you get:

"The calculations are the customary ones based on normal distributions. " which means they're not *exactly* the ones we used in Scotty because they're a little off for a low number of replicates.  But that will give you ballpark power if you sequence your data at that depth.  One of the findings in our Scotty paper (maybe the most important one other than that existing experiments are ridiculously underpowered) is that standard power analysis and stats from Intro Stats work just fine.

So do that for a few genes and you should get an idea of what your data will give you.
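For anyone who would rather script the by-hand calculation than use a web calculator, here is a minimal sketch of the customary normal-approximation version (two-sided, two groups of equal size, common variance). It is not the method used in Scotty or RNASeqPower, and the function names and example numbers are assumptions for illustration only.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def cells_per_group(mean_a, mean_b, variance, power=0.8, alpha=0.05):
    """Cells needed per condition to detect mean_a vs mean_b.

    Uses n = 2 * variance * (z_{1-alpha/2} + z_{power})^2 / delta^2,
    assuming the same variance in both conditions.
    """
    delta = mean_a - mean_b
    if delta == 0:
        raise ValueError("the two means must differ")
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_power = _Z.inv_cdf(power)
    return ceil(2 * variance * (z_alpha + z_power) ** 2 / delta ** 2)


def power_two_group(mean_a, mean_b, variance, n_per_group, alpha=0.05):
    """Approximate power of a two-sided z-test with n_per_group cells each."""
    standard_error = sqrt(2 * variance / n_per_group)
    shift = abs(mean_a - mean_b) / standard_error
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    # Probability of landing in either rejection tail under the alternative.
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)


# Hypothetical gene: normalized mean 40 in one condition, 80 in the other
# (a 2-fold change), with variance 1600 estimated from pilot data.
n = cells_per_group(40, 80, 1600)
print("cells per condition:", n)
print("power at that n:", round(power_two_group(40, 80, 1600, n), 3))
```

Because this uses the normal distribution rather than the t distribution, it will be slightly optimistic when the number of cells is small, which matches the caveat quoted above.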

It is also a very fast moving field with continuous development of new protocols for low input RNA Seq (much of it here at the good old Broad Technology Labs).  So the dispersion you see in published datasets is going to be more than you see in your data because whatever protocol you use is going to be better than what they were using a year ago.  

In general, for differential expression in regular (multi-cell) cell lines it takes about 8 replicates to detect most of your 2-fold changes. This is a ballpark estimate and wouldn't pass peer review, but that's what I've got. The single cell data I've seen is more dispersed than cell lines, so I can tell you that you need more than 8 cells per condition.

We have had good luck here with a general model of more cells with lower coverage, so you should try to balance the cost of your experiment that way.

One thing to note is that even if you have single cells ideally you should still have biological replicates.  That is, all of your single cells for each condition shouldn't come from the same biological sample.  You should have more than one infection.


ADD REPLYlink modified 5.9 years ago • written 5.9 years ago by Michele Busby2.1k

Is Scotty named after Montgomery Scott from the USS Enterprise?

ADD REPLYlink written 4.9 years ago by informatics bot640

We need more power!!!


ADD REPLYlink written 4.9 years ago by Michele Busby2.1k
5.9 years ago by
Israel Barrantes790 wrote:

There's also Monocle (from the Trapnell group), although I haven't tried it yet.

As for the replicates, I think that this question is not really a statistical one but a conceptual one. The whole paradigm of single cell studies is that each cell behaves differently, and that there are expression biases between cells because of this. So if you consider neighboring cells as replicates, you are somehow going against the individuality paradigm.

ADD COMMENTlink written 5.9 years ago by Israel Barrantes790

Yes and no. The theoretical question posed is nicely handled by single-cell sequencing (it has advantages over standard pooling for these purposes) and using the "every cell is a snowflake" paradigm wouldn't help in that case.

ADD REPLYlink written 5.9 years ago by Devon Ryan97k