Question

Single cell RNAseq - how many cells are enough to determine DE?

4

10.5 years ago

Genosa ▴ 160

Hello,

I am new to this forum. I am interested in single-cell RNA-seq and would like to know what statistical method should be used to calculate the minimum number of cells to sequence per condition.

In a hypothetical experiment, a study needs to compare DE in a population of epithelial cells infected with papilloma virus versus uninfected epithelial cells isolated from the same individual. Single cells are collected for analysis.

How many cells are needed? And how can that be calculated?

It seems that the published literature on single-cell RNA-seq does not give much detail on how sample sizes were derived, but they range from as few as 25 cells to a few hundred.

Would edgeR RNASeqPower be a suitable tool for this purpose?

Thank you

RNA-Seq single-cell-sequencing • 8.9k views

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Genosa ▴ 160

Ram · Answer 1 · 2015-01-08

5

10.5 years ago

Devon Ryan 105k

Your only options are things like RNASeqPower or Scotty. There's otherwise no a priori method to accurately do a power calculation (and even those tools are limited by how similar your experiment is to published ones).

Edit: BTW, published studies likely chose their sample sizes based on (1) budgetary constraints, (2) logistic constraints, and (3) a pilot study they never mentioned.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Devon Ryan 105k

2

Hi,

I wrote Scotty, and I think the part of it that estimates how deep to sequence will break if you try to put single cell data into it. (I think it's broken generally because the server did not make it onto the moving van to Utah when the lab moved, but that's a different story.)

The reason it will break is that Scotty assumes that sampling noise can be modeled with a Poisson distribution. Single cell data is usually low complexity and thus over-sequenced. The variance (dispersion) due to counting noise is usually higher than the mean, rather than equal to it as would be expected in a Poisson model (i.e. an x-y plot of two cells' expression is really wide at low counts). This is because a lot of molecules are lost in the selection of what to sequence.
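A quick way to see whether a gene looks Poisson-like is to compare the variance of its counts with the mean. This is only a rough sketch: the counts below are made up for illustration, `fano_factor` is a hypothetical helper name, and variance measured across cells mixes technical and biological noise, so it does not isolate counting noise on its own.

```python
from statistics import mean, variance

def fano_factor(counts):
    """Variance-to-mean ratio of one gene's counts across cells.

    Roughly 1 if the counts behave like Poisson noise,
    clearly above 1 if they are overdispersed.
    """
    m = mean(counts)
    if m == 0:
        raise ValueError("mean count is zero; ratio is undefined")
    return variance(counts) / m  # sample variance (n - 1 denominator)

# Hypothetical raw counts for one gene in ten cells of one condition.
gene_counts = [0, 3, 25, 1, 0, 40, 2, 7, 0, 18]
print(f"variance/mean = {fano_factor(gene_counts):.1f}")
```

A ratio far above 1 for most genes is the situation described above, where a Poisson-based tool would underestimate the noise.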

I haven't used edgeR RNASeqPower specifically, but edgeR uses a model similar to the one in Scotty, where Poisson and biological variance are broken apart, so I suspect it will break too.

What you can do is to just do a traditional power calculation by hand. It is dead easy if you have some single cell data that you can use as a model. Normalize the data. Pick a gene you are interested in that you think is typical and has >~20 or so reads aligned to it. Then you have a set of expression values. Calculate the mean expression and the variance of those numbers. Then plug them into a calculator like this one and see what you get:

http://www.stat.ubc.ca/~rollin/stats/ssize/n2.html

"The calculations are the customary ones based on normal distributions," which means they're not exactly the ones we used in Scotty, because the normal approximation is a little off for a low number of replicates. But that will give you ballpark power if you sequence your data at that depth. One of the findings in our Scotty paper (maybe the most important one, other than that existing experiments are ridiculously underpowered) is that standard power analysis and stats from Intro Stats work just fine.

So do that for a few genes and you should get an idea of what your data will give you.
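The by-hand calculation can be sketched in a few lines using the customary normal-approximation formula for comparing two means, n = 2 ((z_(1-alpha/2) + z_power) * sd / delta)^2 per group. This is an illustration, not what Scotty or the linked calculator runs: `cells_per_condition` is a hypothetical helper, the expression values are invented, and it assumes equal variance in both conditions and a two-sided test.

```python
from math import ceil
from statistics import NormalDist, mean, variance

def cells_per_condition(sd, delta, alpha=0.05, power=0.8):
    """Cells per condition needed to detect a difference `delta` in means.

    Normal approximation, two-sided test, equal variance assumed
    in both conditions.
    """
    z = NormalDist()  # standard normal
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_power = z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_power) * sd / delta) ** 2)

# Hypothetical normalized expression of one gene in eight control cells.
control = [22.0, 35.0, 8.0, 41.0, 15.0, 29.0, 3.0, 50.0]
m = mean(control)
sd = variance(control) ** 0.5

# A 2X fold change moves the mean from m to 2 * m, so delta = m.
print(cells_per_condition(sd, delta=m))
```

Running it for several genes with different means and variances gives a range of cell numbers rather than a single answer, which is the ballpark the answer above is after.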

It is also a very fast moving field with continuous development of new protocols for low input RNA Seq (much of it here at the good old Broad Technology Labs). So the dispersion you see in published datasets is going to be more than you see in your data because whatever protocol you use is going to be better than what they were using a year ago.

In general, for differential expression in regular (multi cell) cell lines it takes about 8 reps to get most of your 2X fold changes. This is a ballpark estimate and wouldn't pass peer review, but that's what I've got. The single cell data I've seen is more dispersed than cell lines. So I can tell you that you need more than 8 cells per condition.

We have had good luck here with a general model of more cells with lower coverage, so you should try to balance the cost of your experiment that way.

One thing to note is that even if you have single cells, ideally you should still have biological replicates. That is, all of your single cells for each condition shouldn't come from the same biological sample. You should have more than one infection.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Michele Busby ★ 2.2k

0

Is Scotty named after Montgomery Scott from the USS Enterprise?

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 9.5 years ago by informatics bot ▴ 760

1

We need more power!!!

Yes.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 9.5 years ago by Michele Busby ★ 2.2k

Ram · Answer 2 · 2015-01-08

0

10.5 years ago

Israel Barrantes ▴ 790

There's also Monocle (from the Trapnell group), although I haven't tried it yet.

As for the replicates, I think that this question is not really a statistical one but a conceptual one, given that the whole paradigm of single cell studies is that each cell behaves differently and there are expression biases between cells because of this. So, if you consider neighboring cells as replicates, you are somehow going against the individuality paradigm.

ADD COMMENT • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Israel Barrantes ▴ 790

0

Yes and no. The theoretical question posed is nicely handled by single-cell sequencing (it has advantages over standard pooling for these purposes) and using the "every cell is a snowflake" paradigm wouldn't help in that case.

ADD REPLY • link updated 3.3 years ago by Ram 45k • written 10.5 years ago by Devon Ryan 105k