Correct way of analyzing cell proportions in single-cell data
3.1 years ago

Hello

In Seurat there is a function to take the proportions of each cell identity, so you can easily plot them with ggplot2 or something similar. However, most scRNA-seq datasets I have seen (I mostly reanalyze data) have different sample sizes for each condition, so I suspect that just taking the proportions of cells might not be adequate. I believe you would need to normalize this. The first thing that comes to mind is dividing the number of cell identities by the number of conditions, but I guess that still doesn't make much sense, since samples within the same condition can also vary a lot in their cell identities. Here the authors plot it as the log2 of relative proportions, which I believe is a Z-score, but it still seems a bit odd to me, as they have different numbers of samples in each status.

I couldn't find any Seurat vignette addressing this. Any solutions? Does my concern make sense?

single-cell RNaseq • 8.4k views

Hi, this is a very important and helpful question. However, I am a little unsure why we can't just perform a standard Fisher's exact test or chi-square test here. Let us say I have 5 clusters in condition A and 5 clusters in condition B. Can't I just compare the proportion of cells in each cluster across conditions (even if the sample sizes are different) and ask whether the difference in proportions I observe is significant? Sorry if this question is too dumb. I would really appreciate any insights on this.

3.1 years ago

To compare cell proportions between conditions, I've found a Monte Carlo/permutation test to be the most sensible and robust approach. The null hypothesis you want to test against is that the difference in cell proportions for each cluster between conditions is just a consequence of randomly sampling some number of cells for sequencing in each condition. To generate this null distribution, you pool the cells from both samples together, and then randomly split the cells back into the two conditions while maintaining the original sample sizes. You then recalculate the proportional difference between the two conditions for each cluster, and compare that to the observed proportional difference for each cluster. I tend to take the log2 difference in proportions since it's a more sensible scale. Repeat this process about 10,000 times; the p-value is then the number of simulations where the simulated proportional difference was as extreme as or more extreme than the observed one (plus one), divided by the total number of simulations (plus one).
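The procedure described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code of the R library mentioned in this answer: the function names `log2_fd` and `permutation_test` are made up for this example, the input is simply one cluster label per cell for each condition, "as or more extreme" is taken to be two-sided, and the handling of clusters with zero cells is a choice made here.

```python
import math
import random
from collections import Counter


def log2_fd(count_a, n_a, count_b, n_b):
    """log2 of (proportion in B / proportion in A) for one cluster."""
    prop_a = count_a / n_a
    prop_b = count_b / n_b
    # Assumption: treatment of empty clusters is a choice made for this sketch.
    if prop_a == 0 and prop_b == 0:
        return 0.0
    if prop_a == 0:
        return math.inf
    if prop_b == 0:
        return -math.inf
    return math.log2(prop_b / prop_a)


def permutation_test(labels_a, labels_b, n_permutations=10_000, seed=0):
    """Return {cluster: (observed log2 difference, permutation p-value)}."""
    rng = random.Random(seed)
    n_a, n_b = len(labels_a), len(labels_b)
    pooled = list(labels_a) + list(labels_b)  # pool cells from both conditions
    totals = Counter(pooled)
    counts_a = Counter(labels_a)
    observed = {
        c: log2_fd(counts_a[c], n_a, totals[c] - counts_a[c], n_b) for c in totals
    }
    extreme = dict.fromkeys(totals, 0)
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        # The first n_a shuffled cells form the simulated condition A,
        # so the original sample sizes are maintained.
        sim_a = Counter(pooled[:n_a])
        for c in totals:
            sim = log2_fd(sim_a[c], n_a, totals[c] - sim_a[c], n_b)
            # Two-sided comparison (an assumption), with a small tolerance
            # so that ties are not lost to floating-point rounding.
            if abs(sim) >= abs(observed[c]) - 1e-12:
                extreme[c] += 1
    # (hits + 1) / (simulations + 1), as described in the text
    return {
        c: (observed[c], (extreme[c] + 1) / (n_permutations + 1)) for c in totals
    }


if __name__ == "__main__":
    # Toy data: 100 control cells and 200 treated cells, two clusters.
    control = ["T"] * 80 + ["B"] * 20
    treated = ["T"] * 100 + ["B"] * 100
    results = permutation_test(control, treated, n_permutations=2000)
    for cluster, (obs, p) in results.items():
        print(cluster, round(obs, 3), p)
```

Because the shuffle keeps the two group sizes fixed, unequal numbers of cells per condition are already built into the null distribution.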

Since I found myself having to do this so many times, I made a little R library for myself that takes a Seurat object and runs a permutation test for p-values (and adjusted p-values), and also generates a plot with the observed proportional difference and a bootstrapped confidence interval for each cluster.
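For readers unfamiliar with bootstrapped confidence intervals, here is a rough Python sketch of one common way to get one for the log2 difference in proportions. It is an assumption that this resembles what the library does: the sketch resamples cells with replacement within each condition and takes percentiles, and the name `bootstrap_ci` and its parameters are invented for this example.

```python
import math
import random


def bootstrap_ci(labels_a, labels_b, cluster, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for log2(prop in B / prop in A) of one cluster."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_boot):
        # Resample cells with replacement, separately within each condition.
        res_a = rng.choices(labels_a, k=len(labels_a))
        res_b = rng.choices(labels_b, k=len(labels_b))
        prop_a = res_a.count(cluster) / len(res_a)
        prop_b = res_b.count(cluster) / len(res_b)
        # Assumption: resamples where the cluster vanishes are skipped,
        # because the log2 ratio is undefined there.
        if prop_a == 0 or prop_b == 0:
            continue
        estimates.append(math.log2(prop_b / prop_a))
    if not estimates:
        raise ValueError("cluster absent from every bootstrap resample")
    estimates.sort()
    lower = estimates[int((alpha / 2) * len(estimates))]
    upper = estimates[min(len(estimates) - 1, int((1 - alpha / 2) * len(estimates)))]
    return lower, upper


if __name__ == "__main__":
    control = ["T"] * 80 + ["B"] * 20
    treated = ["T"] * 100 + ["B"] * 100
    print(bootstrap_ci(control, treated, "T"))
```

A narrow interval that stays clear of zero suggests the direction of the change is stable under resampling of cells; it says nothing about variation between biological replicates.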

https://github.com/rpolicastro/scProportionTest


Based, wish I could like your response 10 times. If you have an article to go with it, please let me know so I can cite it.


Very nice implementation!

How do you feel about the log2FD value? Is 0.58 the lowest value we could use? I know it corresponds to a fold change of 1.5, but it seems to me that this value can be kind of arbitrary sometimes. I have used your library on my data and I'm testing some obs_log2FD values.

Thanks a lot for posting it!! Cheers!


I'm glad you've found some use for it!

I personally use a Log2FC of 1, corresponding to a doubling of abundance. I know it's sort of arbitrary, but I consider this a big enough magnitude to likely be real and interesting. If you want to use a Log2FC of 0.58, I would also visually inspect the bootstrapped CI, since it represents your certainty about the FC value too.
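The two thresholds discussed here are just fold changes on a log2 scale, which a couple of lines of Python can confirm. The helper names below are hypothetical and exist only for this illustration.

```python
import math


def fold_change_to_log2(fold_change):
    """Convert a fold change (ratio of proportions) to the log2 scale."""
    return math.log2(fold_change)


def log2_to_fold_change(log2_value):
    """Convert a log2 value back to a plain fold change."""
    return 2.0 ** log2_value


if __name__ == "__main__":
    for fc in (1.5, 2.0):
        print(fc, round(fold_change_to_log2(fc), 2))
    for l2 in (1, -1):
        print(l2, log2_to_fold_change(l2))
```

On this scale increases and decreases are symmetric around zero: +1 is a doubling and -1 is a halving, whereas the raw ratios 2 and 0.5 are not equally far from 1.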


Hi @rpolicastro, to pick up on this question, I want to ask for a clarification. I did this analysis but I'm not sure whether the plot shows a significant difference of sample 1 compared to sample 2. In my case, the proportions of cell types by affected status are as below. But when I ran the permutation test, I expected to see something not that different. The result is as below. So, if the result is a comparison of sample 1 = ALS against sample 2 = control, is it true that most of the subpopulations are overrepresented in ALS, in contrast to the first plot? For example, in the first plot the OL population is overrepresented in ALS, but in the second it's not. I appreciate any help.


You should open your own question for this rather than trying to piggyback on another. You're more likely to get a response that way and it helps keep the site organized.

1. Why is the log2 scale more sensible? Is it just so that the numbers on the scale look nicer, or is there a more important reason?
2. By "log2 difference in proportions", do you mean the difference between the log2 of the mean proportion of Condition1 and the log2 of the mean proportion of Condition2?
3. Can one adapt the library to work with an object that is not from Seurat (e.g. a matrix of proportions for Control, and another matrix of proportions for Treatment)?