Question

scRNA-seq: How does cell number in clusters affect the number of DE genes?

12 months ago

cfa24357 • 0

Hi, I'm new to scRNA-seq and bioinformatics, and have some questions, which I presume might be rather basic but would hopefully help out others like me who are just starting out - couldn't find any past questions here on this!

Assume you have two groups, Treatment and Control, with the same 5 clusters found in both groups, and that the proportions of the different cell types are as expected in the system we're studying.

When performing DGE of Treatment vs Control within each cell cluster separately (e.g. Cluster 1: Treatment vs Control, Cluster 2: Treatment vs Control, etc.), would you expect to get more DEGs (i.e. a longer gene list) in the cluster that has more cells than in the clusters with fewer cells? In the system that I'm studying, one particular cell type (Cluster 1) makes up 70% of the cells and is also implicated in the disease that we are studying. However, more cells would mean higher power to detect DEGs, no? In that case, how does this affect the interpretation of the DEGs for Cluster 1 and for the other clusters?

And cell number aside, how can one confidently check in the scRNA-seq data that the longer list of DEGs in Cluster 1 is due to the biology, and not the cell number? In case it is helpful, we are using the Seurat FindMarkers function (default settings). Thank you!

scRNA-seq • 1.1k views

12 months ago by cfa24357 • 0

"In case it is helpful, we are using the Seurat FindMarkers function (default settings)."

Yes, that information is important for addressing your question. Seurat::FindMarkers by default uses the Wilcoxon Rank Sum test, which treats every cell as an independent observation, so in my experience the more cells you include in the test, the smaller your p-values tend to be, even when the underlying effect is the same.
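To make the point about cell number concrete, here is a minimal simulation sketch in plain Python. It is not Seurat code: the `rank_sum_p` and `simulate` helpers, the normally distributed "expression" values, and the 0.2 shift between groups are all assumptions chosen for illustration. The same small shift is tested at three group sizes with a rank-sum test (normal approximation, no tie correction).

```python
import math
import random


def rank_sum_p(x, y):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, assumes no ties)."""
    # Pool both groups, sort by value, and sum the ranks that belong to x.
    labelled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = sum(rank for rank, (_, group) in enumerate(labelled, start=1) if group == 0)
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # erfc(|z| / sqrt(2)) equals the two-sided normal tail probability.
    return math.erfc(abs(z) / math.sqrt(2))


def simulate(n_cells, shift=0.2, seed=0):
    """Hypothetical gene: identical small shift, only the number of cells changes."""
    rng = random.Random(seed)
    control = [rng.gauss(0.0, 1.0) for _ in range(n_cells)]
    treated = [rng.gauss(shift, 1.0) for _ in range(n_cells)]
    return rank_sum_p(control, treated)


p_by_size = {n: simulate(n) for n in (50, 500, 5000)}
for n, p in p_by_size.items():
    print(f"{n:>5} cells per group: p = {p:.3g}")
```

The effect size is fixed throughout, so any drop in p-value across the three runs comes from the number of cells alone. This is why a longer DEG list in a large cluster is hard to interpret on its own.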

You will want to devise DGE filtering criteria with this in mind. I also recommend using minimum percent thresholds, so that a gene is only considered differentially expressed in a cluster if at least a minimum fraction of the cells in that cluster express it.
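One way such filtering criteria could look is sketched below in plain Python. The column names (`p_val_adj`, `avg_log2FC`, `pct.1`, `pct.2`) follow the FindMarkers output table, but the `filter_degs` helper, the threshold values, and the three example rows are hypothetical and only illustrate the idea. In Seurat itself, FindMarkers exposes `min.pct` and `logfc.threshold` arguments for a similar purpose.

```python
def filter_degs(rows, max_padj=0.05, min_abs_log2fc=0.5, min_pct=0.10):
    """Keep genes that pass significance, effect-size, and expression-fraction cutoffs.

    All three thresholds are illustrative assumptions, not recommended values.
    """
    kept = []
    for row in rows:
        significant = row["p_val_adj"] <= max_padj
        large_enough = abs(row["avg_log2FC"]) >= min_abs_log2fc
        # Require the gene to be detected in enough cells of at least one group.
        expressed_enough = max(row["pct.1"], row["pct.2"]) >= min_pct
        if significant and large_enough and expressed_enough:
            kept.append(row["gene"])
    return kept


# Hypothetical rows, made up for this example.
example = [
    # Tiny p-value but negligible fold change: typical of a very large cluster.
    {"gene": "GeneA", "p_val_adj": 1e-30, "avg_log2FC": 0.05, "pct.1": 0.90, "pct.2": 0.88},
    # Passes all three criteria.
    {"gene": "GeneB", "p_val_adj": 1e-4, "avg_log2FC": 1.20, "pct.1": 0.45, "pct.2": 0.10},
    # Large fold change but detected in very few cells.
    {"gene": "GeneC", "p_val_adj": 1e-3, "avg_log2FC": 2.00, "pct.1": 0.03, "pct.2": 0.01},
]

kept_genes = filter_degs(example)
print(kept_genes)
```

Because the fold-change and percent-expressed cutoffs do not depend on the number of cells, they make DEG lists from clusters of very different sizes easier to compare than a p-value cutoff alone.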

12 months ago by jv ★ 1.8k

Thanks for taking the time to reply to my question! It's helpful to know that the Wilcoxon Rank Sum test may lead to exaggerated p-values; I will bear this in mind for future analyses. Would the DESeq2 test in Seurat::FindMarkers be a better option in this case?

12 months ago by cfa24357 • 0

I can't say one way or the other whether DESeq2 will be better. I recommend this paper, "Confronting false discoveries in single-cell differential expression", which suggests that if you have replicated samples, the best option is a pseudobulk approach.
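As a rough illustration of what "pseudobulk" means, the sketch below sums raw counts over all cells that share a sample and a cluster, so that each replicate contributes one profile per cluster. The `pseudobulk` helper, the sample names, and the toy counts are hypothetical, and the aggregated profiles would still need to be passed to a bulk RNA-seq method for the actual test.

```python
from collections import defaultdict


def pseudobulk(cells):
    """Sum raw counts per (sample, cluster) so each replicate yields one profile."""
    totals = defaultdict(lambda: defaultdict(int))
    for cell in cells:
        key = (cell["sample"], cell["cluster"])
        for gene, count in cell["counts"].items():
            totals[key][gene] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {key: dict(genes) for key, genes in totals.items()}


# Hypothetical cells, made up for this example.
cells = [
    {"sample": "ctrl_1", "cluster": 1, "counts": {"GeneA": 2, "GeneB": 0}},
    {"sample": "ctrl_1", "cluster": 1, "counts": {"GeneA": 1, "GeneB": 3}},
    {"sample": "trt_1", "cluster": 1, "counts": {"GeneA": 5, "GeneB": 1}},
    {"sample": "trt_1", "cluster": 2, "counts": {"GeneA": 0, "GeneB": 4}},
]

bulk = pseudobulk(cells)
for key in sorted(bulk):
    print(key, bulk[key])
```

After aggregation the unit of replication is the sample rather than the cell, so the sample size in each test is the number of replicates. A cluster with many cells gets more precise profiles, but it no longer gets a larger n.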

Of course, before you do differential expression analysis, you'll want to do a thorough quality control assessment of your cells as well as optimize your clustering. Perhaps there is further subclustering that can be done on this large cluster? It sounds like you expect 70% of your cells to be the same cell type, but might some of these cells be in different transcriptional states?

12 months ago by jv ★ 1.8k

Thanks! We are quite confident in the QC and clusters we have since it's a well-established system, and we initially wanted to get a big-picture overview of what's happening with the known cell types when treated, before deep-diving into how one group of cells might be responding heterogeneously. It's definitely in the plan to subcluster that large group to identify different transcriptional states, as you've suggested.

Yes, I did see that pseudobulk approaches are recommended for differential expression analysis. My initial approach was to try doing everything in Seurat first, using the default settings, before changing any of the steps to see how the results might differ. Thanks again for your help :)

12 months ago by cfa24357 • 0

score 1 · Answer 1 · 2023-04-27

12 months ago

Papyrus ★ 2.9k

You may want to look at the Augur R package, which was specifically developed to address this issue (that statistical power is influenced by the number of cells in each test) and to tell you which cell types are the most "affected" in your system irrespective of their numbers.