Dear colleagues,
I am working on a single-cell RNA-seq dataset derived from two experimental groups: a disease group and a healthy control group. Each group contains data from six individual mice, totaling twelve samples. My goal is to compare the cell-type compositions between these two conditions to identify shifts potentially associated with disease status.
I would greatly appreciate advice on how best to structure the analysis. Specifically, I am wondering:
Should I treat each mouse as an independent sample and perform differential composition analysis at the individual level?
Or is it acceptable to pool the cells from all six mice in each group to create two aggregated groups and compare them directly?
I am aware that pooling samples ignores biological variability and risks confounding the comparison with batch effects, but I'm also concerned about statistical power and the granularity of cell-type annotations when treating each mouse separately.
Has anyone faced a similar scenario, and could you recommend best practices or refer me to relevant literature or tools?
Thank you in advance for your guidance!
Best regards,
Zhang Chengwei
Agreed. The complete absence of biological replication (no donor hashing, no one-lane-per-donor design), the failure to use existing biological replication in statistical tests, and poor differential analysis methods (I am looking at you, Seurat::FindMarkers, with your oversimplified, non-group-aware Wilcoxon testing) are imo among the biggest flaws in scRNA-seq, if not the biggest, and almost every study is guilty of at least one of them. In our hands (unsurprisingly), hashing donors and then properly accounting for biological replication improves inference quality a lot. Often, just blocking on or regressing out the donor effect during feature selection already removes odd clustering landscapes and emphasizes cell-type rather than donor differences (also in mice, not just in heterogeneous human samples). Blindly using all cells per cluster/cell type/... can lead to a situation where outlier cells from one or a few donors push the mean/median gene expression into statistical significance. scRNA-seq papers are often so poorly analyzed that it's embarrassing.
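To make the replication point concrete for the composition question in the original post, here is a minimal Python sketch that treats each mouse as the unit of replication. It is only an illustration under stated assumptions: the cell counts are made up, the function names (proportions, exact_permutation_pvalue) are hypothetical helpers, and the exact permutation test on per-mouse proportions is a simple stand-in, not the method of any particular package. For a real analysis, a dedicated count-based model such as the one described in the OSCA chapter mentioned in this thread would normally be preferable.

```python
from itertools import combinations
from statistics import mean


def proportions(counts):
    """Convert per-mouse cell counts into per-mouse cell-type fractions.

    counts: dict mapping mouse ID -> dict mapping cell type -> number of cells.
    """
    out = {}
    for mouse, by_type in counts.items():
        total = sum(by_type.values())
        out[mouse] = {cell_type: n / total for cell_type, n in by_type.items()}
    return out


def exact_permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation test on the difference in group means.

    Each value is one mouse (e.g. its fraction of T cells), so the mouse,
    not the cell, is the unit of replication. With 6 vs 6 mice there are
    only 924 relabelings, so all of them can be enumerated.
    """
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    hits = 0
    n_perm = 0
    for idx in combinations(range(len(pooled)), n_a):
        a_sum = sum(pooled[i] for i in idx)
        diff = abs(a_sum / n_a - (total - a_sum) / n_b)
        # Small tolerance so floating-point noise does not drop exact ties.
        if diff >= observed - 1e-12:
            hits += 1
        n_perm += 1
    return hits / n_perm


# Made-up counts: one disease mouse (d6) has a large T-cell expansion,
# the other five look like the controls.
disease = {f"d{i}": {"T": 100, "B": 900} for i in range(1, 6)}
disease["d6"] = {"T": 4500, "B": 500}
control = {f"c{i}": {"T": 100, "B": 900} for i in range(1, 7)}

disease_t = [p["T"] for p in proportions(disease).values()]
control_t = [p["T"] for p in proportions(control).values()]

# Pooling all cells hides the fact that a single mouse drives the shift.
pooled_disease = sum(m["T"] for m in disease.values()) / sum(
    sum(m.values()) for m in disease.values()
)
pooled_control = sum(m["T"] for m in control.values()) / sum(
    sum(m.values()) for m in control.values()
)

print("pooled T-cell fraction, disease vs control:", pooled_disease, pooled_control)
print("per-mouse permutation p-value:", exact_permutation_pvalue(disease_t, control_t))
```

In this toy example the pooled fractions differ substantially, and any cell-level test on tens of thousands of pooled cells would call that difference significant, while the per-mouse test correctly shows that one outlier animal is not evidence for a group-level shift.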
Thank you very much for your helpful suggestion. Including the mouse of origin in the design formula makes a lot of sense, and I appreciate the emphasis on preserving biological variability while optimizing statistical power. The OSCA book chapter on differential abundance testing is an excellent reference—I’ll be sure to study it closely.
Thanks again for pointing me in the right direction!