Hello,
I'm facing a challenge with varying cell counts between my control and disease groups in a single-cell RNA sequencing experiment. Specifically, the control group has a higher number of cells than the disease group.
Here are the cell counts for each subgroup:
- HC1: 2059
- HC2: 468
- HC3: 3333
- Disease1: 428
- Disease2: 1610
- Disease3: 1189
My concern is that the larger number of control cells will dominate the clustering: cells may align to HC subpopulations, the number of clusters may increase, and the disease cells may end up split across more clusters. That would make differential expression (DE) analysis difficult, because there would be fewer cells per cluster.
To address this imbalance, I'm considering subsampling the control group to 1500 cells. However, I'm concerned about potential biases in clustering and downstream analyses.
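To make the idea concrete, here is a minimal sketch of the subsampling I have in mind. It rests on assumptions I have not settled yet: that 1500 is the total across the three HC samples (not 1500 per sample), and that each sample should keep its proportional share so no single donor is dropped or over-represented. The function names are just placeholders, and the indices are positions within each sample, not real barcodes.

```python
import numpy as np

# Cell counts per control sample, copied from the list above.
hc_counts = {"HC1": 2059, "HC2": 468, "HC3": 3333}
target_total = 1500  # assumption: total across all control samples


def proportional_allocation(counts, target):
    """Split `target` cells across samples in proportion to their sizes."""
    total = sum(counts.values())
    exact = {name: n * target / total for name, n in counts.items()}
    alloc = {name: int(np.floor(x)) for name, x in exact.items()}
    # Rounding down can leave a few cells unassigned; give them to the
    # samples with the largest fractional remainders.
    leftover = target - sum(alloc.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - alloc[name], reverse=True)
    for name in by_remainder[:leftover]:
        alloc[name] += 1
    return alloc


def subsample_indices(counts, alloc, seed=0):
    """Draw cell positions without replacement, separately for each sample."""
    rng = np.random.default_rng(seed)
    return {
        name: np.sort(rng.choice(counts[name], size=alloc[name], replace=False))
        for name in counts
    }


alloc = proportional_allocation(hc_counts, target_total)
picked = subsample_indices(hc_counts, alloc, seed=0)
for name in hc_counts:
    print(name, hc_counts[name], "->", len(picked[name]))
```

Changing `seed` gives a different random draw, so the same downstream steps could be repeated on several subsamples to see how much the results depend on which cells were kept.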
What methods should I use, or what should I consider, to evaluate the impact of subsampling on downstream analyses, particularly differential expression analysis with fewer cells per cluster?
What precautions should I take to ensure that subsampling the control group maintains an accurate representation of biological variability within healthy samples?
Thank you in advance!