I am working with scRNA-seq data from different conditions, and each condition has a very different number of cells, e.g. 1,000 vs. 2,000 vs. 5,000. Otherwise, it's the same experiment.
I am wondering whether it would be more prudent to downsample to the lowest number across replicates, in this case 1,000, and then run the various analyses with Seurat. The reason is that in the first group there simply weren't many cells after collection, while in the other conditions we just had more. On a separate occasion we had the same issue, but it was caused by problems with flow sorting the single cells.
Here, when I visualize the three groups via UMAP, the last one appears to have far more cells in every cluster, but that is just due to the different number of cells we started with.
What are your thoughts?
Thank you :)
-- EDIT: added extra info to address ATpoint's comment
In the image below you can see what I mean:
The green group is the one with ~5k cells, the red has ~2,500, and the blue ~1k. If I look at the abundance of immune cells across the groups, the green group will show the highest abundance simply because it started with a (much) higher number of cells.
If you now look at the next image:
The differences in abundance that you see now are not driven by the initial number of cells. I subsampled by randomly selecting the same number of cells from each group.
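To make the subsampling step concrete, here is a minimal sketch in plain Python (not the Seurat code I actually ran). The function name `downsample_per_group`, the barcode format, and the toy group sizes are made up for illustration; it just draws the same number of cells at random from each group, defaulting to the size of the smallest group.

```python
import random
from collections import defaultdict

def downsample_per_group(cell_groups, n_per_group=None, seed=0):
    """Randomly keep the same number of cells from every group.

    cell_groups: dict mapping a cell barcode to its group label (hypothetical format).
    n_per_group: cells to keep per group; defaults to the smallest group's size.
    Returns a dict mapping each group label to the list of kept barcodes.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Collect the barcodes belonging to each group.
    by_group = defaultdict(list)
    for cell, group in cell_groups.items():
        by_group[group].append(cell)

    if n_per_group is None:
        n_per_group = min(len(cells) for cells in by_group.values())

    # Sample without replacement; never ask for more cells than a group has.
    return {
        group: rng.sample(cells, min(n_per_group, len(cells)))
        for group, cells in by_group.items()
    }

# Toy data mirroring the group sizes in the post (barcodes are invented).
sizes = {"blue": 1000, "red": 2500, "green": 5000}
cell_groups = {f"{g}_{i}": g for g, n in sizes.items() for i in range(n)}

kept = downsample_per_group(cell_groups)
print({g: len(cells) for g, cells in kept.items()})
# {'blue': 1000, 'red': 1000, 'green': 1000}
```

The kept barcodes could then be used to subset the real object before plotting; how that is done depends on the toolkit.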
As far as gene expression goes, the results (for my experiment) will not be affected by the higher or lower number of cells, so I agree that downsampling would not be a good idea for that. However, I work with a lot of immunologists who want to know the frequency of cells positive for x, y, z compared across samples, so wouldn't downsampling to the same number be helpful in this case?
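To spell out what I mean by "frequency", here is a toy sketch in plain Python that tabulates, per sample, both the raw number of positive cells and that number as a fraction of the sample's total. The function name `marker_frequency` and the data are invented for illustration (every sample is constructed with 20% positive cells); this is not output from my experiment.

```python
from collections import Counter

def marker_frequency(cells):
    """Tabulate positive cells per sample.

    cells: iterable of (sample_label, is_positive) pairs (hypothetical format).
    Returns {sample: (n_positive, n_total, fraction_positive)}.
    """
    totals = Counter()
    positives = Counter()
    for sample, is_positive in cells:
        totals[sample] += 1
        if is_positive:
            positives[sample] += 1
    return {s: (positives[s], totals[s], positives[s] / totals[s]) for s in totals}

# Invented data: same 20% positive rate in each sample, different cell totals.
sizes = {"blue": 1000, "red": 2500, "green": 5000}
cells = [(s, i % 5 == 0) for s, n in sizes.items() for i in range(n)]

freq = marker_frequency(cells)
for sample, (n_pos, n_total, frac) in freq.items():
    print(sample, n_pos, n_total, f"{frac:.1%}")
# blue 200 1000 20.0%
# red 500 2500 20.0%
# green 1000 5000 20.0%
```

In this toy case the raw counts differ fivefold between blue and green while the fractions are identical, which is the distinction my question is about.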