Question

Downsampling one sample on the UMAP plot to match the number of cells in the other group

1


4.2 years ago

bioinforesearchquestions ▴ 370

Hi friends,

I have two groups, WT and KO, with roughly 950 cells in the WT sample and 1,700 cells in the KO sample, as I mentioned in this post: C: How many PCs should be considered for downstream analyses?

Steps followed for the analyses:

- Individual seurat object for WT and KO 
- Merged WT_KO 
- Filtered_WT_KO 
- Split_cond_WT_KO (Normalize, cell cycle scoring and SCTransform) 
- SelectIntegrationFeatures 
- PrepSCTIntegration 
- FindIntegrationAnchors and SCT Normalization 
- IntegrateData 
- Run PCA 
- FindNeighbors 
- FindClusters 
- Run UMAP

One of my collaborators wants to downsample so that WT and KO show equal numbers of cells on the UMAP plot, and also for comparing specific markers between the WT and KO samples.

Which approach is the appropriate one?

Approach 1: Get a list of 895 random KO barcode IDs from the 1,700 cells in the MTX file (Cell Ranger's matrix.mtx file), then follow the steps above for the 895 actual WT cells and the 895 randomly chosen KO cells.

Approach 2: Is it OK to just downsample the KO cells from 1,700 to 895 on the UMAP plot, without re-running all the prior steps, so that WT and KO each have only 895 cells on the plot?

What are the pros and cons of Approach 1 and Approach 2?

scRNAseq UMAP downsampling RNAseq • 6.2k views

ADD COMMENT • link updated 4.2 years ago by jared.andrews07 ★ 16k • written 4.2 years ago by bioinforesearchquestions ▴ 370

score 6 · Accepted Answer · 2020-02-20

6


4.2 years ago

jared.andrews07 ★ 16k

Approach 1 is a poor idea. Downsampling removes information, lowering the power of your differential expression analysis. This could result in marker genes being lost. If they just want to visualize them equally, I would just grab equal numbers of cells from each condition for visualization and feed them to the cells parameter for FeaturePlot or cells.use parameter for DimPlot.

Alternatively, you could plot differences using methods that show summary statistics for each group, like boxplots, violin plots, or ridgeplots. While the Seurat functions do okay with these, I prefer using dittoSeq, which allows for much greater customization and generally just looks better by default.

You don't need to re-run your entire analysis. Approach 2 is fine.
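A minimal sketch of the "grab equal numbers of cells from each condition" idea, using a mock metadata table in place of the real object. The data frame `meta`, its `sample` column, the barcode names, and the seed are assumptions for illustration; with a real Seurat object the barcodes would come from its metadata. The plotting calls at the end are left as comments because they assume Seurat v3 or later, where both DimPlot and FeaturePlot accept a `cells` argument, and `seurat_obj` and `GeneOfInterest` are placeholder names.

```r
# Mock per-cell metadata standing in for a Seurat object's metadata (hypothetical data).
set.seed(42)  # fixed seed so the same cells are drawn on every run
meta <- data.frame(
  sample = c(rep("WT", 895), rep("KO", 1700)),
  row.names = paste0("cell_", seq_len(895 + 1700))
)

# Split barcodes by condition and find the size of the smallest group.
barcodes_by_sample <- split(rownames(meta), meta$sample)
n_keep <- min(lengths(barcodes_by_sample))

# Draw n_keep barcodes from each condition without replacement.
cells_to_plot <- unlist(
  lapply(barcodes_by_sample, function(b) sample(b, n_keep)),
  use.names = FALSE
)

# Both conditions now contribute the same number of cells.
print(table(meta[cells_to_plot, "sample"]))

# With a real object (assumed Seurat >= 3; not run here):
# DimPlot(seurat_obj, reduction = "umap", cells = cells_to_plot, split.by = "sample")
# FeaturePlot(seurat_obj, features = "GeneOfInterest", cells = cells_to_plot)
```

Because only the plotted cells change, the clustering, the embedding, and any differential expression results still come from the full dataset.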

ADD COMMENT • link 4.2 years ago by jared.andrews07 ★ 16k

0


Yes, I tried Approach 2.

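# Column indices of the cells from each condition
WT_cells <- which(WT_KO_integrated_seurat$sample == 'WT')
KO_cells <- which(WT_KO_integrated_seurat$sample == 'KO')
# Randomly keep 895 KO cells to match the WT count (the draw changes on each run unless a seed is set)
downsampled_KO_cells <- sample(KO_cells, 895)
# Subset the integrated object to all WT cells plus the sampled KO cells
WT_KO_integrated_downsampled <- WT_KO_integrated_seurat[, c(WT_cells, downsampled_KO_cells)]
DimPlot(WT_KO_integrated_downsampled, reduction = "umap", split.by = "orig.ident", ncol = 2)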

I have downsampled 4 times and recorded the number of cells in each cluster for all 4 versions (V1, V2, V3 and V4).

1) The downsampled percentages of cells in WT and KO are more or less the same as the actual percentages of cells in WT and KO.

2) In each version, I have highlighted the KO cells for clusters 1, 4, 5, 6, and 7, where the downsampled KO count is lower than the WT count, even though before downsampling there were more KO cells than WT cells. For clusters 0, 2, and 3, however, the trend is preserved.

Is it normal?

ADD REPLY • link 4.2 years ago by bioinforesearchquestions ▴ 370

0


I mean, you're downsampling to a specific number of cells, so yes? Clusters 1, 4, 5, 6, 7 had closer to equal ratios of KO to WT cells, so losing cells from those groups means you're going to end up with fewer KO cells than WT cells.
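The arithmetic behind this can be sketched with made-up numbers. The per-cluster counts below are hypothetical, not the poster's data; only the totals (895 WT, 1,700 KO) come from the thread. Random downsampling keeps each KO cell with probability 895/1700, so a cluster ends up with fewer KO than WT cells, on average, whenever its original KO:WT ratio was below 1700/895, roughly 1.9.

```r
# Hypothetical per-cluster counts; the columns sum to the totals quoted in the thread.
counts <- data.frame(
  cluster = c("A", "B", "C"),
  WT = c(300, 400, 195),   # sums to 895
  KO = c(900, 500, 300)    # sums to 1700
)

# Fraction of KO cells that survives downsampling to the WT total.
keep_fraction <- sum(counts$WT) / sum(counts$KO)  # 895 / 1700

# Expected KO cells per cluster after random downsampling
# (the mean of a hypergeometric draw).
counts$KO_expected <- counts$KO * keep_fraction

# Original KO:WT ratio, and whether KO is expected to fall below WT afterwards.
counts$ratio_before <- counts$KO / counts$WT
counts$KO_below_WT_after <- counts$KO_expected < counts$WT

print(counts)
```

In this toy table, clusters B and C start with more KO than WT cells but with ratios below the threshold, so their expected KO counts drop under the WT counts, while cluster A (ratio 3) keeps its KO majority.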

ADD REPLY • link 4.2 years ago by jared.andrews07 ★ 16k

0


Yes, the collaborator requested an equal number of cells in both the WT and KO samples. What I initially expected was an equal number of cells in each condition within each cluster as well. The image below is based on dummy data that I entered manually.

But in the image above, the actual KO counts differ far more from the downsampled KO counts than the downsampled versions V1, V2, V3, and V4 differ from one another.

Based on these different downsampled versions, I am thinking of downsampling only selected clusters, such as clusters 0, 1, 2, and 3. Is there a way to do that?

You mentioned in your reply "If they just want to visualize them equally, I would just grab equal numbers of cells from each condition for visualization and feed them to the cells parameter for FeaturePlot or cells.use parameter for DimPlot."

Does the code I provided above carry out that task? Is there another way to do it?

ADD REPLY • link 4.2 years ago by bioinforesearchquestions ▴ 370

1


So your collaborator wants equal numbers of cells per cluster? Seems silly to me.

Presumably you've already done what I suggested by performing your manual subsetting: your UMAP DimPlot should only contain the subsetted cells. My suggestion was just to use the parameters of the plotting functions to indicate which cells from your full data set should be used. The two approaches should give the same result.

This still feels like a job for violin or box plots. Particularly with dittoSeq, where you can hide the individual points and just show the box or violin plot. Then you could use your full dataset without your collaborator complaining about unequal cell numbers (why exactly do they want this?).
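If per-cluster balancing is still wanted for plotting, it could be sketched along these lines. This is an illustration on mock metadata, not a recommendation: the data frame `meta`, its column names, the random cluster labels, and the helper `balance_clusters` are all hypothetical. Within each selected cluster, the larger condition is sampled down to the size of the smaller one; cells in the other clusters are kept as they are.

```r
set.seed(1)  # fixed seed for a reproducible draw

# Mock metadata: one row per cell with condition and cluster (hypothetical data).
n_cells <- 895 + 1700
meta <- data.frame(
  sample  = c(rep("WT", 895), rep("KO", 1700)),
  cluster = sample(as.character(0:7), n_cells, replace = TRUE),
  row.names = paste0("cell_", seq_len(n_cells))
)

# Hypothetical helper: equalise condition counts within the chosen clusters.
balance_clusters <- function(meta, clusters) {
  conditions <- unique(as.character(meta$sample))
  # Cells outside the chosen clusters are kept untouched.
  keep <- rownames(meta)[!meta$cluster %in% clusters]
  for (cl in clusters) {
    in_cluster <- rownames(meta)[meta$cluster == cl]
    # Fixed factor levels so a condition absent from a cluster counts as zero cells.
    grp <- factor(as.character(meta[in_cluster, "sample"]), levels = conditions)
    by_sample <- split(in_cluster, grp)
    n_keep <- min(lengths(by_sample))
    for (b in by_sample) {
      keep <- c(keep, b[sample.int(length(b), n_keep)])
    }
  }
  keep
}

cells_balanced <- balance_clusters(meta, clusters = c("0", "1", "2", "3"))
balanced_meta <- meta[cells_balanced, ]
print(table(balanced_meta$cluster, balanced_meta$sample))
```

The resulting barcodes could then be passed to a plotting function's `cells` argument (assuming Seurat v3 or later), leaving the full object untouched for any statistical comparison.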

ADD REPLY • link 4.2 years ago by jared.andrews07 ★ 16k