Question

How to handle datasets with different cell and read numbers per cell?

3

Entering edit mode

3.2 years ago

VolEr ▴ 40

Hi, I am a newcomer to single-cell data analysis, and I have a simple question that I believe is a piece of cake for you guys. Briefly, I have two single-cell expression datasets (10X platform) from WT and KO samples. CellRanger estimated that WT has 3000 cells, while KO has 1000 cells, and of course, they got different read numbers per cell as WT is 42K means reads, KO is 115K mean reads per cell. Currently, I am working on these data, and simply following Seurat's guide on their website. My question is how would you handle your data analysis starting from such datasets with 3K cells vs 1K cells? I see that people use very different ways to analyze their scRNA data such as SCT or Lognormalize, Wilcoxon or MAST, SplitObject before SCTransforming or not, etc. I don't know which one is the best practice, which way I can follow but in the end, I am trying to get reproducible results. So, I use Seurat's SCTransfrom assuming SCT removes batch effect (!) from the merged dataset as I have different cell and read numbers. Please see what code I use below, and leave a comment regarding the code if you have any suggestions on it. I would also be glad if you could share any code or GitHub link that I can follow in RStudio for an improved analysis because GitHub currently finds thousands of matches once you type Seurat, 10X data, or single-cell, etc. Best, Volkan

library(Seurat)
library(patchwork)
library(cowplot)
library(ggplot2)

WT.data <- Read10X(data.dir = "~/Desktop/A1-filtered_feature_bc_matrix")
KO.data <- Read10X(data.dir = "~/Desktop/A2-filtered_feature_bc_matrix")

WT <- CreateSeuratObject(counts = WT.data, project = "WT", min.cells = 3, min.features = 200)
WT[["percent.mt"]] <- PercentageFeatureSet(WT, pattern = "^mt-")
VlnPlot(WT, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
WT <- subset(WT, subset = nFeature_RNA > 200 & percent.mt < 5)

KO <- CreateSeuratObject(counts = KO.data, project = "KO", min.cells = 3, min.features = 200)
KO[["percent.mt"]] <- PercentageFeatureSet(KO, pattern = "^mt-")
VlnPlot(KO, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
KO <- subset(KO, subset = nFeature_RNA > 200 & percent.mt < 5)

sample.integrated <- merge(x = WT, y = KO, add.cell.ids = c("WT", "KO"), project = "MyProject")

all.list <- SplitObject(sample.integrated, split.by = "orig.ident")

for (i in names(all.list)) {
  all.list[[i]] <- SCTransform(all.list[[i]], verbose = FALSE)
}

sample.features <- SelectIntegrationFeatures(object.list = all.list, nfeatures = 4000)
sample.list <- PrepSCTIntegration(object.list = all.list, anchor.features = sample.features)
sample.anchors <- FindIntegrationAnchors(object.list = sample.list, normalization.method = "SCT", 
                                         anchor.features = sample.features)
sample.integrated <- IntegrateData(anchorset = sample.anchors, normalization.method = "SCT")
sample.integrated <- RunPCA(object = sample.integrated, verbose = FALSE)
sample.integrated <- RunUMAP(object = sample.integrated, dims = 1:30)
sample.integrated <- FindNeighbors(sample.integrated, reduction = "pca", dims = 1:20)
sample.integrated <- FindClusters(sample.integrated, resolution = 0.5)

p1 <- DimPlot(sample.integrated, reduction = "umap", group.by = "orig.ident", pt.size = 0.2, cols = c("blue", "pink"))
p2 <- DimPlot(sample.integrated, reduction = "umap", label = T, label.size = 4, repel = T, pt.size = 0.2)
plot_grid(p1, p2)

all.markers <- FindAllMarkers(sample.integrated, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
top10 <- all.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(sample.integrated, features = top10$gene) + NoLegend()

DefaultAssay(sample.integrated) <- "RNA"
Conserved_cluster0 <- FindConservedMarkers(sample.integrated, ident.1 = 0, only.pos = TRUE, logfc.threshold = 0.25, grouping.var = "orig.ident")

and so on.

scRNA-seq 10X data Seurat R SCTransform • 1.8k views

ADD COMMENT • link updated 3.2 years ago by rpolicastro 13k • written 3.2 years ago by VolEr ▴ 40

score 1 · Answer 1 · 2021-02-03

1

Entering edit mode

3.2 years ago

rpolicastro 13k

SCTransform plus integration is indeed the preferred method for joint analysis of different related samples with Seurat. After clustering you need to switch back to the RNA assay and run NormalizeData, and then use the counts from the 'data' slot for analysis like finding markers and FeaturePlots. The counts from SCTransform and Integration are designed for dimension reduction and clustering, but shouldn't be used beyond that.

ADD COMMENT • link 3.2 years ago by rpolicastro 13k

0

Entering edit mode

Thank you rpolicastro for your reply, I appreciate it. I saw several codes like what you said. But, I was confused, as its vignette says SCT replaces NormalizeData, ScaleData, and FindVariableFeatures. So that's why I was ignoring additional normalization. Please correct me if I am wrong, you suggest that I need to get back to RNA as default assay, then I need to run NormalizeData to perform further steps like finding markers, FeautrePlotting, or DEG analysis etc right. So, in this case, should I add this normalization step before FindAllMarkers, or before FindConservedMarkers?

I mean is this correct?

all.markers <- FindAllMarkers(sample.integrated, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
top10 <- all.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
DoHeatmap(sample.integrated, features = top10$gene) + NoLegend()

DefaultAssay(sample.integrated) <- "RNA"
sample.integrated <- NormalizeData(sample.integrated)
Conserved_cluster0 <- FindConservedMarkers(sample.integrated, ident.1 = 0, only.pos = TRUE, logfc.threshold = 0.25, grouping.var = "orig.ident")

Thanks a lot.

ADD REPLY • link 3.2 years ago by VolEr ▴ 40

0

Entering edit mode

You should add it before FindAllMarkers.

This question is addressed in the Seurat FAQs here about why you should use the counts from the RNA slot.

ADD REPLY • link 3.2 years ago by rpolicastro 13k

0

Entering edit mode

Thank you rpolicatsro, totally understand what you mean. I'll also reanalyze my data by applying the parameters mentioned on the GitHub link. Best, Volkan

ADD REPLY • link 3.2 years ago by VolEr ▴ 40