Question

Choice for statistical test (and R package)

0

Entering edit mode

13 months ago

psm ▴ 130

Hi all,

I have a dataset of single cell RNA sequencing (to be precise, single cell RNA profiling using the probe-based 10X RNA FLEX assay). I had cells that I split and treated identically, but one set got Interferon gamma (IFNg). I then collected the cells and processed for sequencing.

I've received the data and they clustered nicely as expected (they are stem cell-derived cells and as such, heterogeneous). The two sets of cells clustered nicely after Seurat integration, which representation in each cluster of cells from both conditions (+/- IFNg)

My question is which test would be ideal for comparing individual clusters for transcriptional differences as a result of the IFNg treatment? My understanding is that this is very different from Marker gene analysis (e.g. FindMarkers in Seurat) because with Marker analysis, transcriptional differences were already partially considered during the clustering, whereas here the assignment of IFNg treatment was in theory independent of a priori transcriptional differences. But I don't understand all the nuances here!

If anyone could provide any guidance on how I might go about choosing a test for this analysis (and perhaps a suitable package for this purpose) I would be most grateful!

Thanks in advance!

RNA-seq DGE • 1.0k views

ADD COMMENT • link 13 months ago by psm ▴ 130

score 3 · Accepted Answer · 2023-06-19

3

Entering edit mode

13 months ago

ATpoint 84k

It's essentially the same as marker detection. In scRNA-seq one usually processes data in a way (integration) that treatment effects do not confound the clustering, so the same cell (regardless of treatment) ends up in the same cluster, allowing a direct comparison. Once you have this, you can then compare the treatments per cluster. Standard DE tools such as limma can be used. I suggest to read work from Soneson 2018 (https://pubmed.ncbi.nlm.nih.gov/29481549/) on differential expression of scRNA-seq. I personally use limma-voom, with CPM prefiltering, requiring that at least one of two groups (here treatment) expresses a given gene with > 1 CPM in > 25% of cells. That is inspired by this article I linked and helps removing spurious DE calls, especially for genes with very many zeros where DE calls could come from few outliers. Marker detection is essentially a more strict version of that, so you do a lot of pairwise DE and then select genes that are consistently overexpressed in a given versus the rest of all clusters.

ADD COMMENT • link 13 months ago by ATpoint 84k

0

Entering edit mode

This was extremely helpful, thank you. Would you by any chance be able to share your workflow (or even some sample code) for applying Limma-voom to scRNA-seq datasets? Do you just do something like the following: -Obtain CPM for each barcode -Filter genes to retain those with >1CPM in >25% of cells in one group -Perform Limma-voom using the native function, treating each barcode as a "sample"

It would be really helpful if you could let me know what you might do differently!

ADD REPLY • link 13 months ago by psm ▴ 130

1

Entering edit mode

Here is how I personally (not saying it's perfect) tackle this:

suppressMessages({
  library(edgeR)
  library(scuttle)
  library(SingleCellExperiment)
})

# Example data
set.seed(1)
sce <- scuttle::mockSCE(ncells=1000, nspikes=0)
sce$group <- factor(rep(c("A", "B"), each=500))

# Normalize using scran, see:
# http://bioconductor.org/books/release/OSCA.basic/normalization.html#normalization-by-deconvolution
qq <- scran::quickCluster(x=sce, graph.fun="louvain", method="igraph")
sizeFactors(sce) <- scran::calculateSumFactors(sce, cluster=qq)

# Calculate CPM
assay(sce, "cpm") <- scuttle::calculateCPM(sce)

# Per gene, check how many % of cells per group express the gene with >= 1 CPM
min.cpm <- 1
dat <- (assay(sce, "cpm") >= min.cpm) * 1 
a <- base::rowsum(x=t(dat), group=sce$group)
b <- as.numeric(table(sce$group)[rownames(a)])

# This is now the percentage of cells per group that express a gene with more than min.cpm
pct <- round(100*t(a/b), digits=2)
head(pct, 3)
#>              A    B
#> Gene_0001 57.8 54.8
#> Gene_0002 72.8 69.8
#> Gene_0003 95.4 96.8

# Filter for genes that in at least one of the two groups express a gene with > min.pct %
min.pct <- 25
keep.genes <- apply(pct >= min.pct, 1, sum) > 0
keep.genes <- names(keep.genes[keep.genes])

# See how many genes we keep and discard
data.frame(keep=length(keep.genes), discarded=nrow(sce)-length(keep.genes))
#>   keep discarded
#> 1 1840       160

# Subset the sce
sce <- sce[keep.genes,]

# Test with limma, for this we convert the sce to a DGEList.
# The sizeFactors from the sce will appropriately be transferred to y
y <- scran::convertTo(sce, type="edgeR", assay.type="counts")
design <- model.matrix(~group, colData(sce))

# Run limma, using treat to test against a minimal fold change threshold to avoid
# significant but tiny logFCs, especially in scRNA-seq with many cells per group
v <- limma::voom(y, design=design)
fit <- limma::lmFit(object=v, design)
fit <- limma::treat(fit, fc=1.2, robust=TRUE)
tt  <- limma::topTreat(fit, coef=2, n=Inf)
head(tt)
#>                logFC  AveExpr         t    P.Value adj.P.Val
#> Gene_1691 -0.5877968 6.660682 -2.101950 0.01786284 0.9999855
#> Gene_0223 -0.5686722 3.506022 -1.707765 0.04394625 0.9999855
#> Gene_1777  0.5302078 2.210554  1.666917 0.04787128 0.9999855
#> Gene_0182  0.4982659 2.323004  1.440029 0.07503664 0.9999855
#> Gene_0447  0.5179333 4.052508  1.398060 0.08116166 0.9999855
#> Gene_1451  0.4814781 6.194318  1.375025 0.08466698 0.9999855
Created on 2023-06-21 with reprex v2.0.2