EdgeR for single cell differential expression analysis
2
0
Entering edit mode
19 months ago
kvshamsudheen ▴ 100

I am trying to perform a more faster way of differential analysis between KO and WT single cell data. I have intergated the data using Seurat pipeline and the differential expression using DESeq2 Seurat implementation. As I have about 4k cells in each clusters, it is taking forever. Meanwhile I found the edgeR and make it working based on consolidating many biostar discussions and the edgeR manual as below

## subset Seurat object for the cluster
seurat_object_cellfor_edgeR <- subset(seurat_object, subset = seurat_cluster == 0)

# count matrix
counts <- as.matrix(seurat_object_cellfor_edgeR@assays$RNA@counts) counts <- counts[Matrix::rowSums(counts >= 1) >= 1, ] # subset the meta data fro filtered gene/cells metadata <- seurat_object_cellfor_edgeR@meta.data metadata <- metadata[,c("nCount_RNA", "cell_info")] metadata <- metadata[colnames(counts),] ### Make single cell experiment sce <- SingleCellExperiment(assays = counts, colData = metadata) ## convert to edgR object dge <- convertTo(sce, type="edgeR", assay.type = 1) meta_dge <- dge$samples
meta_dge <- meta_dge[,c("lib.size","norm.factors")]
meta_dge <- cbind(meta_dge, metadata)
meta_dge$group <- factor(meta_dge$cell_info)
meta_dge$group <- relevel(meta_dge$group, "KO")
dge$samples <- meta_dge  model fit dge <- calcNormFactors(dge) design <- model.matrix(~0+group, data=dge$samples)
dge <- estimateDisp(dge, design = design)
fit <- glmQLFit(dge, design = design)


Differential expression test

qlf_as.is <- glmQLFTest(fit)
# top genes
headqlf_as.is$table) logFC logCPM F PValue AL627309.1 -15.82462 8.083615 1355354.8 0 AL669831.5 -15.42607 8.111567 444435.6 0 LINC00115 -15.76849 8.087425 576319.8 0 FAM41C -15.66446 8.094247 501766.5 0 AL645608.3 -15.76752 8.087299 591389.5 0 AL645608.1 -15.83204 8.083359 1661128.7 0  and found to be all the genes are differentially regulated, no FDR and all p value is zero. So I explored the contrast for the testing and that gave me much more reasonable and expected results (but the FC is too low!) my.contrasts <- makeContrasts(KO_vs_WT = groupKO-groupWT, levels=design) qlf.contrast <- glmQLFTest(fit, contrast=my.contrasts[,"KO_vs_WT"]) head(qlf.contrast$table)
logFC   logCPM          F    PValue
AL627309.1 -0.02402289 8.083615 36.0058677 0.3631278
AL669831.5 -0.16083392 8.111567  6.8890345 0.1892858
LINC00115   0.02784262 8.087425  2.5886375 0.6827443
FAM41C     -0.02251330 8.094247  0.4922016 0.8021113
AL645608.3 -0.02544709 8.087299  2.4978086 0.6936892
AL645608.1  0.03523809 8.083359 48.3970011 0.3426162


With this back ground my questions are

1. Which one of this method is correct? If none of them, how to perform the differential testing?.
2. Why the FDR is not present in any one of the model. Am I missing something?
3. Can I include mitochondria read percentage as a covariant and is it require to include UMI count for the model?

Any help is highly appreciated!.

edgeR Differential Expression Single-Cell • 3.1k views
0
Entering edit mode

Can you post a snippet of what your design looks like? and also my.contrasts. I think without knowing what your design/test is, it's hard to get an understanding.

For starts, the glmQLFTest requires you to specify either a coef or a contrast.

eg: glmQLFTest(fit, contrast=my.contrasts) (there's no need to subset the my.contrasts object)

I think this is the reason why no fdr/pvalue is given. edgeR is unsure what to test.

Final point, can you try topTags function to view the results instead of head? Not sure if it will make a difference but this is the method I always use (and is noted in the manual)

0
Entering edit mode

Thanks. Yes, the topTags gives the FDR, but I am not sure why I am getting too low fold change

head(design)
groupKO             groupWT
AAACCCAAGCGTCTCG                      1                   0
AAACCCACAAGTTCCA                      1                   0
AAACCCACAAGTTCGT                      1                   0
AAACCCACAGCACGAA                      1                   0
AAACGAAAGACTGTTC                      1                   0
AAACGAAAGGTCACTT                      1                   0
my.contrasts
Contrasts
Levels            KO_vs_WT
groupWT          -1
groupKO           1

1
Entering edit mode

Thank you. Your design and contrasts are fine and you should perform the glmQLFTest as described above. Regarding your question about FC, not 100% sure sorry it's outside of my expertise. You should read these to get a better understanding: https://support.bioconductor.org/p/108762/ and https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/predFC

I take the logFC values produced by edgeR at face value, that is if you average the counts and take the difference between groups (and factor in lib differences). Of course as noted in by Prof Smyth, it's much more complicated than simply adding and dividing. In other words, you have small logFC values because the expression differences between those genes is small. Don't get fixated on the logFC values of genes. You should consider the overall "picture" of the expression profile between your groups. A small fold change can still have a remarkable effect on the biological system you're studying.

4
Entering edit mode
19 months ago
kvshamsudheen ▴ 100

For those who are still looking for the right answer, can have a look at these scripts used for the Soneson and Robinson 2018, Nat. Meth: PMID: 29481549 paper for comparing different DE methods for single cell RNA data set. Both of them are a great start for the same. Cheers!.

0
Entering edit mode

Hello,

Helpful post. I have a question hopefully you can help. I am comparing differentially expressed genes (DEGs) between WT and KO. When I want to compare DEGs between conditions specific to particular cluster I use Findcluster function of Seurat tool.

Now I wanted to compare overall DEGs between WT and KO and NOT specific to any cluster. Could you let me know how can I accomplish in Seurat or using edgeR. Thank you,

3
Entering edit mode
9 months ago
ATpoint 50k

There was a recent preprint introducing a new tool https://github.com/const-ae/glmGamPoi making the dispersion estimation and model fitting to single-cell data notable faster compared to the current implementations of edgeR / DESeq2. It also implements a quasi-likelihood ratio test very similar to what edgeR does. Might be worth considering as a replacement for edgeR/DESeq2 when working on single-cell since these tools are quite slow due to the large number of cells that have to be considered.