I am trying to perform a more faster way of differential analysis between KO and WT single cell data. I have intergated the data using Seurat pipeline and the differential expression using DESeq2 Seurat implementation. As I have about 4k cells in each clusters, it is taking forever. Meanwhile I found the edgeR and make it working based on consolidating many biostar discussions and the edgeR manual as below
## subset Seurat object for the cluster     
seurat_object_cellfor_edgeR <- subset(seurat_object, subset = seurat_cluster == 0)
# count matrix 
counts <- as.matrix(seurat_object_cellfor_edgeR@assays$RNA@counts)
counts <- counts[Matrix::rowSums(counts >= 1) >= 1, ]
# subset the meta data fro filtered gene/cells
metadata <- seurat_object_cellfor_edgeR@meta.data
metadata <- metadata[,c("nCount_RNA", "cell_info")]
metadata <- metadata[colnames(counts),]
### Make single cell experiment
sce <- SingleCellExperiment(assays = counts, 
                                    colData = metadata)
## convert to edgR object
dge <- convertTo(sce, type="edgeR", assay.type = 1)
meta_dge <- dge$samples
meta_dge <- meta_dge[,c("lib.size","norm.factors")]
meta_dge <- cbind(meta_dge, metadata)
meta_dge$group <- factor(meta_dge$cell_info)
meta_dge$group <- relevel(meta_dge$group, "KO")
dge$samples <- meta_dge
model fit
dge <- calcNormFactors(dge)
design <- model.matrix(~0+group, data=dge$samples)
dge <- estimateDisp(dge, design = design)    
fit <- glmQLFit(dge, design = design)
Differential expression test
qlf_as.is <- glmQLFTest(fit)
# top genes
 headqlf_as.is$table)
               logFC   logCPM         F PValue
AL627309.1 -15.82462 8.083615 1355354.8      0
AL669831.5 -15.42607 8.111567  444435.6      0
LINC00115  -15.76849 8.087425  576319.8      0
FAM41C     -15.66446 8.094247  501766.5      0
AL645608.3 -15.76752 8.087299  591389.5      0
AL645608.1 -15.83204 8.083359 1661128.7      0
and found to be all the genes are differentially regulated, no FDR and all p value is zero. So I explored the contrast for the testing and that gave me much more reasonable and expected results (but the FC is too low!)
my.contrasts <- makeContrasts(KO_vs_WT = groupKO-groupWT, levels=design)
qlf.contrast <- glmQLFTest(fit, contrast=my.contrasts[,"KO_vs_WT"])
head(qlf.contrast$table)
                 logFC   logCPM          F    PValue
AL627309.1 -0.02402289 8.083615 36.0058677 0.3631278
AL669831.5 -0.16083392 8.111567  6.8890345 0.1892858
LINC00115   0.02784262 8.087425  2.5886375 0.6827443
FAM41C     -0.02251330 8.094247  0.4922016 0.8021113
AL645608.3 -0.02544709 8.087299  2.4978086 0.6936892
AL645608.1  0.03523809 8.083359 48.3970011 0.3426162
With this back ground my questions are
- Which one of this method is correct? If none of them, how to perform the differential testing?.
 - Why the FDR is not present in any one of the model. Am I missing something?
 - Can I include mitochondria read percentage as a covariant and is it require to include UMI count for the model?
 
Any help is highly appreciated!.
Can you post a snippet of what your
designlooks like? and alsomy.contrasts. I think without knowing what your design/test is, it's hard to get an understanding.For starts, the
glmQLFTestrequires you to specify either acoefor acontrast.eg:
glmQLFTest(fit, contrast=my.contrasts)(there's no need to subset themy.contrastsobject)I think this is the reason why no fdr/pvalue is given. edgeR is unsure what to test.
Final point, can you try
topTagsfunction to view the results instead of head? Not sure if it will make a difference but this is the method I always use (and is noted in the manual)Thanks. Yes, the
topTagsgives the FDR, but I am not sure why I am getting too low fold changeThank you. Your design and contrasts are fine and you should perform the glmQLFTest as described above. Regarding your question about FC, not 100% sure sorry it's outside of my expertise. You should read these to get a better understanding: https://support.bioconductor.org/p/108762/ and https://www.rdocumentation.org/packages/edgeR/versions/3.14.0/topics/predFC
I take the logFC values produced by edgeR at face value, that is if you average the counts and take the difference between groups (and factor in lib differences). Of course as noted in by Prof Smyth, it's much more complicated than simply adding and dividing. In other words, you have small logFC values because the expression differences between those genes is small. Don't get fixated on the logFC values of genes. You should consider the overall "picture" of the expression profile between your groups. A small fold change can still have a remarkable effect on the biological system you're studying.