Forum: logFC on geometric means vs arithmetic means. Presto vs Seurat
8 weeks ago
e.r.zakiev

We take the logFC calculations in our favourite packages for granted, but it actually isn't so straightforward. Let's look at Seurat's way of calculating logFC:

> Seurat:::FoldChange.default
function (object, cells.1, cells.2, mean.fxn, fc.name, features = NULL, 
    ...) 
{
    features <- features %||% rownames(x = object)
    thresh.min <- 0
    pct.1 <- round(x = rowSums(x = object[features, cells.1, 
        drop = FALSE] > thresh.min)/length(x = cells.1), digits = 3)
    pct.2 <- round(x = rowSums(x = object[features, cells.2, 
        drop = FALSE] > thresh.min)/length(x = cells.2), digits = 3)
    data.1 <- mean.fxn(object[features, cells.1, drop = FALSE])
    data.2 <- mean.fxn(object[features, cells.2, drop = FALSE])
    fc <- (data.1 - data.2)
    fc.results <- as.data.frame(x = cbind(fc, pct.1, pct.2))
    colnames(fc.results) <- c(fc.name, "pct.1", "pct.2")
    return(fc.results)
}

where mean.fxn defaults to the following (pseudocount.use, default 1, and base, default 2, come from the enclosing FoldChange()/FindMarkers() call):

default.mean.fxn <- function(x) {
  # arithmetic mean of the un-logged expression, plus a pseudocount, re-logged
  return(log(x = (rowSums(x = expm1(x = x)) + pseudocount.use)/NCOL(x), base = base))
}

So Seurat computes the log of a ratio of arithmetic means (taken on the un-logged expression), unlike presto::wilcoxauc, which effectively uses geometric means.

In presto the logFC is calculated as follows:

# group_means: per-group means of log-expression (genes x groups)
# cs: per-gene sums of log-expression over all cells; gs: cells per group
lfc <- Reduce(cbind, lapply(seq_len(length(levels(y))), function(g) {
  # mean log-expression in group g minus mean log-expression in all other cells
  group_means[, g] - ((cs - group_sums[g, ]) / (length(y) - gs[g]))
}))

So that is a difference of mean log-expression values, i.e. the log of a ratio of geometric means. According to Lötsch et al. (2024), this is better suited for scRNA-seq data: the arithmetic mean leads to inaccuracies when subgroup variances or underlying distributions differ, while the geometric mean or median yields more robust estimates in heterogeneous data.
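To see how much the two conventions can disagree, here is a toy comparison (my own sketch, not Seurat or presto code) on log1p-normalized values for a single gene, where group 1 contains one outlier cell:

```r
x1 <- log1p(c(1, 1, 2, 50))  # group 1 (log1p-normalized), one outlier cell
x2 <- log1p(c(1, 2, 2, 3))   # group 2

# Seurat-style: log2 of the ratio of arithmetic means of the un-logged values
# (with a pseudocount of 1, as in default.mean.fxn above)
lfc_arith <- log2(mean(expm1(x1)) + 1) - log2(mean(expm1(x2)) + 1)

# presto-style: difference of mean log values, i.e. log of the ratio of
# geometric means (converted from natural log to log2 for comparability)
lfc_geom <- (mean(x1) - mean(x2)) / log(2)

round(c(arithmetic = lfc_arith, geometric = lfc_geom), 2)
```

The arithmetic-mean version (~2.3) is dominated by the single outlier cell, while the geometric-mean version (~0.8) is not, which is exactly the robustness argument above.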

So should we just conclude that presto's implementation is better and that Seurat should switch to it too?


I personally think that the DE machinery in Seurat is entirely terrible and oversimplified. What it does by default is test a given group of cells (e.g. a cluster) against all other cells pooled together, regardless of which cluster those other cells belong to. There is also no way to test against a fold-change threshold, so you end up with a lot of genes with small effect sizes but statistical significance. That might in some cases be OK for markers alone, where you essentially get the top genes per cluster, but for a proper pairwise analysis it's imo largely useless.

The thing here is also that the Wilcoxon test purely focuses on rank differences; it does not test the fold changes. Still, most people will filter on the fold change, which is not robust: fold changes in scRNA-seq often have large variances due to the spread of expression values and the presence of zeros. If you use a method like limma and let it return confidence intervals, you will see that impressively. Or just plot single-cell expression of a couple of genes to visualize it easily. Anyway, I think it's basic data science knowledge that especially in the presence of large variances you should go geometric, not arithmetic.

I personally always use limma for any DE testing. Pseudobulk is almost always better than testing at the single-cell level, but the experimental design must enable it -- most studies unfortunately do not care about biological replication. With limma you can relatively easily do one-versus-average testing, e.g. A - (B+C+D)/3, to get the genes that are overexpressed in A versus the average of the rest, but in a group-aware way, unlike what Seurat does.
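The one-versus-average contrast described above can be sketched with limma roughly like this (assuming limma is installed; `expr` and `group` are made-up toy data, not from any real study):

```r
library(limma)

set.seed(1)
# toy pseudobulk log-expression matrix: 100 genes x 12 samples, 3 per group
expr  <- matrix(rnorm(1200), nrow = 100,
                dimnames = list(paste0("gene", 1:100), NULL))
group <- factor(rep(c("A", "B", "C", "D"), each = 3))

design <- model.matrix(~ 0 + group)
colnames(design) <- levels(group)

# A versus the average of the other groups, group-aware
cm <- makeContrasts(A_vs_rest = A - (B + C + D) / 3, levels = design)

fit <- eBayes(contrasts.fit(lmFit(expr, design), cm))
topTable(fit, coef = "A_vs_rest", number = 5)
```

Unlike Seurat's pooled one-vs-rest, each group contributes equally to the "rest" here no matter how many cells (or samples) it has.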

In a nutshell, I would prefer geometric-mean-based logFCs if you have to choose between your two options, but I would actually rather use limma to get them properly moderated, and with confidence intervals. limma also allows testing against a fold-change threshold, with its treat() function, so you can protect against genes with tiny effect sizes but statistical significance. Unfortunately, neither Seurat nor presto (to my knowledge) supports this. Post-hoc filtering on logFC is not good, because oftentimes there is evidence that the logFC between groups is not 0 (the default test), but there is little evidence that it is e.g. > 1.1, because of all the noise. Such testing, in my hands, protects against a lot of spurious DEGs. Also essential: proper prefiltering; see e.g. the work from Soneson et al. on single-cell DE testing (Nature Methods, a few years ago). End of my rant, thank you. Don't drag this to an answer, it's just an "I had no coffee" rant.
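The treat() idea mentioned above replaces the point null (logFC = 0) with a threshold test; a minimal self-contained sketch (again assuming limma and toy data):

```r
library(limma)

set.seed(2)
expr  <- matrix(rnorm(1200), nrow = 100)   # toy: 100 genes x 12 samples
group <- factor(rep(c("A", "B"), each = 6))
design <- model.matrix(~ group)

fit  <- lmFit(expr, design)
# test H0: |logFC| <= log2(1.2) instead of H0: logFC == 0
tfit <- treat(fit, lfc = log2(1.2))
topTreat(tfit, coef = 2, number = 5)
```

Genes whose fold change cannot be shown to exceed the threshold are no longer called significant, regardless of how small their p-value against 0 would have been.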


To play the devil's advocate, the Seurat functions are called FindMarkers and FindAllMarkers: they characterize the most representative marker genes of a given cluster versus another cluster, or against all other clusters. However, the logFC in their calculation is highly biased and should be seen more as a confidence score than as a strict logFC, and the documentation should mention that clearly.

I usually use the Seurat functions to find marker genes and plot their expression on a UMAP, but for DEGs a better way to go is pseudobulk with common RNA-seq DE tools like limma/DESeq2/edgeR.


"should be seen more as a confidence score rather than a strict logFC"

I would exactly not see it as a confidence score, as they do not provide variances or confidence intervals. The value can be gigantically large, but that does not increase confidence if it is driven by variance and noise.


In the latest version of Seurat, presto is used by default (if it is found in the environment) for the logFC calculation in e.g. FindMarkers():

"In Seurat v5, we use the presto package (as described here and available for installation here), to dramatically improve the speed of DE analysis, particularly for large datasets"

https://satijalab.org/seurat/articles/pbmc3k_tutorial.html


Seurat's default behavior for Find(All)Markers is indeed to use presto whenever available, as you suggest. However it only uses presto to calculate the p-value (and even then it adjusts it using Bonferroni correction, unlike the Benjamini-Hochberg correction in presto itself): https://github.com/satijalab/seurat/blob/e44cb2ce21aff3cbd94c2ddf24f9a1db6745e121/R/differential_expression.R#L2502C3-L2502C64
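The two corrections can differ quite a bit in practice; base R's p.adjust() shows it directly (toy p-values, my own illustration):

```r
p <- c(0.001, 0.01, 0.02, 0.04, 0.5)

p.adjust(p, method = "bonferroni")  # Seurat's adjustment: p * n, capped at 1
p.adjust(p, method = "BH")          # presto's adjustment: Benjamini-Hochberg FDR
```

Bonferroni is always at least as conservative as BH, so for the same presto p-values Seurat will tend to report fewer significant genes.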

The logFC, in the meantime, is calculated independently by Seurat itself: https://github.com/satijalab/seurat/blob/e44cb2ce21aff3cbd94c2ddf24f9a1db6745e121/R/differential_expression.R#L787C3-L797C4


So they want to stick with their own way of calculating the logFC, I guess.

I have already opened an "issue" (more of a discussion) about their use of pseudocounts, which was distorting the logFC results. They argued that it reduces the number of DE genes with low expression values.
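That pseudocount effect is easy to see on a simplified toy example (adding the pseudocount to the mean rather than to the sum, for clarity): the same 4-fold ratio gives a very different logFC once a pseudocount of 1 is added, and the shrinkage hits low-expression genes hardest:

```r
# same 4x ratio of mean expression, with a pseudocount of 1
log2((0.4 + 1) / (0.1 + 1))  # lowly expressed gene: ~0.35 instead of 2
log2((400 + 1) / (100 + 1))  # highly expressed gene: ~1.99, barely affected
```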

