Question

scRNAseq Differential expression analysis

0

7 months ago

MVJ ▴ 10

Hello everyone!

I am a student who recently started working with transcriptomics data.

I am trying to conduct my first single-cell data analysis of an organoid model, mainly using Seurat. I ran a differential expression analysis between different clusters using the FindMarkers() function, but I have some questions.

I would like to understand which test would be best to use: MAST or the Wilcoxon rank-sum test. I looked at recently published scRNA-seq analyses, and both are commonly used.

I would also like to ask you how you approach the selection of the top DEGs, namely what are the best p_value_adjust and avg_logFC thresholds to use.

I am really new to the field and I would really appreciate any kind of insight.

scRNA-seq Differential-Expression Seurat • 1.4k views

ADD COMMENT • link updated 7 months ago by ATpoint 82k • written 7 months ago by MVJ ▴ 10

3

My personal pessimistic opinion: Just do whatever you want. Differential gene expression in scRNA-seq is broken anyway.

Edited (because people felt the need to insult+flame the original person I cited):

The results of different methods applied to the same scRNA-seq data differ substantially.

This is true even for fold changes, as shown below for Seurat and Scanpy.

The differences between selected transcript "markers" are even larger: https://t.co/pH4Rh3wQZv via @davisjmcc pic.twitter.com/dcSkeDOhBf
— Prof. Nikolai Slavov (@slavov_n) October 18, 2022

ADD REPLY • link 7 months ago by dsull ★ 5.9k

2

EDIT: I was skeptical because I am a little biased against Lior. The original claim of discrepancy in results does look concerning and warrants further research.

ADD REPLY • link 7 months ago by Ram 43k

1

He (or rather, the person whom he cited -- Nikolai Slavov) has identified a major problem with differential gene expression mathematics -- there's nothing to believe or not believe or to feel; the programs+libraries give starkly discordant results. There's no good solution to these problems -- just as there is no perfect solution to single-cell RNAseq normalization. Single-cell RNAseq has many major problems without solutions. Please don't disregard an important result (substantiated both before+after by other scientists) just because it's from someone whom you have a negative opinion of. There's no reason to cast unnecessary shade.

ADD REPLY • link 7 months ago by dsull ★ 5.9k

1

I try contributing to biostars (I'm on here every day trying to assist members of the community), helping others, citing important results, and then people see it fit to attack+flame my Ph.D. advisor and get upvoted.

If you'd like, I'd be happy to edit my post to cite Nikolai Slavov or Mark Sanborn instead who made the same observations (I cited Lior because he provided mathematical details).

I was trying to help the original poster, who is a student like myself, by pointing out discrepancies in the field and how it probably doesn't matter what method you use for extracting the biological signal most people are interested in -- single-cell analysis methods (and even "bulk" methods) are still developing and we really don't know an "optimal" solution. My colleagues are wonderful people who have put a lot of work into investigating such observations.

Anyway, please try keeping biostars a positive community :)

ADD REPLY • link 7 months ago by dsull ★ 5.9k

1

Thank you for your contributions, dsull; they are much appreciated.

I would also say that when it comes to science we should be steering clear of personal biases and treat information based on its actual content.

ADD REPLY • link 7 months ago by Istvan Albert 100k

1

I agree that no important result should be disregarded owing to bias, but I'd be lying if I said what I have heard and seen about Lior has not affected how his words impact me. I don't mean to attack him, and I am nowhere near qualified to, but his negativity kinda leaves a bad taste after viewing his interactions. He is an excellent scientist, no doubt. I just wish he did not come off as abrasive as he does.

ADD REPLY • link 7 months ago by Ram 43k

0

I appreciate this statement and it largely mirrors my own. Lior sometimes makes extremely salient and important points. Other times, he's thoughtlessly overly aggressive and personal - as many people can be. That does not detract from his (or his group's) contributions or observations...but it does introduce a bias.

ADD REPLY • link 7 months ago by jared.andrews07 ★ 16k

1

Thank you so much for all the replies! I am new to the field, and all of the insights are really helpful and much appreciated.

ADD REPLY • link 7 months ago by MVJ ▴ 10

score 2 · Answer 1 · 2023-09-22

Pseudobulk methods, which essentially mimic bulk differential expression, are preferred because the aggregated counts are no longer sparse and noise is reduced. They require biological replicates, which are often not available.

For single-cell-level DE I personally have had good experience with both limma-voom and limma-trend, the latter based on logCPMs, for example from the scuttle package. It is critical to prefilter properly, as noted in the benchmark study by Soneson and Robinson (Nature Methods, 2018). Prefiltering removes very sparse genes, avoiding large and unreliable logFCs and significance values. Adapting the Soneson recommendations, I require that at least one of the two groups expresses a gene at 1 CPM or more in at least 25% of its cells. To my eye, the logFCs and MA-plots from such a DE analysis look much cleaner and more reasonable than unfiltered DE results, which are a complete mess. Additionally, I always use limma::treat to test against a minimum fold change of 1.2, because large cell numbers give a lot of power, and that power can make tiny, meaningless logFCs significant.
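To make the prefiltering rule concrete, here is a minimal sketch in plain Python (not the limma/scuttle workflow itself). The function names `cpm` and `keep_genes`, the list-of-lists input layout, and the default thresholds are illustrative assumptions; only the rule "at least 1 CPM in at least 25% of cells in at least one group" comes from the answer.

```python
def cpm(cell_counts):
    """Convert one cell's raw counts (one value per gene) to counts per million."""
    total = sum(cell_counts)
    if total == 0:
        return [0.0] * len(cell_counts)
    return [c * 1e6 / total for c in cell_counts]


def keep_genes(counts, groups, min_cpm=1.0, min_frac=0.25):
    """Return indices of genes that reach min_cpm in at least min_frac of the
    cells of at least one group.

    counts: list of cells, each a list of raw counts per gene (hypothetical layout)
    groups: list of group labels, one per cell
    """
    cpms = [cpm(cell) for cell in counts]
    n_genes = len(counts[0])
    labels = sorted(set(groups))
    kept = []
    for g in range(n_genes):
        for label in labels:
            cells = [cpms[i] for i, lab in enumerate(groups) if lab == label]
            frac = sum(1 for cell in cells if cell[g] >= min_cpm) / len(cells)
            if frac >= min_frac:
                kept.append(g)
                break  # one passing group is enough
    return kept


# Toy data: 4 cells in group A, 8 cells in group B, 3 genes.
counts = [[10, 1, 0], [10, 0, 0], [10, 0, 0], [10, 0, 0]] + \
         [[10, 0, 1]] + [[10, 0, 0]] * 7
groups = ["A"] * 4 + ["B"] * 8
print(keep_genes(counts, groups))
```

Note that when a cell has far fewer than a million total counts, any nonzero count already exceeds 1 CPM, so on such data the rule behaves like "detected in at least 25% of cells in one group".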

Ram · Answer 2 · 2023-09-22

The single-cell data I have usually comes from multiple donors. I compute the cluster pseudobulk per donor (by summing counts) and apply edgeR as though it were bulk expression. Typically about 10% of the clusters (clusters with <1% of cells) don't have sufficient counts for this; but I wouldn't trust any DE method on those clusters.
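The aggregation step described above (summing counts per cluster per donor) can be sketched as follows. This is a plain-Python illustration under assumed inputs, not the edgeR workflow; the function name `pseudobulk` and the list-based layout are hypothetical.

```python
from collections import defaultdict


def pseudobulk(counts, donors, clusters):
    """Sum raw counts over all cells that share a (donor, cluster) pair.

    counts:   list of cells, each a list of raw counts per gene
    donors:   donor label per cell
    clusters: cluster label per cell
    Returns a dict mapping (donor, cluster) to a summed count vector.
    """
    n_genes = len(counts[0])
    sums = defaultdict(lambda: [0] * n_genes)
    for cell, donor, cluster in zip(counts, donors, clusters):
        profile = sums[(donor, cluster)]
        for g, c in enumerate(cell):
            profile[g] += c
    return dict(sums)


# Toy data: 4 cells, 2 genes.
counts = [[1, 0], [2, 3], [0, 5], [4, 4]]
donors = ["d1", "d1", "d2", "d1"]
clusters = ["c1", "c1", "c1", "c2"]
pb = pseudobulk(counts, donors, clusters)
print(pb)
```

Each summed vector then plays the role of one bulk sample, so within a cluster the donors act as the replicates for a bulk-style test.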

I've experimented with jackknifing the cells (splitting into 5 randomly-split "pseudo-donors") when there's no other way to group cells together and haven't really liked the results.
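For reference, the random split into pseudo-donors could look like the sketch below. The function name `assign_pseudo_donors`, the fixed seed, and the near-equal group sizes are assumptions made for illustration; the answer only says the cells were split randomly into 5 groups.

```python
import random


def assign_pseudo_donors(n_cells, n_splits=5, seed=0):
    """Randomly assign each cell to one of n_splits pseudo-donors of
    near-equal size. Returns one integer label per cell."""
    labels = [i % n_splits for i in range(n_cells)]
    random.Random(seed).shuffle(labels)  # seeded for reproducibility
    return labels


labels = assign_pseudo_donors(23)
sizes = [labels.count(k) for k in range(5)]
print(sizes)
```

Such splits are not independent biological replicates, which may be part of why the results are unsatisfying.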

I'm aware that edgeR is, particularly in benchmarks, applied at the single-cell level (i.e., on the entire count matrix); but it's much faster and IMO more reliable to apply it in the way suggested by the vignette: https://bioconductor.org/packages/release/bioc/vignettes/Glimma/inst/doc/single_cell_edger.html

"I would also like to ask you how you approach the selection of the top DEGs, namely what are the best p_value_adjust and avg_logFC thresholds to use."

My go-to is FDR < 0.05 and |logFC| > log(1.25).
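Applied to a results table, these cutoffs amount to a simple filter. The sketch below assumes each result is a dict with `gene`, `logFC`, and `FDR` keys (a hypothetical layout) and that logFC is on the log2 scale, as edgeR reports it; the answer writes log(1.25) without stating the base, so the base is an assumption here.

```python
import math


def select_degs(results, fdr_cutoff=0.05, min_fold_change=1.25):
    """Keep genes with FDR below the cutoff and an absolute log2 fold change
    above log2(min_fold_change), sorted by FDR."""
    min_lfc = math.log2(min_fold_change)  # assumption: log2 scale
    hits = [r for r in results
            if r["FDR"] < fdr_cutoff and abs(r["logFC"]) > min_lfc]
    return sorted(hits, key=lambda r: r["FDR"])


# Toy results table with made-up values.
results = [
    {"gene": "A", "logFC": 1.0, "FDR": 0.001},    # passes both cutoffs
    {"gene": "B", "logFC": 0.2, "FDR": 0.0001},   # effect too small
    {"gene": "C", "logFC": -0.5, "FDR": 0.04},    # passes both cutoffs
    {"gene": "D", "logFC": 2.0, "FDR": 0.2},      # not significant
]
selected = [r["gene"] for r in select_degs(results)]
print(selected)
```

Filtering on logFC after testing is not the same as testing against a minimum fold change (as limma::treat does in the other answer); this sketch only covers the post-hoc filter.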