How are outliers determined in DESeq2?
3.9 years ago

How are outliers determined in DESeq2?

I aligned my samples with STAR and ran a differential expression analysis with DESeq2. The output lets me view genes by "status": Low (mean count value less than 10), OK (mean count value higher than 10), and Outlier.

It is not entirely clear how the Outlier status is assigned. Does anyone know?

I have tried multiple alignment and differential expression pipelines, and this one yields the smallest number of significant DE genes. However, many of the genes that show up in its Outlier list are significant in the other analyses.

Thanks!

RNA-Seq DEseq2 STAR
3.9 years ago

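DESeq2 uses Cook's distance to flag outliers, provided you have sufficient replicates (at least 3 per condition). If any sample of a gene has a Cook's distance above the cutoff, the p-value and adjusted p-value for that gene are set to NA, which is why such genes cannot come out as significant. You can change the cutoff with the cooksCutoff argument of results(), or disable the filter with cooksCutoff=FALSE.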
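To get a feel for how large the default cutoff is: the DESeq2 documentation describes the default cooksCutoff as the 0.99 quantile of the F(p, m - p) distribution, where p is the number of fitted coefficients and m is the number of samples. Below is a minimal Python sketch (not DESeq2 code) for the simplest case only, a two-group design with p = 2, where the F quantile has a closed form. The function name default_cooks_cutoff_two_group is made up for illustration.

```python
def default_cooks_cutoff_two_group(n_samples, quantile=0.99):
    """0.99 quantile of F(2, m - 2), i.e. the documented default cutoff
    for an intercept + one condition coefficient design (p = 2).

    Uses the closed-form CDF of F(2, d2): 1 - (1 + 2x/d2) ** (-d2/2).
    This shortcut is only valid for p = 2 (hypothetical helper, not a
    DESeq2 function).
    """
    d2 = n_samples - 2  # denominator degrees of freedom, m - p
    if d2 <= 0:
        raise ValueError("need more samples than coefficients")
    return (d2 / 2.0) * ((1.0 - quantile) ** (-2.0 / d2) - 1.0)


# 3 vs 2 samples (m = 5) and 3 vs 3 samples (m = 6)
for m in (5, 6):
    print(m, round(default_cooks_cutoff_two_group(m), 2))
```

With few samples the cutoff is large, so only quite extreme counts get flagged; for designs with more coefficients you would need a general F quantile function instead of this shortcut.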

8 months ago
ATpoint 55k

There is also this illustrative answer from the DESeq2 developer over at Bioconductor:

https://support.bioconductor.org/p/92428/#92447

I'll start with just a quick explanation of Cook's distance: it measures within each gene, for each sample, how removing that sample would change the LFCs (all of the coefficients implied by the design and estimated by DESeq2).

So if you have e.g. 3 samples vs 2 samples, and the counts for a gene are [10,10,10] vs [15, 1000], you can see how the Cook's distance will be high for the two samples. Removing either one changes the LFC for the comparison of the two groups. However, if it were [10,10,10] vs [50,50], the two samples "support" each other, such that removing one doesn't change the LFC at all. Hence, we find Cook's to be useful for identifying outliers.
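The intuition in that example can be checked with a few lines of Python. This is only a rough leave-one-out sketch: it uses a plain log2 ratio of group means and reports the raw LFC shift, whereas DESeq2 computes Cook's distance from its fitted negative binomial GLM and scales it by the estimated variance. The function names are made up for illustration.

```python
import math


def log2_fold_change(group_a, group_b):
    """log2 ratio of the group means (group_b over group_a)."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2(mean_b / mean_a)


def leave_one_out_shifts(group_a, group_b):
    """Absolute change in LFC when each group_b sample is dropped in turn."""
    full = log2_fold_change(group_a, group_b)
    shifts = []
    for i in range(len(group_b)):
        reduced = group_b[:i] + group_b[i + 1:]
        shifts.append(abs(log2_fold_change(group_a, reduced) - full))
    return shifts


control = [10, 10, 10]
# Removing either sample moves the LFC noticeably.
print(leave_one_out_shifts(control, [15, 1000]))
# The two samples "support" each other: no shift at all.
print(leave_one_out_shifts(control, [50, 50]))
```

In the first case both shifts are clearly non-zero, while in the second case both are exactly zero, matching the description above.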

However, having 2 samples is really problematic to try to identify outliers. In particular, there's really no way to say if one or the other sample is an "outlier", or if it's just a gene with high dispersion (in addition to increased expression, e.g. in the above example). With 3 samples, it's really the bare minimum, but nevertheless we do enable filtering of genes which may contain extreme count outliers.