How are outliers determined in DESeq2?


6.4 years ago

john.perish • 0

How are outliers determined in DESeq2?

I aligned my samples with STAR and ran the differential expression analysis with DESeq2. The output lets me view genes by "status": Low (mean count below 10), OK (mean count above 10), and Outlier.

It is not entirely clear how Outliers are determined. Does anyone know?

I have tried several alignment and differential expression pipelines, and this one yields the fewest significant genes. However, many of the genes on the Outliers list are significant in the other analyses.

Thanks!

RNA-Seq DEseq2 STAR • 4.1k views


6.4 years ago

Devon Ryan 104k

DESeq2 flags outliers using Cook's distance, provided you have sufficient replicates (at least 3 per group). You can change the cutoff with the cooksCutoff argument to results(), or set cooksCutoff=FALSE to turn the filter off.


3.2 years ago

ATpoint 82k

There is also this illustrative answer from the DESeq2 developer over at Bioconductor:

https://support.bioconductor.org/p/92428/#92447

I'll start with just a quick explanation of Cook's distance: it measures within each gene, for each sample, how removing that sample would change the LFCs (all of the coefficients implied by the design and estimated by DESeq2).

So if you have e.g. 3 samples vs 2 samples, and the counts for a gene are [10,10,10] vs [15, 1000], you can see how the Cook's distance will be high for the two samples. Removing either one changes the LFC for the comparison of the two groups. However, if it were [10,10,10] vs [50,50], the two samples "support" each other, such that removing one doesn't change the LFC at all. Hence, we find Cook's to be useful for identifying outliers.
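To make the intuition in that example concrete, here is a minimal Python sketch that drops each sample of the second group in turn and reports how far the log2 fold change moves. This is not Cook's distance as DESeq2 computes it (DESeq2 works on the fitted negative binomial GLM); it only illustrates the leave-one-out idea using plain group means. The function names are made up for this illustration.

```python
import math


def log2_fold_change(group_a, group_b):
    """log2 ratio of the mean count in group_b over the mean count in group_a."""
    mean_a = sum(group_a) / len(group_a)
    mean_b = sum(group_b) / len(group_b)
    return math.log2(mean_b / mean_a)


def leave_one_out_shifts(group_a, group_b):
    """Absolute change in the LFC when each sample of group_b is dropped in turn."""
    full = log2_fold_change(group_a, group_b)
    shifts = []
    for i in range(len(group_b)):
        # Rebuild group_b without sample i and recompute the fold change.
        reduced = group_b[:i] + group_b[i + 1:]
        shifts.append(abs(log2_fold_change(group_a, reduced) - full))
    return shifts


control = [10, 10, 10]
discordant = leave_one_out_shifts(control, [15, 1000])  # samples disagree
concordant = leave_one_out_shifts(control, [50, 50])    # samples agree

print("discordant shifts:", discordant)
print("concordant shifts:", concordant)
```

For [15, 1000] both shifts are large (roughly 1 and 5 log2 units), while for [50, 50] both are zero, which matches the description above: samples that "support" each other leave the LFC unchanged when one is removed.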

However, having 2 samples is really problematic to try to identify outliers. In particular, there's really no way to say if one or the other sample is an "outlier", or if it's just a gene with high dispersion (in addition to increased expression, e.g. in the above example). With 3 samples, it's really the bare minimum, but nevertheless we do enable filtering of genes which may contain extreme count outliers.
