Question

How to fix extreme outliers not picked up by DESeq2?

0

Entering edit mode

4.9 years ago

fisher.w.daniel • 0

Hey internet,

I've been working with some bulk RNA-seq data with 44 samples, and I noticed a handful of genes that have large log2-fold changes that were quite tantalizing. However, when visually inspected with plotCounts(), it's very clear that there are EXTREME outliers that are driving the effect.

For instance: ADAMTS1 (a neat little metallopeptidase gene) had raw counts between ~400 to ~2000 for almost all the samples; however, there is one sample that has ~150,000 counts, ~75x the next highest count. While I hoped that the Cook's Distance/Trimmed Mean Replacement algorithm in DESeq2 would take care of this, it didn't, and the gene came up as signifcant with a -3.74 log2-fold difference. Granted a bit of an indirect comparison, but the mean of the raw counts of my control is ~8000 with the outlier remaining and ~1000 with the outlier removed. By contrast, the mean of cases is ~990, suggesting that this gene is unlikely to be different when the outlier is removed.

DESeq2 obviously didn't replace this clear outlier which is >6SD from the mean, so I thought I'd try to replace it manually. I'm making my DESeq object (dds) from TXImport, so I tried to extract the count matrix and tried to re-enact the preceding code for DESeqDataSetFromTximport() with DESeqDataSetFromMatrix(). This way, I can manually change the count matrix before making the dds, which I can't figure out how to do in the Txi object. Alas, the resulting DEG tables were not equivalent, and so now I'm stuck.

Any suggestions?

#Regular DESeq2 with object renamed dds_txi
dds_txi <- DESeqDataSetFromTximport(txi.aff.FA, colData = sample_table_affective_FA, design = ~Batch + Gender + RIN + affective)
dds_txi$affective <- relevel(dds_txi$affective, ref = "control")
dds_txi <- DESeq(dds_txi)
res.txi <- results(dds_txi, alpha = 0.05) 

#DESeq2 from count matrix
txi.aff.FA.counts <- txi.aff.FA$counts
txi.aff.FA.counts.round <- round(txi.aff.FA.counts)
mode(txi.aff.FA.counts.round) <- "integer"
dds_txi_count <- DESeqDataSetFromMatrix(txi.aff.FA.counts.round, colData = sample_table_affective_FA, design = ~Batch + Gender + RIN + affective)
dds_txi_count$affective <- relevel(dds_txi_count$affective, ref = "control")
dds_txi_count <- DESeq(dds_txi_count)
res.txi.count <- results(dds_txi_count, alpha = 0.05)

DESeq2 RNA-Seq Outlier Cook's Distance TXImport • 2.7k views

ADD COMMENT • link updated 4.9 years ago by antonioggsousa 3.4k • written 4.9 years ago by fisher.w.daniel • 0

score 1 · Answer 1 · 2020-08-27

The reason why DESeq is not filtering your outliers is set out in the DESeq2 users manual:

Note that with continuous variables in the design, outlier detection and replacement is not automatically performed, as our current methods involve a robust estimation of within-group variance which does not extend easily to continuous covariates. However, users can examine the Cook’s distances in assays(dds)[["cooks"]], in order to perform manual visualization and filtering if necessary.

I'm guessing your RIN factor in the design is continuous? I think the point is that the Trimmed Mean Replacement procedure doesn't work if you can't divide the samples into discrete categories.

The reason that you are not getting the same results when you build the DESeqDataSet object from the counts table, rather from the txi object is that the txImport to DESeq2 conversion involves more than just rounding the counts. If you import your counts via txImport then the per-sample transcript weight effective gene length is used for normalisation purposes to counteract the effect of any change in isoform usage between conditions. But you should be able to alter the counts matrix using:

txi.aff.FA$counts[gene, sample] <- corrected_value

Ideally you'd want to be able to make it NA, but I don't know what effect of that would be.

score 0 · Answer 2 · 2020-08-27

0

Entering edit mode

4.9 years ago

antonioggsousa 3.4k

Hi,

Did you try to apply the lfcShrink() function from DESeq2?

lfcShrink() function (have a look into the function doc: https://rdrr.io/bioc/DESeq2/man/lfcShrink.html) shrinks the log-fold change of genes that are highly variable among replicates or have low read counts and, therefore, the estimates are more prone to uncertainty and error. This function is applied after you get the differential expression results. It will not change the p-value, but will decrease the log-fold change towards zero (the official DESeq2 vignette provide some examples).

I hope this helps,

António

ADD COMMENT • link 4.9 years ago by antonioggsousa 3.4k

0

Entering edit mode

Thanks for the suggestion. Certainly using the lfcShrink() function would make the log2-fold change more accurate in this case, but it unfortunately won't change a few things that I'd like to fix: 1) The significance, because it's very likely that there is not a significant difference between these two conditions for this gene, 2) the value itself, as I'm going to use these expression values from the dds object (after VST and then correcting for covariates with limma) for WGCNA, and the aberrant value would still hurt this analysis. Granted, there are only a few genes like this, but I'm trying my best to make the data as accurate as possible, especially for WGCNA, as compeltion of the rest of our R01 depends picking out good hubs genes.

I think the approach outlined above is going to work best, but reading more about lfcShrink() was very helpful, so thanks for the reference!

ADD REPLY • link 4.9 years ago by fisher.w.daniel • 0