How to fix extreme outliers not picked up by DESeq2?
2
0
Entering edit mode
3.6 years ago

Hey internet,

I've been working with some bulk RNA-seq data with 44 samples, and I noticed a handful of genes that have large log2-fold changes that were quite tantalizing. However, when visually inspected with plotCounts(), it's very clear that there are EXTREME outliers that are driving the effect.

For instance: ADAMTS1 (a neat little metallopeptidase gene) had raw counts between ~400 to ~2000 for almost all the samples; however, there is one sample that has ~150,000 counts, ~75x the next highest count. While I hoped that the Cook's Distance/Trimmed Mean Replacement algorithm in DESeq2 would take care of this, it didn't, and the gene came up as signifcant with a -3.74 log2-fold difference. Granted a bit of an indirect comparison, but the mean of the raw counts of my control is ~8000 with the outlier remaining and ~1000 with the outlier removed. By contrast, the mean of cases is ~990, suggesting that this gene is unlikely to be different when the outlier is removed.

DESeq2 obviously didn't replace this clear outlier which is >6SD from the mean, so I thought I'd try to replace it manually. I'm making my DESeq object (dds) from TXImport, so I tried to extract the count matrix and tried to re-enact the preceding code for DESeqDataSetFromTximport() with DESeqDataSetFromMatrix(). This way, I can manually change the count matrix before making the dds, which I can't figure out how to do in the Txi object. Alas, the resulting DEG tables were not equivalent, and so now I'm stuck.

Any suggestions?

#Regular DESeq2 with object renamed dds_txi
dds_txi <- DESeqDataSetFromTximport(txi.aff.FA, colData = sample_table_affective_FA, design = ~Batch + Gender + RIN + affective)
dds_txi$affective <- relevel(dds_txi$affective, ref = "control")
dds_txi <- DESeq(dds_txi)
res.txi <- results(dds_txi, alpha = 0.05) 

#DESeq2 from count matrix
txi.aff.FA.counts <- txi.aff.FA$counts
txi.aff.FA.counts.round <- round(txi.aff.FA.counts)
mode(txi.aff.FA.counts.round) <- "integer"
dds_txi_count <- DESeqDataSetFromMatrix(txi.aff.FA.counts.round, colData = sample_table_affective_FA, design = ~Batch + Gender + RIN + affective)
dds_txi_count$affective <- relevel(dds_txi_count$affective, ref = "control")
dds_txi_count <- DESeq(dds_txi_count)
res.txi.count <- results(dds_txi_count, alpha = 0.05)
DESeq2 RNA-Seq Outlier Cook's Distance TXImport • 1.7k views
ADD COMMENT
1
Entering edit mode
3.6 years ago

The reason why DESeq is not filtering your outliers is set out in the DESeq2 users manual:

Note that with continuous variables in the design, outlier detection and replacement is not automatically performed, as our current methods involve a robust estimation of within-group variance which does not extend easily to continuous covariates. However, users can examine the Cook’s distances in assays(dds)[["cooks"]], in order to perform manual visualization and filtering if necessary.

I'm guessing your RIN factor in the design is continuous? I think the point is that the Trimmed Mean Replacement procedure doesn't work if you can't divide the samples into discrete categories.

The reason that you are not getting the same results when you build the DESeqDataSet object from the counts table, rather from the txi object is that the txImport to DESeq2 conversion involves more than just rounding the counts. If you import your counts via txImport then the per-sample transcript weight effective gene length is used for normalisation purposes to counteract the effect of any change in isoform usage between conditions. But you should be able to alter the counts matrix using:

txi.aff.FA$counts[gene, sample] <- corrected_value

Ideally you'd want to be able to make it NA, but I don't know what effect of that would be.

ADD COMMENT
0
Entering edit mode

Yes! That is exactly why it is not doing the Trimmed Mean Replacement Procedure; RIN is a continous covariate. Thank you for noticing this. It makes more sense now.

I'm almost positive you're correct about why the two dds objects aren't the same as well. I was able to notice some of what made the two functions different but not enough to know that the lengths are used in downstream steps. My ability to follow and alter source code is very limited at my stage of working with R (or any programing language, object oriented or not).

I tried this approach of changing the count data in txi directly by repacing the outlier with the mean of the population without the outlier, and it worked just perfectly. One question that remains, however, is whether this will extend to other parts of the txi object like $abundance. My guess is probably not, but I'm curious if you know? Again, thanks!

ADD REPLY
0
Entering edit mode

No, it won't affect other areas. the txi "object" is just a list of matrices - nothing more fancy than that.

ADD REPLY
0
Entering edit mode
3.6 years ago

Hi,

Did you try to apply the lfcShrink() function from DESeq2?

lfcShrink() function (have a look into the function doc: https://rdrr.io/bioc/DESeq2/man/lfcShrink.html) shrinks the log-fold change of genes that are highly variable among replicates or have low read counts and, therefore, the estimates are more prone to uncertainty and error. This function is applied after you get the differential expression results. It will not change the p-value, but will decrease the log-fold change towards zero (the official DESeq2 vignette provide some examples).

I hope this helps,

António

ADD COMMENT
0
Entering edit mode

Thanks for the suggestion. Certainly using the lfcShrink() function would make the log2-fold change more accurate in this case, but it unfortunately won't change a few things that I'd like to fix: 1) The significance, because it's very likely that there is not a significant difference between these two conditions for this gene, 2) the value itself, as I'm going to use these expression values from the dds object (after VST and then correcting for covariates with limma) for WGCNA, and the aberrant value would still hurt this analysis. Granted, there are only a few genes like this, but I'm trying my best to make the data as accurate as possible, especially for WGCNA, as compeltion of the rest of our R01 depends picking out good hubs genes.

I think the approach outlined above is going to work best, but reading more about lfcShrink() was very helpful, so thanks for the reference!

ADD REPLY

Login before adding your answer.

Traffic: 2665 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6