Hey internet,
I've been working with some bulk RNA-seq data with 44 samples, and I noticed a handful of genes that have large log2-fold changes that were quite tantalizing. However, when visually inspected with plotCounts(), it's very clear that there are EXTREME outliers that are driving the effect.
For instance: ADAMTS1 (a neat little metallopeptidase gene) had raw counts between ~400 to ~2000 for almost all the samples; however, there is one sample that has ~150,000 counts, ~75x the next highest count. While I hoped that the Cook's Distance/Trimmed Mean Replacement algorithm in DESeq2 would take care of this, it didn't, and the gene came up as signifcant with a -3.74 log2-fold difference. Granted a bit of an indirect comparison, but the mean of the raw counts of my control is ~8000 with the outlier remaining and ~1000 with the outlier removed. By contrast, the mean of cases is ~990, suggesting that this gene is unlikely to be different when the outlier is removed.
DESeq2 obviously didn't replace this clear outlier which is >6SD from the mean, so I thought I'd try to replace it manually. I'm making my DESeq object (dds) from TXImport, so I tried to extract the count matrix and tried to re-enact the preceding code for DESeqDataSetFromTximport() with DESeqDataSetFromMatrix(). This way, I can manually change the count matrix before making the dds, which I can't figure out how to do in the Txi object. Alas, the resulting DEG tables were not equivalent, and so now I'm stuck.
Any suggestions?
#Regular DESeq2 with object renamed dds_txi
dds_txi <- DESeqDataSetFromTximport(txi.aff.FA, colData = sample_table_affective_FA, design = ~Batch + Gender + RIN + affective)
dds_txi$affective <- relevel(dds_txi$affective, ref = "control")
dds_txi <- DESeq(dds_txi)
res.txi <- results(dds_txi, alpha = 0.05)
#DESeq2 from count matrix
txi.aff.FA.counts <- txi.aff.FA$counts
txi.aff.FA.counts.round <- round(txi.aff.FA.counts)
mode(txi.aff.FA.counts.round) <- "integer"
dds_txi_count <- DESeqDataSetFromMatrix(txi.aff.FA.counts.round, colData = sample_table_affective_FA, design = ~Batch + Gender + RIN + affective)
dds_txi_count$affective <- relevel(dds_txi_count$affective, ref = "control")
dds_txi_count <- DESeq(dds_txi_count)
res.txi.count <- results(dds_txi_count, alpha = 0.05)
Yes! That is exactly why it is not doing the Trimmed Mean Replacement Procedure; RIN is a continous covariate. Thank you for noticing this. It makes more sense now.
I'm almost positive you're correct about why the two dds objects aren't the same as well. I was able to notice some of what made the two functions different but not enough to know that the lengths are used in downstream steps. My ability to follow and alter source code is very limited at my stage of working with R (or any programing language, object oriented or not).
I tried this approach of changing the count data in txi directly by repacing the outlier with the mean of the population without the outlier, and it worked just perfectly. One question that remains, however, is whether this will extend to other parts of the txi object like $abundance. My guess is probably not, but I'm curious if you know? Again, thanks!
No, it won't affect other areas. the
txi
"object" is just a list of matrices - nothing more fancy than that.