Question

DESeq2 analysis: huge number of outliers and refitting

8.8 years ago

VHahaut ★ 1.2k

Hello!

I am quite new to bioinformatics, so I hope my question will be clear enough.

I am trying to run a DESeq2 analysis on 25 bovine tumor samples. Among them I have two technical replicates of my single control (I know this is not ideal), and most of my "treated" samples have one technical replicate too. Before any DESeq2 analysis I had to drop a few samples because the quality of the RNA-seq was not good enough.

design = ~Group

Overview of colData

row.names   sample    Group
sample1     sample1   treated
sample2     sample1   treated
sample3     sample2   control
sample4     sample2   control

I tried two different approaches: either running the DESeq2 analysis without specifying that I had technical replicates (dds), or using the collapseReplicates function on the colData sample column to merge the reads of technical replicates (ddsCollapsed).

dds <- DESeqDataSetFromMatrix(countData = matrix, colData = colData, design = design)
ddsCollapsed <- collapseReplicates(dds, groupby = colData(dds)$sample, renameCols = TRUE)

My problem lies in the DESeq analysis:

DESeq(dds)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 6787 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

DESeq(ddsCollapsed)

estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 24224 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing

I am working with the bovine ENSEMBL annotation, which contains ~24660 entries...

I was really surprised by the number of outliers. Moreover, the MA plots from those two analyses are really not great (I have attached the one for ddsCollapsed to this post):

I have already read the supplementary material about Cook's distance.

So my questions are the following:

1. Do I have to worry about such a high number of outliers? Is it common? What could be the reasons leading to those numbers?
2. If yes to (1), what can I do to overcome this problem?
3. An unrelated question: is it possible to put missing values (NA) in the colData table? I tried and got this error:

Error in t(hatmatrix %*% t(y)) :
"error in evaluating the argument 'x' in selecting a method for function 't': Error in hatmatrix %*% t(y) : non-conformable arguments"

Thanks for reading this long post! Any advice would be appreciated! Vincent


Ram · Accepted Answer · 2015-07-01

The count outlier flagging is useful when only a minority of counts in the dataset are outliers, but as you have noted, something else is going on here with so many genes flagged. There are two likely explanations: either the method for flagging outliers is not appropriate for the distribution of counts in your data and should be turned off (by setting minReplicatesForReplace=Inf in DESeq() and cooksCutoff=FALSE in results()), or you have a sample that is a count outlier in almost every gene (which could be found using plotPCA as in the vignette). My recommendation: if you don't find an obvious outlier sample that is contributing to most of these filtered genes, turn off the filtering and inspect the top genes using plotCounts.
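As a rough sketch of that workflow, something like the following could be used. It runs on simulated data from makeExampleDESeqDataSet rather than the bovine counts, so the object names (dds_raw, dds_off, top) and the "condition" column are assumptions of this example; with the real data you would use your own dataset and intgroup = "Group". The boxplot of Cook's distances per sample follows the approach shown in the DESeq2 vignette.

```r
library(DESeq2)

# Simulated stand-in for the real data (assumption: swap in your own object).
# makeExampleDESeqDataSet() creates a colData column named 'condition'.
set.seed(1)
dds_raw <- makeExampleDESeqDataSet(n = 1000, m = 16, betaSD = 1)

# Step 1: default run, then look for one sample that is an outlier in many genes.
dds <- DESeq(dds_raw)

# PCA on variance-stabilized counts: an outlier sample often sits far from the rest.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
plotPCA(vsd, intgroup = "condition")

# Cook's distances per sample: one box clearly higher than the others
# points to a problematic sample.
boxplot(log10(assays(dds)[["cooks"]]), range = 0, las = 2)

# Step 2: if no single sample stands out, switch off outlier replacement
# (argument of DESeq) and Cook's-based filtering (argument of results).
dds_off <- DESeq(dds_raw, minReplicatesForReplace = Inf)
res <- results(dds_off, cooksCutoff = FALSE)

# Step 3: check by eye that the top gene is not driven by one extreme count.
top <- rownames(res)[which.min(res$padj)]
plotCounts(dds_off, gene = top, intgroup = "condition")
```

If the PCA or the Cook's boxplot does single out one sample, removing that sample and rerunning with the defaults is probably the better first step than disabling the filtering.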

No, you can't include NA in the columns of colData that are used for modeling. We need complete covariate information.
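One way to handle this, sketched below under assumptions, is to drop the samples with a missing design variable from both the sample table and the count matrix before building the dataset. The sample table and counts here are made up for illustration (sample5 and the Poisson counts are hypothetical); only the Group column name and ~ Group design come from the question.

```r
library(DESeq2)

# Hypothetical sample table with one missing value in the design variable.
colData <- data.frame(
  sample = c("sample1", "sample1", "sample2", "sample2", "sample3"),
  Group  = c("treated", "treated", "control", "control", NA),
  row.names = paste0("sample", 1:5)
)

# Made-up counts: 100 genes x 5 samples, columns named after the colData rows.
set.seed(1)
counts <- matrix(
  rpois(100 * 5, lambda = 50), ncol = 5,
  dimnames = list(paste0("gene", 1:100), rownames(colData))
)

# Keep only samples with complete information for the variables in the design.
keep <- complete.cases(colData[, "Group", drop = FALSE])
colData_ok <- colData[keep, , drop = FALSE]
counts_ok  <- counts[, keep, drop = FALSE]

# Make the reference level explicit and confirm the two tables still line up.
colData_ok$Group <- factor(colData_ok$Group, levels = c("control", "treated"))
stopifnot(identical(colnames(counts_ok), rownames(colData_ok)))

dds <- DESeqDataSetFromMatrix(countData = counts_ok,
                              colData   = colData_ok,
                              design    = ~ Group)
```

The same subsetting applies to any other column that appears in the design formula: list all of them in the complete.cases call.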