Question

Issue with DESeq2 gene outlier detection when having multiple groups

0

Entering edit mode

20 months ago

Papyrus ★ 2.9k

Hi guys,

I'm analysing some data with DESeq2 which involves 4 groups, so I do some comparisons between some specific groups. I'm finding that, when labelling genes as outliers (hence giving NA p-values) DESeq2 uses the information from all the samples in each row. This leads to some outlier genes for some comparison not being adequately (for me) labelled.

See this example: I compare group3 vs group4, and the top result is a very obvious outlier: it has 0 counts for all samples in g3 and g4, except one of them, which has 1000 counts. It was not flagged as an outlier because samples in g1 and g2 have high levels of counts:

#######################################
# Analysis with all samples
#######################################

# This is the design
design(foo) <- ~0 + group
model.matrix(design(foo), colData(foo))
       groupg1 groupg2 groupg3 groupg4
s1_g1        1       0       0       0
s2_g1        1       0       0       0
s3_g1        1       0       0       0
s4_g2        0       1       0       0
s5_g2        0       1       0       0
s6_g2        0       1       0       0
s7_g3        0       0       1       0
s8_g3        0       0       1       0
s9_g3        0       0       1       0
s10_g4       0       0       0       1
s11_g4       0       0       0       1
s12_g4       0       0       0       1
attr(,"assign")
[1] 1 1 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"


# Fit model
foo <- DESeq(foo, test = "Wald", parallel = T)

# Extract results comparing group 4 vs group 3
res <- as.data.frame(DESeq2::results(foo,
+                                contrast = list(c("groupg4"),
+                                                c("groupg3")),
+                                listValues = c(1, -1),
+                                pAdjustMethod = "BH",
+                                alpha = 0.05))
head(dplyr::arrange(res,pvalue))
                        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
ENSMUSG00000095937.1  1053.02717    -25.0892237 3.7236480 -6.737807 1.607944e-11 3.003318e-07
ENSMUSG00000102500.1    94.16247      8.5057336 1.2968182  6.558925 5.419696e-11 5.061454e-07
ENSMUSG00000030757.13  465.86889     -0.5388158 0.1335390 -4.034895 5.462674e-05 3.401061e-01
ENSMUSG00000029754.13  136.73970      0.9530311 0.2500406  3.811505 1.381234e-04 6.449670e-01
ENSMUSG00000034457.10   15.72399     -2.0677856 0.5545838 -3.728536 1.925956e-04 7.194603e-01
ENSMUSG00000096370.8   188.22008     -0.8566324 0.2397271 -3.573365 3.524235e-04 8.425913e-01

# Look at the counts for the top gene:
counts(foo,normalized = T)["ENSMUSG00000095937.1",]
    s1_g1     s2_g1     s3_g1     s4_g2     s5_g2     s6_g2     s7_g3     s8_g3     s9_g3    s10_g4    s11_g4    s12_g4 
4162.4721 3140.3940 1398.9893  872.8687 2000.8897    0.0000    0.0000    0.0000 1060.7122    0.0000    0.0000    0.0000

This behavior does not happen if we subset the DESeq object to only contain groups 3 and 4. Then, this gene is correctly flagged as an outlier:

#######################################
# Only using the compared groups
#######################################

foo2 <- foo[,7:12]
foo2$group <- factor(foo2$group)

design(foo2) <- ~0 + group
model.matrix(design(foo2), colData(foo2))
       groupg3 groupg4
s7_g3        1       0
s8_g3        1       0
s9_g3        1       0
s10_g4       0       1
s11_g4       0       1
s12_g4       0       1
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$group
[1] "contr.treatment"


# Fit model
foo2 <- DESeq(foo2, test = "Wald", parallel = T)

# Extract results comparing group 4 vs group 3
res2 <- as.data.frame(DESeq2::results(foo2,
+                                      contrast = list(c("groupg4"),
+                                                      c("groupg3")),
+                                      listValues = c(1, -1),
+                                      pAdjustMethod = "BH",
+                                      alpha = 0.05))
head(dplyr::arrange(res2,pvalue))
                        baseMean log2FoldChange     lfcSE      stat       pvalue         padj
ENSMUSG00000102500.1    35.85269      8.5051312 1.3046299  6.519191 7.068767e-11 1.322284e-06
ENSMUSG00000051985.12   49.43580      3.1169236 0.6825849  4.566353 4.962824e-06 4.641729e-02
ENSMUSG00000029754.13  146.64453      0.9528564 0.2444252  3.898356 9.684778e-05 5.151895e-01
ENSMUSG00000094800.8    80.41222     -4.3109181 1.1147858 -3.867037 1.101656e-04 5.151895e-01
ENSMUSG00000075334.2  1482.64888      0.4557637 0.1311512  3.475101 5.106618e-04 9.999976e-01
ENSMUSG00000023079.14  561.07149      0.6503654 0.1898563  3.425567 6.135187e-04 9.999976e-01
# The previous top gene is now flagged as NA
dplyr::arrange(res2,pvalue)["ENSMUSG00000095937.1",]
                     baseMean log2FoldChange    lfcSE      stat pvalue padj
ENSMUSG00000095937.1 176.7854      -25.08636 3.907085 -6.420737     NA   NA

So, my question is, is there any way to change DESeq2 filtering behaviour to flag outliers only for the groups being compared? Or do you think the best approach is just to subset the data for each comparison, separately? Have you ever encountered this issue?

thanks for your help

DESeq2 R RNA-seq • 831 views

ADD COMMENT • link updated 20 months ago by i.sudbery 19k • written 20 months ago by Papyrus ★ 2.9k

score 1 · Answer 1 · 2022-09-20

1

Entering edit mode

20 months ago

i.sudbery 19k

I'm not sure that its quite as simple as you say. Outlier detection works on how much a single sample is affecting the fitted coefficients. Presumably in your first case the outlying sample is still affecting the groupg3 coefficient just as much as it is in the second example.

However, two things could be resulting in the behavior you are seeing. Firstly, if you have more than a certain number of samples, DESeq replaces the outlier with a trimmed mean of the other samples, rather than marking the gene NA. You can switch this behavior off with DESeq(dds, minReplicatesForReplace=Inf).

Second, the threshold at which a gene is designated an outlier is dependent on both the number of samples, and the number of coefficients to be estimated. From the DESeq2 manual:

The default is to use the 99% quantile of the F(p,m-p) distribution (with p the number of parameters including the intercept and m number of samples).

So in the first case the cutoff is the 99% quantile of F(4, 8), while in the second case the threshold is the 99% centile of F(2, 4).

You can alter this threshold if you wish.

Lots more detail in the Outlier's section of the DESeq2 manual.

http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#approach-to-count-outliers

ADD COMMENT • link 20 months ago by i.sudbery 19k

0

Entering edit mode

Thanks a lot for taking the time to answer. I'm sorry, I am indeed using minReplicatesForReplace=Inf, because I want to mark the outliers, but forgot to include it when typing the example.

I also expected that the outlying sample would affect the group3 coefficient in the same way in both designs, but the thing is that, the Cook's distances change a lot with the design, that's why I initially thought maybe using the full model was somehow dampening the Cook's distance of the outlier:

counts(foo,normalized = T)["ENSMUSG00000095937.1",]
    s1_g1     s2_g1     s3_g1     s4_g2     s5_g2     s6_g2     s7_g3     s8_g3     s9_g3    s10_g4    s11_g4    s12_g4 
4162.4721 3140.3940 1398.9893  872.8687 2000.8897    0.0000    0.0000    0.0000 1060.7122    0.0000    0.0000    0.0000 

assays(foo)[["cooks"]]["ENSMUSG00000095937.1",]
         s1_g1          s2_g1          s3_g1          s4_g2          s5_g2          s6_g2          s7_g3          s8_g3 
0.018470474541 0.000666898518 0.026155874473 0.000768351629 0.115715770099 0.097503104284 0.093973458469 0.099228014040 
         s9_g3         s10_g4         s11_g4         s12_g4 
0.396930191339 0.000001959902 0.000001295067 0.000001252200 

assays(foo2)[["cooks"]]["ENSMUSG00000095937.1",]
              s7_g3           s8_g3           s9_g3          s10_g4          s11_g4          s12_g4 
     8.214905241000  8.945367523128 35.783308593148  0.000003927686  0.000002595326  0.000002509419

As you can see, I would need a very low Cook's value filter to flag the gene as an outlier in the full model, and this would lead to flagging thousands of genes as outliers.

See here another example which would look like an outlier for the groups 3 and 4 (this one was not flagged by the filter probably because it didn't reach the 99% quantile of F(4,8), but it has a low enough value that it could maybe be flagged by softly changing the Cook's cut-off threshold). Nonetheless, again the Cook's distances differences are more clear in the subset of the data model:

counts(foo,normalized = T)["ENSMUSG00000058603.10",]
    s1_g1     s2_g1     s3_g1     s4_g2     s5_g2     s6_g2     s7_g3     s8_g3     s9_g3    s10_g4    s11_g4    s12_g4 
537.06288 394.79153 206.62116 384.42496 354.58168  97.89824 581.50761 606.94055 613.09116   0.00000 973.12305   0.00000 

dplyr::arrange(as.data.frame(res),pvalue)["ENSMUSG00000058603.10",]
                      baseMean log2FoldChange    lfcSE       stat    pvalue padj
ENSMUSG00000058603.10 395.8369     -0.8887796 1.953566 -0.4549524 0.6491435    1
dplyr::arrange(as.data.frame(res2),pvalue)["ENSMUSG00000058603.10",]
                      baseMean log2FoldChange    lfcSE       stat pvalue padj
ENSMUSG00000058603.10 462.4437     -0.8886349 3.051692 -0.2911942     NA   NA

assays(foo)[["cooks"]]["ENSMUSG00000058603.10",]
        s1_g1         s2_g1         s3_g1         s4_g2         s5_g2         s6_g2         s7_g3         s8_g3         s9_g3 
0.12296053922 0.00115697258 0.14783345982 0.10147837623 0.05217921638 0.29845818685 0.00071155004 0.00008201613 0.00031380887 
       s10_g4        s11_g4        s12_g4 
0.78333375519 2.70595045140 0.66832480645 
assays(foo2)[["cooks"]]["ENSMUSG00000058603.10",]
       s7_g3        s8_g3        s9_g3       s10_g4       s11_g4       s12_g4 
 0.008799736  0.001039258  0.003974644  9.132780779 33.102203471  8.206535740

I also imagine that part of the problem here is having only n=3 per group, although it was very striking that the most significant result of the analysis with the full model was the first outlier.

ADD REPLY • link 20 months ago by Papyrus ★ 2.9k

0

Entering edit mode

I guess you could check yourself by removing the offending sample and checking how the coefficients change.

ADD REPLY • link 20 months ago by i.sudbery 19k