differentially expressed genes among multiple groups
1
0
Entering edit mode
3.7 years ago
Iván ▴ 60

Hi,

I'm re-analyzing an Affymetrix array with 200+ samples distributed across 38 groups. They refer to multiple cell types (CD8+, CD4+, Granulocytes, etc.), so there isn't a "control" condition on which the groups are to be compared against. I've implemented code to RMA-normalize the CEL files and perform pair-wise comparisons for each two cell types of interest, but I'm stuck on how to extract DEGs among 3 or more groups?

My design formula considers batch effects (as originally reported by the study's authors), and is as following:

design <- model.matrix(~0+group+batch)

where group refers to samples cell types.

I then go on to perform pairwise comparisons and extract DEGs, examples of which are:

contr.matrix <- makeContrasts(
  BASOPHILSvsEARLYBCELLS = Basophils - EarlyB, # (coef = 1)
  CD8NAIVEvsCD8EM = CD8Naive - CD8EM,          # (coef = 2)
  EOSINOPHILSvsMONOCYTES = Eosino - Monocyte,  # (coef = 3)
  levels = colnames(design)
)

and to extract, for instance, DEGs between basophils and early B-cells I do:

fit <- lmFit(expr.rma, design)

fit2 <- contrasts.fit(fit, contr.matrix)

fit2 <- eBayes(fit2)

tt <- topTable(fit2, coef = 1, adjust.method = "BH", n=Inf)

but how would I proceed to extract DEGs among any of these groups?

limma microarray expression degs RNA-Seq • 1.7k views
ADD COMMENT
0
Entering edit mode
3.7 years ago
ATpoint 82k

Each coefficient corresponds to one column of your contrast matrix, so you simply have to change the coef parameter according to the contrast/column you want to test.

BASOPHILSvsEARLYBCELLS_tt <- topTable(fit2, coef = 1, adjust.method = "BH", n=Inf)
CD8NAIVEvsCD8EM_tt        <- topTable(fit2, coef = 2, adjust.method = "BH", n=Inf)
EOSINOPHILSvsMONOCYTES    <- topTable(fit2, coef = 3, adjust.method = "BH", n=Inf)

To be sure, can you show the design matrix?

ADD COMMENT
0
Entering edit mode

Yes, that's how I'm extracting the DEGs from each 2-way comparison. But I'd like to know whether it would be possible to calculate DEGs all at once among e.g Basophils X Early B-cells X CD8-naive X CD8EM x Monocytes X Eosinophils, as one would do by an ANOVA analysis.

My design is quite big because there are a lot of samples and groups (I don't want to perform all-vs-all comparison, just these groups of interest). I pasted it here https://pastebin.com/Nf4D3PSJ (note that HTAxxx columns correspond to batches).

Edit: I guess I could perform all pairwise comparisons and just calculate the union of DEGs across comparisons. But I'm not sure type 1 error would be properly controlled using this strategy.

ADD REPLY
0
Entering edit mode

Section 3.2.6 of the manual https://www.bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf might be of interest to make ANOVA-like tests that detect whether there are any DEGs among all indicated groups.

ADD REPLY
0
Entering edit mode

Thank you for pointing me to this guide (I was only reading limma since I'm working on array data)!

I've adapted my code to read:

fit <- lmFit(expr.rma, design)
fit2 <- eBayes(fit)
tt <- topTable(fit2, coef = 1:5, adjust.method = "BH", n=Inf)

However, this yields very low P-values with probes that present strikingly similar expression profiles. This is the first few lines of tt:

> head(tt)
                        Basophils    CD4CM    CD4EM    CD8CM    CD8EM  AveExpr
AFFX-HSAC07/X00351_3_at  14.24028 14.15263 14.18607 14.20181 14.22044 14.09083
201492_s_at              14.40483 14.38230 14.36668 14.40990 14.42368 14.14275
213867_x_at              14.44159 14.37452 14.40886 14.44252 14.43441 14.36701
200801_x_at              14.32623 14.28367 14.33340 14.32969 14.36496 14.25970
200031_s_at              14.29166 14.12298 14.09500 14.17429 14.15840 14.03807
AFFX-hum_alu_at          14.38317 14.32959 14.30506 14.34855 14.33721 14.32918
                               F       P.Value     adj.P.Val
AFFX-HSAC07/X00351_3_at 89780.10 6.079421e-292 1.394862e-287
201492_s_at             78894.11 4.097506e-287 4.700658e-283
213867_x_at             76830.96 4.003192e-286 3.061641e-282
200801_x_at             69738.20 1.661760e-282 9.531857e-279
200031_s_at             59493.76 1.428835e-276 6.556640e-273
AFFX-hum_alu_at         57484.16 2.744180e-275 1.049375e-271

Also, when I run summaryTests(fit2), everyone is significant in all comparisons but the batch columns! See: https://pastebin.com/jHiYNYuN

What am I missing here? It seems I'm actually testing for expression, thus all probes appear significant - but I have included an intercept in my design formula. Should I only include the groups to be compared in the design formula?

EDIT: my design:

design <- model.matrix(~0+group+batch)

ADD REPLY
0
Entering edit mode

I realize now that I should be using a design of the form

design <- model.matrix(~group+batch), where (Intercept) now shows in the first column, and now I have

> tt <- topTable(fit2, coef = 2:4, adjust.method = "BH", n=Inf)
> head(tt)
                   CD4EM    CD8CM    CD8EM  AveExpr        F      P.Value
205821_at   -0.157680954 5.844033 5.898132 8.923262 627.9269 1.894770e-22
207979_s_at  0.080600853 3.712907 4.192378 7.333154 303.4006 7.873893e-19
213915_at    0.838234928 3.902816 5.497322 9.059972 283.0714 1.731043e-18
205758_at   -0.020585088 6.756080 6.827124 9.711344 229.5047 1.857388e-17
210606_x_at -0.002967552 3.152527 3.942116 7.352508 223.3591 2.521977e-17
207795_s_at -0.069153206 3.275952 4.294975 7.295654 197.8585 9.850590e-17
               adj.P.Val
205821_at   4.347361e-18
207979_s_at 9.032931e-15
213915_at   1.323902e-14
205758_at   1.065398e-13
210606_x_at 1.157285e-13
207795_s_at 3.766866e-13

My question now is whether I should subset my whole dataset prior to perform this only to retain groups of interest (in this case, samples related to the comparisons between Basophils x CD4CM x CD4EM x CD8CM x CD8EM), or if using coef = 2:4 is enough to obtain these DEGs across these specific groups, which are only a portion of all groups present in this dataset? I also have a lot of other coefficients calculated for both the remaining groups as well as the batches.

ADD REPLY
0
Entering edit mode

Is it really of interest for your research to know which genes are differential between a basophil and a CD8+? These are fundamentally different cell types. I would really try and only perform the meaningful comparisons that help driving your research and avoid making unnecessary comparisons. I'd stick with the pairwise comparisons instead of these Anova-like tests. What question do you want to answer?

ADD REPLY
0
Entering edit mode

This is a quite complete dataset of cellular lineages. I used basophile as an example for simplicitiy because it was at the start of my sample array. What I would like to really see was differences between the (somewhat related) Megakaryocyte / erythroid progenitor VS granulocyte/monocyte progenitor VS Common myeloid progenitor VS Naive CD4+ VS Naive CD8+.

ADD REPLY
0
Entering edit mode

Create new combined levels / factors in your input metadata (via paste() or paste0()) such that you can then conduct the comparisons that you want.

ADD REPLY
0
Entering edit mode

Yes, I have each of the groups from my previous message as a factor within the group variable that compose my design. At this moment I'm leaning towards @ATPoint's suggestion of performing solely 2-group comparisons, although I would like to know whether my previous reasoning and code is appropriate for an ANOVA-like comparison, particularly in the presence of the batch covariate as all examples that I saw in other posts had only a simple design of the form design <- model.matrix(~ group)

ADD REPLY

Login before adding your answer.

Traffic: 1808 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6