Hello guys,

I have a question regarding differential expression (DE) with limma/voom in an RNA-Seq experiment of ~250 samples. I have 4 different point mutations that are **mutually exclusive** and I wish to identify DE isoforms specific for each mutation. Additionally, I have reason to believe that the mutations cause similar differential expression for some isoforms and I also wish to identify those.

**My questions are:**

- What is the best approach to identify the differences and the similarities of those mutations?
- How do I adress the issue of confounding?

Specifically, **I have 3, 4, 6 and 17 samples for each mutation respectively**, as well as **~200 samples as a control group**.

The approach I have come up with so far is to:

**1)** perform a DE analysis for each mutation comparing it to the control group (which does not include the other mutations), like this:

*design.matrix <- model.matrix(~ factor(mut1), data)*

*design.matrix <- model.matrix(~ factor(mut2), data)*... and so on

**2)** after doing this for each of the 4 mutations, combine them into a single binary variable and perform a DE analysis comparing all of them against the control group:

design.matrix <- model.matrix(~ factor(all.mutations), data

My way of thinking is that since the mutations are mutually exclusive, I can compare each mutation against the control group **(which does not include the other mutations)** in order to identify DE isoforms **specific** to each mutation. Afterwards by combing them, I hope to highlight isoforms that are similarly DE for all mutations. If I check the overlap of the last analysis with the preceding 4 I should be able to identify at least some isoforms that are affected in a common way. Is it maybe sufficient to just compare the overlap of the first 4 analyses without the 2nd step?

As to the 2nd question. Is there a rule of thumb as to how many confounders I can add in a limma analysis while **avoiding overfitting**? Since in a differential expression analysis we don't really have "events" I am unsure how to determine the number of confounders I can adjust for. Especially for the mutation with the smallest subset (only 3 mutations) I am unsure if the relatively large control group of ~200 samples permitts me to adjust for multiple confounders.

Any input is welcome, thanks in advance!

Stefan

Since the mutations are mutually-exclusive I'd create a vector

`Muts`

, in which the mutation status is specified for each subject, i.e.:I would create a design matrix as below:

Then fit the linear model:

Then extract the contrasts:

You might also ask this question directly at the BioC. Gordon Smyth (as well as other authors and maintainers of popular packages) are outstandingly responsive, especially to well-written questions like yours.

Thank you! I will try this as well