I have a question regarding differential expression (DE) with limma/voom in an RNA-Seq experiment of ~250 samples. I have 4 different point mutations that are mutually exclusive, and I wish to identify DE isoforms specific to each mutation. Additionally, I have reason to believe that the mutations cause similar differential expression for some isoforms, and I wish to identify those as well.
My questions are:
- What is the best approach to identify the differences and the similarities of those mutations?
- How do I address the issue of confounding?
Specifically, I have 3, 4, 6 and 17 samples for the four mutations, respectively, as well as ~200 samples as a control group.
The approach I have come up with so far is to:
1) perform a DE analysis for each mutation comparing it to the control group (which does not include the other mutations), like this:
design.matrix <- model.matrix(~ factor(mut1), data)
design.matrix <- model.matrix(~ factor(mut2), data)  # ... and so on for the remaining mutations
2) after doing this for each of the 4 mutations, combine them into a single binary variable and perform a DE analysis comparing all of them against the control group:
design.matrix <- model.matrix(~ factor(all.mutations), data)
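To make the two steps concrete, here is a minimal base R sketch of how I build the design matrices. The sample sheet is a toy stand-in: the column names (`mutation`, `carrier`, `any.mutation`), the sample IDs and the exact control count of 200 are assumptions for illustration, not my real data. Each resulting design would then be passed to voom and lmFit together with the matching subset of the count matrix.

```r
# Toy sample sheet; group sizes mirror the post, everything else is made up.
samples <- data.frame(
  id = paste0("s", 1:230),
  mutation = factor(
    rep(c("control", "mut1", "mut2", "mut3", "mut4"),
        times = c(200, 3, 4, 6, 17)),
    levels = c("control", "mut1", "mut2", "mut3", "mut4")
  )
)

# Step 1: one design per mutation, using only controls plus that mutation.
mutations <- setdiff(levels(samples$mutation), "control")
designs <- lapply(mutations, function(m) {
  sub <- samples[samples$mutation %in% c("control", m), ]
  sub$carrier <- factor(sub$mutation == m,
                        levels = c(FALSE, TRUE),
                        labels = c("control", m))
  model.matrix(~ carrier, data = sub)
})
names(designs) <- mutations

# Step 2: pool all four mutations into one binary variable vs. control.
samples$any.mutation <- factor(
  ifelse(samples$mutation == "control", "control", "mutated"),
  levels = c("control", "mutated")
)
design.all <- model.matrix(~ any.mutation, data = samples)

# Rows = samples used, columns = intercept + mutation effect.
print(sapply(designs, dim))
print(dim(design.all))
```

Setting "control" as the first factor level makes the second coefficient the mutation-vs-control effect in every design.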
My reasoning is that, since the mutations are mutually exclusive, I can compare each mutation against the control group (which does not include the other mutations) to identify DE isoforms specific to each mutation. Afterwards, by combining them, I hope to highlight isoforms that are similarly DE across all mutations. If I check the overlap of the last analysis with the preceding 4, I should be able to identify at least some isoforms that are affected in a common way. Or is it sufficient to just compare the overlap of the first 4 analyses, without the 2nd step?
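The overlap check I have in mind would look roughly like the sketch below. The isoform IDs are placeholders, not real results, and `de.hits` is a hypothetical list holding the significant isoforms from each of the five analyses.

```r
# Hypothetical DE isoform IDs per analysis (placeholders, not real results).
de.hits <- list(
  mut1          = c("iso01", "iso02", "iso05"),
  mut2          = c("iso02", "iso05", "iso07"),
  mut3          = c("iso02", "iso03", "iso05"),
  mut4          = c("iso02", "iso05", "iso09"),
  all.mutations = c("iso02", "iso05", "iso08")
)
per.mutation <- de.hits[c("mut1", "mut2", "mut3", "mut4")]

# Isoforms called DE in every per-mutation analysis.
shared.by.all <- Reduce(intersect, per.mutation)

# Of those, the ones also called DE in the pooled analysis (step 2).
confirmed.by.pooled <- intersect(shared.by.all, de.hits$all.mutations)

# Isoforms called DE in exactly one per-mutation analysis.
hit.counts <- table(unlist(per.mutation))
specific <- lapply(per.mutation, function(ids) ids[hit.counts[ids] == 1])

print(shared.by.all)
print(confirmed.by.pooled)
print(specific)
```

Note that this only compares lists of significant isoforms, so it inherits whatever cutoff was used in each analysis.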
As to the 2nd question: is there a rule of thumb for how many confounders I can add in a limma analysis while avoiding overfitting? Since in a differential expression analysis we don't really have "events", I am unsure how to determine the number of confounders I can adjust for. Especially for the mutation with the smallest group (only 3 samples), I am unsure whether the relatively large control group of ~200 samples permits me to adjust for multiple confounders.
Any input is welcome, thanks in advance!