I want to compare the mRNA expression in cancer patients that have a specific mutation vs. cancer patients that don't have said mutation, using data from TCGA. I'm trying to combine data from two different tumor types (LUSC and LUAD), which I know is not appropriate to do given their completely different molecular profiles, but I'm wondering if it's at all possible.
Out of the 27 patients with the mutation, 24 belong to LUSC and 3 belong to LUAD. Of the 14 patients without the mutation, 12 belong to LUAD and 2 belong to LUSC. If I were to compare the mutated group vs. non-mutated group without taking into account the tumor type (~ group), I'd just be looking at differences between LUSC and LUAD, which doesn't tell me anything about the effect of the mutation.
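To make the imbalance concrete, here is a minimal base R sketch that rebuilds the sample table from the counts above. The data frame and its column names are only illustrative, not my actual colData.

```r
# Counts from the post: 27 mutated (24 LUSC, 3 LUAD), 14 non-mutated (2 LUSC, 12 LUAD)
samples <- data.frame(
  project = factor(c(rep("LUSC", 24), rep("LUAD", 3), rep("LUSC", 2), rep("LUAD", 12)),
                   levels = c("LUAD", "LUSC")),
  group   = factor(c(rep("mutation", 27), rep("no.mutation", 14)),
                   levels = c("no.mutation", "mutation"))
)

# Cross-tabulate mutation status against tumor type
tab <- table(samples$group, samples$project)
print(tab)

# Fraction of samples sitting in the two dominant cells
# (mutated LUSC and non-mutated LUAD)
confounded <- (tab["mutation", "LUSC"] + tab["no.mutation", "LUAD"]) / sum(tab)
print(confounded)

# Fisher's exact test for association between mutation status and tumor type
ft <- fisher.test(tab)
print(ft$p.value)
```

With 36 of the 41 tested samples in those two cells, a plain ~ group comparison is close to a LUSC vs. LUAD comparison.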
0) My original approach was to include the tumor type in the experimental design (~ project + group). However, this finds no differences, which is expected, as only 2 non-mutated LUSC samples and 3 mutated LUAD samples are available to "correct" for the tumor type.
So, I've thought of two different approaches to try to tackle this. Both of them use patients for which the mutation has not been tested (i.e. there is no WGS or WXS data available), meaning we don't know which group they belong to, which is why I'm not sure if this is statistically correct. In my head, it makes sense to include patients for which the group is unknown, because this would help differentiate between what's different due to the mutation and due to the tumor type.
1) Do DEA for all patients in LUSC (n ≈ 500) vs. LUAD (n ≈ 500), including mutated and non-mutated patients as well as patients that weren't tested for mutations (~ project), to see which DEGs are caused by the tumor type. The "true" differentially expressed genes would then be the DEGs that came up in the simplest analysis (~ group) but weren't DEGs in this analysis (~ project), meaning they were due to the mutation and not to the tumor type. This first approach finds a few differences that I think make sense biologically.
2) For the "group" variable, assign all untested patients (n = ~1000) as "unknown" and then do DEA incorporating both the "group" and "tumor type" as factors (~ project + group). Then, I'd get the DEGs for the contrast that I'm interested in, which is mutation vs. no.mutation for the "group" variable (ignoring the "unknown" group), using
results(dds, alpha = 0.05, contrast=c("group", "mutation", "no.mutation")). This second approach finds the same differences as the first approach plus some others that make biological sense given the mutation.
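For reference, the setup of the second approach looks roughly like the sketch below. It runs on simulated counts from DESeq2's makeExampleDESeqDataSet() rather than the TCGA data, and the sample numbers per cell are made up for illustration (only the tested counts mirror the ones above), so the output says nothing about my real results.

```r
library(DESeq2)

set.seed(1)
# Simulated counts with no true effects; a stand-in for the TCGA count matrix (assumption)
dds <- makeExampleDESeqDataSet(n = 500, m = 60)

# Hypothetical sample annotation: tumor type, and mutation status with an "unknown" level
dds$project <- factor(rep(c("LUAD", "LUSC"), each = 30))
dds$group <- factor(
  c(rep("mutation", 3),  rep("no.mutation", 12), rep("unknown", 15),  # LUAD samples
    rep("mutation", 24), rep("no.mutation", 2),  rep("unknown", 4)),  # LUSC samples
  levels = c("no.mutation", "mutation", "unknown")
)

# Tumor type as a covariate, mutation status as the variable of interest
design(dds) <- ~ project + group
dds <- DESeq(dds)

# Mutated vs. non-mutated; the "unknown" samples are part of the fit
# but do not enter this contrast directly
res <- results(dds, alpha = 0.05, contrast = c("group", "mutation", "no.mutation"))
summary(res)
```

In this design the "unknown" samples contribute to the project coefficient and to the dispersion estimates, which is the part I'm unsure about.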
I'm sure the original approach is the purest statistically speaking, but due to the uneven distribution of the samples across the two variables (project and group), I think it's impossible to find any differences this way. So my question is: are the first and second approaches statistically valid?
Here's another related question: if I were to only compare mutated vs. non-mutated patients within a single tumor type (e.g., LUSC), would it be better to include the untested patients (group = "unknown") or to leave them out? My intuition is that a larger sample size would help during normalization, but I don't see a change in the normalized expression of the tested patients. Even so, the number of DEGs is lower than when I only include the tested patients, which makes me trust the DEGs that were common to both analyses even more. But why does this happen? Are the additional samples affecting the statistical analysis, even if they don't change the normalized values?
Sorry for making this so long and overly specific, but I couldn't find any posts about including samples for which the variable of interest was unknown, only to correct for a second variable.
Thank you in advance!
EDIT: about the extra question, I think the reason I get slightly different results is that the "unknown" group has higher variability, since it includes both mutated and non-mutated patients, and this inflates the per-gene dispersion estimate that is also used for the "mutation" and "no.mutation" groups. For this case, the DESeq2 vignette recommends comparing the two groups using a dataset that doesn't include the more variable "unknown" group. I wonder if this also applies to the first question: in theory, the second approach should have less statistical power because it includes a highly variable group in the analysis, but in practice it gives more DEGs than the first approach.
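This is roughly how I compare the two fits within a single tumor type. It is a sketch on simulated counts (makeExampleDESeqDataSet() again, with made-up group sizes), so the "unknown" group here is not actually more variable; it only shows the mechanics of dropping that group and comparing the per-gene dispersions.

```r
library(DESeq2)

set.seed(2)
# Simulated single-tumor-type dataset (assumption: stand-in for the LUSC samples)
dds_all <- makeExampleDESeqDataSet(n = 500, m = 40)
dds_all$group <- factor(
  c(rep("mutation", 12), rep("no.mutation", 8), rep("unknown", 20)),
  levels = c("no.mutation", "mutation", "unknown")
)
design(dds_all) <- ~ group

# Tested-only dataset: drop the "unknown" samples and the now-empty factor level
dds_tested <- dds_all[, dds_all$group != "unknown"]
dds_tested$group <- droplevels(dds_tested$group)

# Fit both datasets separately
dds_all <- DESeq(dds_all)
dds_tested <- DESeq(dds_tested)

# Ratio of final per-gene dispersion estimates (all samples / tested only)
disp_ratio <- dispersions(dds_all) / dispersions(dds_tested)
summary(disp_ratio)

# Same contrast in both fits
res_all <- results(dds_all, alpha = 0.05,
                   contrast = c("group", "mutation", "no.mutation"))
res_tested <- results(dds_tested, alpha = 0.05,
                      contrast = c("group", "mutation", "no.mutation"))
```

On the real data, ratios mostly above 1 would be consistent with the "unknown" group inflating the dispersion estimates.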