including samples for which group is unknown to help adjust for another variable
10 months ago
pilargmarch ▴ 110

Hi,

I want to compare the mRNA expression in cancer patients that have a specific mutation vs. cancer patients that don't have said mutation, using data from TCGA. I'm trying to combine data from two different tumor types (LUSC and LUAD), which I know is not appropriate to do given their completely different molecular profiles, but I'm wondering if it's at all possible.

Out of the 27 patients with the mutation, 24 belong to LUSC and 3 belong to LUAD. Of the 14 patients without the mutation, 12 belong to LUAD and 2 belong to LUSC. If I were to compare the mutated group vs. non-mutated group without taking into account the tumor type (~ group), I'd just be looking at differences between LUSC and LUAD, which doesn't tell me anything about the effect of the mutation.

0) My original approach was to include the tumor type in the experimental design (~ project + group). However, this finds no differences, which is expected as there are only 2 and 3 samples that can be used to "correct" for the tumor type.

So, I've thought of two different approaches to try to tackle this. Both of them use patients for which the mutation has not been tested (i.e. there is no WGS or WXS data available), meaning we don't know which group they belong to, which is why I'm not sure if this is statistically correct. In my head, it makes sense to include patients for which the group is unknown, because this would help differentiate between what's different due to the mutation and due to the tumor type.

1) Do DEA for all patients in LUSC (n = ~500) vs. LUAD (n = ~500) (including mutated and non-mutated patients, as well as those patients that weren't tested for mutations) (~ project) to see which DEGs are caused by the tumor type. The "true" differentially expressed genes would then be the DEGs that came up in the simplest analysis (~ group) but weren't DEGs in this analysis (~ project), which means they were actually due to the mutation and not to the tumor type. This first approach finds a few differences that I think make sense biologically speaking (sketched in code after approach 2 below).

or...

2) For the "group" variable, assign all untested patients (n = ~1000) as "unknown" and then do DEA incorporating both the "group" and "tumor type" as factors (~ project + group). Then, I'd get the DEGs for the contrast that I'm interested in, which is mutation vs. no.mutation for the "group" variable (ignoring the "unknown" group), using results(dds, alpha = 0.05, contrast=c("group", "mutation", "no.mutation")). This second approach finds the same differences as the first approach plus some others that make biological sense given the mutation.
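In DESeq2 code, the two approaches would look roughly like the sketch below. All object names are illustrative and not from any real pipeline: `counts_tested`/`coldata_tested` stand for the 41 tested patients, and `counts_all`/`coldata_all` for all ~1000 patients, with `group` set to NA where the mutation status is untested.

    library(DESeq2)

    ## Approach 1: simple mutation contrast (~ group, tested patients only)
    ## vs. a tumor-type-only contrast (~ project, all patients), then keep
    ## the DEGs that are unique to the mutation contrast.
    dds_group <- DESeqDataSetFromMatrix(countData = counts_tested,
                                        colData   = coldata_tested,
                                        design    = ~ group)
    dds_group <- DESeq(dds_group)
    res_group <- results(dds_group, alpha = 0.05,
                         contrast = c("group", "mutation", "no.mutation"))

    dds_proj <- DESeqDataSetFromMatrix(countData = counts_all,
                                       colData   = coldata_all,
                                       design    = ~ project)
    dds_proj <- DESeq(dds_proj)
    res_proj <- results(dds_proj, alpha = 0.05,
                        contrast = c("project", "LUSC", "LUAD"))

    deg_group <- rownames(res_group)[which(res_group$padj < 0.05)]
    deg_proj  <- rownames(res_proj)[which(res_proj$padj < 0.05)]
    candidates_approach1 <- setdiff(deg_group, deg_proj)

    ## Approach 2: label untested patients as "unknown" and fit one model
    ## with both tumor type and group; the contrast of interest ignores the
    ## "unknown" level.
    coldata_all$group <- factor(ifelse(is.na(coldata_all$group), "unknown",
                                       as.character(coldata_all$group)),
                                levels = c("no.mutation", "mutation", "unknown"))
    dds_both <- DESeqDataSetFromMatrix(countData = counts_all,
                                       colData   = coldata_all,
                                       design    = ~ project + group)
    dds_both <- DESeq(dds_both)
    res_both <- results(dds_both, alpha = 0.05,
                        contrast = c("group", "mutation", "no.mutation"))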

I'm sure the original approach is the purest statistically speaking, but due to the uneven distribution of the samples across the two variables (project and group), I think it's impossible to find any differences this way. So my question is: are the first and second approaches statistically valid?

Here's another related question: if I were to only compare mutated vs. non-mutated patients within a single tumor type (i.e., LUSC), would it be better to include the untested patients (group = "unknown") or not? My intuition says that a larger sample size would help during normalization, but I don't see a change in the normalized expression of the tested patients! Even so, the number of DEGs gets reduced compared to when I only include the tested patients, which makes me trust the DEGs that were common to both analyses even more. But why does this happen? Are the additional samples adjusting the statistical analysis, even though they don't change the normalized values?
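One way to check this directly (a sketch with made-up names: `dds_tested_only` holding only the tested patients and `dds_with_unknown` adding the untested ones) is to compare the size factors with and without the extra samples, since DESeq2 normalizes per sample with the median-of-ratios method:

    sf_small <- sizeFactors(estimateSizeFactors(dds_tested_only))
    sf_large <- sizeFactors(estimateSizeFactors(dds_with_unknown))
    sf_large <- sf_large[names(sf_small)]   # same samples, same order
    # The factors typically shift only slightly when samples are added, so the
    # normalized counts of the tested patients barely change; the extra samples
    # mainly affect dispersion estimation, not normalization.
    plot(sf_small, sf_large); abline(0, 1, lty = 2)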

Sorry for making this so long and overly specific, but I couldn't find any posts about including samples for which the variable of interest is unknown; the posts I found only deal with correcting for a second, known variable.

Thank you in advance!

EDIT: about the extra question, I think the reason I get slightly different results is that the "unknown" group has higher variability, since it includes both mutated and non-mutated patients, and this inflates the per-gene dispersion estimates used for the "mutation" and "no.mutation" groups. In this case, the vignette recommends comparing the two groups by creating a dataset that doesn't include the more variable "unknown" group. I wonder if this also applies to the first question: in theory the second approach should have less statistical power because it includes a highly variable group in the analysis, but in practice it actually gives more DEGs than the first approach.
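For reference, the vignette's suggestion applied to approach 2 would look roughly like this (continuing with the illustrative `dds_both` object from the sketch above): drop the "unknown" samples and rerun the pipeline so the dispersions are re-estimated from the two groups actually being compared.

    dds_sub <- dds_both[, dds_both$group != "unknown"]
    dds_sub$group <- droplevels(dds_sub$group)
    dds_sub <- DESeq(dds_sub)   # re-estimates dispersions on the subset
    res_sub <- results(dds_sub, alpha = 0.05,
                       contrast = c("group", "mutation", "no.mutation"))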

TCGA design RNA-Seq deseq2

Preference for a pooled analysis vs. a stratified analysis can be influenced by the ratio of between-group variance to within-group variance. If you pool the samples and find 0 DEGs, a likely reason is that the between-group variance was much larger than the variance attributable to your mutation of interest.

If this is indeed the case and the pooled analysis essentially failed, I'd attempt a stratified analysis, i.e., LUSC mutated vs. non-mutated to get one set of DEGs, then, separately, LUAD mutated vs. non-mutated to get another. Following this, you can look for commonalities between the two sets of results that are plausibly explicable based on the mutation in question.
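In code, the stratified analysis could look roughly like this (assuming a DESeqDataSet `dds_tested` built from the tested patients only, with `project` and `group` columns in its colData; all names are illustrative):

    library(DESeq2)

    stratified <- lapply(c("LUSC", "LUAD"), function(p) {
      dds_p <- dds_tested[, dds_tested$project == p]   # one tumor type at a time
      design(dds_p) <- ~ group
      dds_p <- DESeq(dds_p)
      results(dds_p, alpha = 0.05,
              contrast = c("group", "mutation", "no.mutation"))
    })
    names(stratified) <- c("LUSC", "LUAD")

    # genes significant in both strata are the most plausibly mutation-driven
    intersect(rownames(stratified$LUSC)[which(stratified$LUSC$padj < 0.05)],
              rownames(stratified$LUAD)[which(stratified$LUAD$padj < 0.05)])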

Thank you for your response! I didn't think of doing the stratified analysis because of the small sample sizes, but it is worth trying. Comparing LUSC mutated (n=24) vs. non-mutated (n=2) finds a couple of DEGs, one of which made biological sense and was also present in the second approach of my post. However, comparing LUAD mutated (n=3) vs. non-mutated (n=12) found no significant differences. In fact, DESeq2 gives me a message during the mean-dispersion relationship step, which I think is tied to the small sample sizes: "note: fitType='parametric', but the dispersion trend was not well captured by the function: y = a/x + b, and a local regression fit was automatically substituted. specify fitType='local' or 'mean' to avoid this message next time."
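For reference, the fit type can also be set explicitly instead of relying on the automatic substitution, e.g. (with `dds_luad` standing for the LUAD-only object):

    dds_luad <- DESeq(dds_luad, fitType = "local")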

Even though there were no significant differences when I pooled all tested patients and tried to control for the tumor type, I still think that trying to incorporate untested patients (be it by the first or second approach) could make sense. I do concede that including additional samples is not the most statistically correct way to study this mutation, so I'll just take what I can get with the stratified analysis for LUSC.

Thank you again!

One gets that message from time to time in reasonably powered datasets.

I'll write out a second approach right quick.

10 months ago
LauferVA 4.2k

In addition to the stratified analysis above, which you might consider framing as an LRT through DESeq2, another approach is to conduct a principal component analysis on the dataset and then determine which principal components correlate with LUAD vs. LUSC.

If you can determine, for instance, that PC1 and PC2 correlate with that designation (by choosing an appropriate correlation metric for the per-sample principal component scores and an indicator variable for LUSC vs. LUAD), then you enter those into the model. This has the effect of stripping out the between-sample variance I mentioned.

I'd generate at least 20 PCs and correlate them all against sample type; it's not uncommon to find that PC1, PC2 and PC7, for instance, all relate. So your model would then look like

Count ~ PC1 + PC2 + PC7 + Project + Mut status + E
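A rough sketch of what this could look like, assuming a DESeqDataSet `dds` with `project` and mutation `group` in its colData; the Spearman correlation and the top-three cutoff are only for illustration:

    library(DESeq2)

    vsd <- vst(dds, blind = TRUE)        # variance-stabilized counts
    pca <- prcomp(t(assay(vsd)))         # rows = samples, columns = genes
    pcs <- pca$x[, 1:20]                 # per-sample scores for the first 20 PCs

    # which PCs track the LUSC/LUAD designation?
    is_lusc <- as.numeric(dds$project == "LUSC")
    cors <- apply(pcs, 2, function(pc) cor(pc, is_lusc, method = "spearman"))
    keep <- names(sort(abs(cors), decreasing = TRUE))[1:3]   # e.g. PC1, PC2, PC7

    # add those PCs to the model alongside project and mutation status; if a PC
    # is nearly collinear with project, drop one of the two or the model matrix
    # will not be full rank
    for (pc in keep) dds[[pc]] <- pcs[, pc]
    design(dds) <- reformulate(c(keep, "project", "group"))
    dds <- DESeq(dds)
    res_pc <- results(dds, alpha = 0.05,
                      contrast = c("group", "mutation", "no.mutation"))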

Finally, the thing that is really limiting statistical power here is the lack of non-mutated samples. It may be possible to get additional mutated and non-mutated samples. If you do that, you definitely want to control for batch somehow (e.g. via the PCs above).

I've had success in rescuing projects with 0 DEGs using this framework.

Edit: Responding to comment from pilargmarch, below

A bit of context, I think, will go a long way towards making it intuitive.

Consider Price et al., 2006, the seminal citation describing the use of principal component analysis, an eigenvector-based decomposition algorithm, to control for ancestry in genome-wide studies of DNA, which has recently passed 10,000 citations.

The reason PCA can help with ancestry is that ancestry correlates with differences in allele frequency at a huge number of SNPs across the genome. In fact, assuming "clean" data with minimal batch effect, the top 2-3 PCs will typically all correlate heavily with continental ancestry in studies of variation in DNA. Why? Because ancestry is the largest influence on what is observed in the data, it ends up in PC1.

To see this more clearly, consider that the PCA algorithm is looking for the largest axis of variation in the high-dimensional dataset - THAT is what it sets as PC1, irrespective of what that variation represents. But because continental ancestry is that largest source of variation, PC1 ends up representing ancestry fairly well.

There is a nearly 1-to-1 correspondence between this case and the case of RNA. Individual SNVs here are stand-ins for individual genes. As you said, the PCs abstract across the level of many genes ... but that is exactly what is necessary to remove a batch effect that cuts across many genes, right?

And, just as the largest axes of variation in DNA tend to be ancestry, the largest axes of variation in mRNA-seq tend to correlate strongly with cell type, which is why they could work as a proxy for two different tumor types.

Finally, please note that PCA and other dimensionality reduction schemes (UMAP, t-SNE) are commonplace in scRNA-seq.

As such, intuitive or not, this kind of approach has a fairly strong grounding in the scientific literature.

So, if I understand correctly, you're suggesting that I identify which principal components can separate the tested patients well into their respective tumor types, and then use those PCs as variables in the model. It is a bit unintuitive to me because at that point the PCs are just numbers associated with each sample; they're no longer tied to the genes that made up the PC in the first place, almost like they're not biological variables anymore. But I see how that could work, so I'll have to try. Thank you!

Yep! Updated my answer to respond to this. I think they ARE biological variables. It is understanding the nature of those variables that is the tricky part. This is why you must correlate each PC against the designation you care about (LUSC vs. LUAD) to see if it represents it tolerably well.

Actually, I just remembered the perfect paper to illustrate that idea. Look at this: https://pubmed.ncbi.nlm.nih.gov/35976223/. In this manuscript, Zaydman and Raman et al. show that if you do a very deep eigenvector-based decomposition, the components (not called PCs because they are using a different, but closely related, decomposition algorithm) are meaningful even once you get down to component 1000, 10000, and beyond.

To see this, look at figure 3. The size of the spectral component cleanly moves through kingdom, phylum, class, order, family, genus, and species, then goes subcellular as you go from C1 to C50 or so.
