Question

Complex multifactorial DE analysis with limma/edgeR based on rnaseq data

0

Entering edit mode

7 months ago

svlachavas ▴ 790

Dear Biostars,

I would like to ask you one specific question regarding the DE analysis on an RNASeq dataset of samples, spanning a multi-factor experimental design. Briefly, unstimulated neutrophils of 4 healthy donors, were cultivated with distinct treatment conditions-that is, supernatant of organoids from different cancer/normal patient samples; There are also an additional treatment “sub-conditions”, which are not from an organoid patient sample, rather from 2 different "protein cocktails", inducing an”anti-tumoral” and “pro-tumoral” effect; A small subset of the relative phenotypic information is depicted below:

head(nets.pheno.dt.2,15)
         Donor_ID Condition Treatment
CRC1_D10 Donor_10    Cancer      CRC1
CRC1_D11 Donor_31    Cancer      CRC1
CRC1_D5   Donor_5    Cancer      CRC1
CRC2_D10 Donor_10    Cancer      CRC2
CRC2_D5   Donor_5    Cancer      CRC2
CRC2_D7   Donor_7    Cancer      CRC2
CRC3_D10 Donor_10    Cancer      CRC3
CRC3_D11 Donor_11    Cancer      CRC3
CRC3_D5   Donor_5    Cancer      CRC3
CRC4_D11 Donor_11    Cancer      CRC4
CRC4_D5   Donor_5    Cancer      CRC4
CRC4_D7   Donor_7    Cancer      CRC4
CRC5_D10 Donor_10    Cancer      CRC5
CRC5_D11 Donor_11    Cancer      CRC5
CRC5_D5   Donor_5    Cancer      CRC5
...

In addition, after small processing and filtering, I created the following MDS plot:

y <- DGEList(counts=mat.counts, samples = nets.pheno.dt.2)
y$genes <- data.frame(ENSEMBL=rownames(y), SYMBOL=final.annot$GENENAME)
cond.NET <- y$samples$Condition
keep.exprs <- filterByExpr(y, group=cond.NET)
y.filt <- y[keep.exprs,, keep.lib.sizes=FALSE]

dge <- calcNormFactors(y.filt, method = "TMM")
lcpm <- cpm(dge, log=TRUE)
plotMDS(lcpm, labels=dge$samples$Condition, col=as.numeric(dge$samples$Donor_ID))

MDS plot: labeling is based on the Condition factor, color on the 4 distinct donor IDs

Our major biological questions are: to have a general comparison of cancer vs normal “treatment” conditions; In addition if doable, to compare also the different “cancer-patient” treatment/condition effects, regardless of donor influence (i.e. CRC1 vs CRC2, Anti-tumoral versus tumoral, etc.);

However, as you can see from the MDS plot, where the different color denotes the different donor, clearly shows a dominant effect of the distinct donors, where the sub-conditions are barely separated, except mostly the "anti-tumoral" treatment; Hence, no differences between the "cancer" and "normal" treatment conditions; In addition, not all levels of experimental factors are represented in all “donors”; for example, donor 5 (dark blue color) does not include any “normal” treatment conditions;

Any suggestions or ideas on this matter would be greatly appreciated !!

Cheers,
Efstathios

R edgeR limma RNA-seq • 772 views

ADD COMMENT • link updated 7 months ago by LChart 3.9k • written 7 months ago by svlachavas ▴ 790

0

Entering edit mode

Our major biological questions are: to have a general comparison of cancer vs normal “treatment” conditions; In addition if doable, to compare also the different “cancer-patient” treatment/condition effects, regardless of donor influence (i.e. CRC1 vs CRC2, Anti-tumoral versus tumoral, etc.);

This is not a question. I'm not saying this to be pedantic -- this is the fundamental reason why you feel stuck. Without a specific hypothesis like "Question: Do treatments with cancer supernatant conditions up-regulate inflammatory genes compared to normal supernatant treatments? Hypothesis: Yes" you can't figure out the appropriate measurements, contrasts, and covariates to answer the question. Once you have a contrast for one gene (or one pathway) you can scale it up for all genes or all pathways.

Start with some hypotheses.

ADD REPLY • link 7 months ago by LChart 3.9k

0

Entering edit mode

Dear LChart,

thank you very much for your reply !! Initially apologies for not providing more information and methodological details, as I had to discuss again with our collaborators, as I did not participate in the experiment/initial experimental design, rather that just involved at this level trying to support the downstream analysis;

On this direction, as pinpointed above, two are the main specific questions-contrasts:

Indeed, one original hypothesis that we wanted to investigate, is whether we can identify overall any differentially expressed genes between cancer supernatant versus normal supernatant conditions? And in a subsequent level exploit if we can find any genes related to inflammation and/or immune response; In parallel, also investigate the Pro-tumoral vs Anti-tumoral condition effect, based on the "dual-role" of neutrophils. On this premise, I tried the following approach:

Cond_NET <- as.factor(dge$samples$Condition)
Donor_group <- as.factor(dge$samples$Donor_ID)

design <- model.matrix(~0 + Cond_NET + Donor_group)
colnames(design) <- gsub("Cond_NET", "", colnames(design))

v1 <- voomWithQualityWeights(dge, design, plot = T)

voom.fit <- lmFit(v1, design)

contrast.matrix <- makeContrasts(comp1= Cancer - Normal,
                                 comp2= Pro_tumoral - Anti_tumoral,
                                 levels=design)   

fit2 <- contrasts.fit(voom.fit, contrast.matrix)
fit3 <- eBayes(fit2, trend = TRUE, robust = TRUE)

Hence, I tried a general "overall" group comparison of the "general" conditions, trying to model the "donor" effect and including it into the design matrix as a nuisance variable; However, my concern here is that one of the donors does not had any "normal" samples, and accordinly the differences I see (from the DE results using conventional cutoffs) are mostly derived from this "batch"? (based also on the MDS plot)

Next, some more complicated questions would be to compare between different Treatment groups; For example compare the different cancer supernatant conditions-like CRC1 vs CRC2-while still controlling for the donor effect; However, as this resembles both between and within subject comparisons, which would be the most robust way to proceed? For example, implementing the duplicateCorrelation factor, and following something like this?

g <- factor(paste0(dge$samples$Condition, dge$samples$Donor_ID))
design <- model.matrix(~0 + g)

# and then block on the Donor_ID variable;

Still, here my major "issue" is that the different donors, are not represented by all levels of conditions, such as the absence of any "normal" treatments in 1 out of the 4 donors; And if this could inflate or bias the estimation of "intra-donor" correlations;

To summarize, these are my two major "methodological" questions and how the design matrices could be robustified based on our questions;

PS1: The columns Condition & Treatment might seen a bit "redundant" and not as in the same example like the section 9.7 of limma's user guide for multi-level experiments; However, the denotion "CRC1", "CRC2" are essentially from different patient cancer/normal supernatants, hence nor technical replicates that could be averaged; That is why I implemented this information into a separate column, but not completely certain if fits my second question case scenario;

Thank you in advance;

Efstathios

ADD REPLY • link 7 months ago by svlachavas ▴ 790

1

Entering edit mode

I tried a general "overall" group comparison of the "general" conditions, trying to model the "donor" effect and including it into the design matrix as a nuisance variable; However, my concern here is that one of the donors does not had any "normal" samples, and accordinly the differences I see (from the DE results using conventional cutoffs) are mostly derived from this "batch"? (based also on the MDS plot)

IMO: you set it up the "right" way using a fixed effect for the donor. Unfortunately it is all too common that the result is driven by the small subset of samples that were incomplete. The obvious thing to do is to drop the donor(s) with only one group in the contrast, and evaluate the impact on the results. If there are a handful of genes you're interested in, you could try a fully Bayesian hierarchical model, which would allow partial pooling across donors (using something like pymc), but this won't scale transcriptome-wide at all. I had reasonable success with this approach in precisely this scenario (proteomics data, however, so I did not have to deal with setting up a negative binomial model for counts).

Still, here my major "issue" is that the different donors, are not represented by all levels of conditions, such as the absence of any "normal" treatments in 1 out of the 4 donors; And if this could inflate or bias the estimation of "intra-donor" correlations;

So the duplicateCorrelation factor is in effect adding a random effect variable, which is a block matrix M[i,j] = 1 iff donor[i] == donor[j]. Raggedness of treatment combined with not also having a random effect for the treatment may indeed result in an omitted variable bias in the estimate of the random effect, and it's hard to say if that will result in an underestimate (variance due to donor is deflated as donors look artificially more independent as treatments are largely random across donor) or overestimate (variance do to donor is inflated as donors look artificially dependent as treatments are not random across donor, i.e., the first 4 donors got CRC1 and CRC2; the next 4 got CRC1 and CRC3, etc.). Even so, having the factor is likely better than not having it.

I would argue that a more robust approach (though somewhat of a pain) would be to perform a meta analysis: restrict the contrasts only to those donors that support them, and meta-analyze the resulting statistics.

ADD REPLY • link 7 months ago by LChart 3.9k

0

Entering edit mode

Dear LChart,

thanks a million for your detailed response and explanations !! Some quick updated comments (hopefully not disturb you further) in order to fully understand your suggestions:

1) Concerning the general comparison, based on dropping the specific donor, which has not any "normal" samples in the condition variable; You would remove all related samples from this donor from start? So anything from raw counts up to filtering is completely "re-computed"? and then move to the actual design matrix setup as in point 1?

2) Thank you also for your confirmation regarding the problem of implementing the "duplicateCorrelation" approach;

3) In parallel, would you find also "robust" also a "per-sample" analysis? like an ssGSEA implementation based on normalized expression values (or z-scores) for each sample? To perform a clustering analysis and also investigate some functional underpinnings and/or how the samples are grouped? And also on this premise, you would still exclude the donor without the normal samples? For the analysis to be more "coherent"?

PS: I will definately try as a final step to consider the "meta-analysis" approach;

Best,

Efstathios

ADD REPLY • link 7 months ago by svlachavas ▴ 790

1

Entering edit mode

(1) Yes; but you can simply subset the DESeq via dds[,dds$donor != UNWANTED_DONOR]

(3) Given the initial results (and the MDS) that donor will cluster by itself anyway, so dropping it likely won't impact meta-analysis. Depends on the loading of that donor onto the MDS/PCs -- you could dry PCA without that donor and projecting that donor onto those PCs. For ssGSEA - in practice treat scores as though they were normalized gene expression; so same considerations apply.

This is from my phone so sorry for brevity.

ADD REPLY • link 7 months ago by LChart 3.9k