Hi all,
I am working on a DE analysis of primary tumors vs. metastases using a small set of paired samples (8 primary tumors and 8 metastases).
After the variance-stabilizing transformation in DESeq2, my PCA plot shows that the samples group by patient, and I cannot really separate the primary and metastasis groups. As a consequence, I cannot find any differentially expressed genes between the tested conditions. In DESeq2 I tried including both the patient and the condition in my design formula (design = ~ pat + cond), but it does not change anything.
I decided to test the new batch-effect adjustment tool (ComBat-Seq) on my count matrix, using patient as the batch and specifying the condition as a biological covariate. It improves my PCA plot, and I do find relevant genes associated with metastasis when I perform the DE analysis on the adjusted data.
My question is: is it wrong to use the patient label as the batch and apply such an adjustment to my matrix? Is there any other approach I could try to alleviate the patient effect in my analysis?
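library(DESeq2)   # DESeqDataSetFromMatrix, vst, plotPCA
library(ggplot2)  # geom_text, aes
library(sva)      # ComBat_seq

# DE
# colData needs one row per sample with the columns pat and cond
dds <- DESeqDataSetFromMatrix(countData = matrix_prim_vs_pm,
                              colData = cond,
                              design = ~ pat + cond)
keep <- rowSums(counts(dds)) >= 10   # drop genes with fewer than 10 reads in total
dds <- dds[keep, ]
dds$cond <- factor(dds$cond, levels = c("prim", "pm"))   # prim is the reference level
vsd <- vst(dds)   # named vsd so the object does not shadow the vst() function

# PCA plot
plotPCA(vsd, intgroup = "cond") + geom_text(aes(label = name), vjust = 2)

# batch adjustment
# batch and group are per-sample vectors, taken here from the sample table stored in dds
adjusted_counts <- ComBat_seq(as.matrix(matrix_prim_vs_pm), batch = dds$pat, group = dds$cond)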
PCA without batch correction
PCA after ComBat-Seq
Thanks in advance
I'd be wary when doing batch correction while informing the batch correction tool about your biological conditions. Try permuting your condition (metastasis/control) labels and see how the clustering looks following ComBat.
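A minimal sketch of that permutation check, in base R only. The `pat` and `cond` vectors below are hypothetical stand-ins for the 8 paired samples in the question, and `permute_within_patient` is a helper name made up for this illustration. Labels are shuffled within each patient so the paired structure is preserved; if ComBat-Seq run with the shuffled labels as `group` still produces a clean "condition" separation, the separation is probably an artifact of the adjustment.

```r
set.seed(1)

# Hypothetical sample annotation: 8 patients, one primary and one metastasis each
pat  <- factor(rep(paste0("P", 1:8), each = 2))
cond <- factor(rep(c("prim", "pm"), times = 8), levels = c("prim", "pm"))

# Shuffle the condition labels within each patient, keeping one prim and one pm per pair
permute_within_patient <- function(cond, pat) {
  out <- cond
  for (p in levels(pat)) {
    idx <- which(pat == p)
    out[idx] <- cond[idx[sample.int(length(idx))]]
  }
  out
}

cond_perm <- permute_within_patient(cond, pat)
print(table(pat, cond_perm))

# The permuted labels would then replace the real ones in the adjustment, e.g.
# (not run here, needs the sva package and the real count matrix):
# adjusted_perm <- ComBat_seq(matrix_prim_vs_pm, batch = pat, group = cond_perm)
```

Repeating this for several permutations and comparing the PCA plots and DEG counts with those from the real labels gives a rough idea of how much signal the adjustment itself creates.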
I personally only do ComBat-type stuff when I don't deal with biological conditions (e.g. if I want to see which genes are correlated among 50 tumor samples which were sequenced in different batches) -- not for any primary DE analysis.
Some important thoughts (from others) here: A: Are we tricking ourselves with batch effect correction?
The author of that linked post writes: "Our primary advice for an investigator facing an unbalanced data set with batch effects, would be to account for batch in the statistical analysis. If this is not possible, batch adjustment using outcome as a covariate should only be performed with great caution."
I tend to agree.
Thanks for the link. Interesting reading! I will permute the conditions and see what happens.
It seems that permuting my condition labels slightly changes how the samples cluster. I noticed that I get better results when I do not add the covariate (biological condition) and give only the batch to ComBat_seq...
But it is still very inconclusive. I will run GSEA on the DE genes to check whether they relate to metastasis.
Can you please add all code and plots to the post? It is difficult to argue based on words alone. In principle, adding pat as a blocking factor in the DESeq2 design, as you did, and treating it as the batch in ComBat-Seq should give similar results, as far as I understand, since both try to eliminate the baseline differences that the different patients introduce and focus on the tumor vs. metastasis comparison.
True! Code and PCA plots were added to the post; the PM group corresponds to metastasis.
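To illustrate what blocking on patient does, here is a small base R sketch on simulated data for a single gene (not DESeq2 itself, and all numbers are made up). It assumes a large patient-specific baseline and a small condition effect. With complete pairs, the test for `cond` in a linear model that includes `pat` is the same as a paired t-test, whereas the model without `pat` is swamped by the patient differences.

```r
set.seed(42)

n_pat <- 8
pat  <- factor(rep(paste0("P", seq_len(n_pat)), each = 2))
cond <- factor(rep(c("prim", "pm"), times = n_pat), levels = c("prim", "pm"))

# Simulated log-expression of one gene: strong patient baseline, small pm effect
baseline <- rnorm(n_pat, mean = 10, sd = 3)
y <- baseline[as.integer(pat)] + 0.5 * (cond == "pm") + rnorm(2 * n_pat, sd = 0.3)

# Patient as blocking factor (analogous to design = ~ pat + cond)
fit_blocked <- lm(y ~ pat + cond)
t_blocked <- coef(summary(fit_blocked))["condpm", "t value"]

# Paired t-test on the same data (samples are ordered by patient within each group)
t_paired <- unname(t.test(y[cond == "pm"], y[cond == "prim"], paired = TRUE)$statistic)

# Ignoring the patient entirely (analogous to design = ~ cond)
fit_unblocked <- lm(y ~ cond)
t_unblocked <- coef(summary(fit_unblocked))["condpm", "t value"]

print(c(blocked = t_blocked, paired = t_paired, unblocked = t_unblocked))
```

The blocked and paired statistics agree, which is why the PCA of uncorrected data can be dominated by patient while the paired design still tests the within-patient difference.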
Thanks! How many DEGs do you get with each of the two strategies? In fact, I do not really see an "improvement" in the second PCA; you still have notable dispersion.
Yes, I agree that the improvement is modest. Without ComBat_seq I get 0 DEGs, and after batch correction 93 genes.