I have a bulk RNA sequencing dataset. I have grown diseased and healthy cells in the lab and put them in two different culture dimensionalities, 2D and 3D culture. I then have four groups: 2D healthy, 3D healthy, 2D diseased, 3D diseased. I am trying to answer the following question:
For a set of genes (~100 genes) previously identified to regulate a disease phenotype of interest, how does culture dimensionality impact their expression profiles?
To start, I made two differentially expressed gene (DEG) lists by comparing 2D diseased vs 2D healthy (refer to as "2D contrast"), and 3D diseased vs 3D healthy ("refer to as 3D contrast"). The reason why I made these within-culture-dimensionality contrasts is to account/normalize for the non-disease-related effects between 2D and 3D culture, subsequently narrowing the focus to the disease-related differences.
I began with a scatter plot (https://ibb.co/HHFR0V6) of the log2FCs (with associated log2FC standard deviations) for the ~100 genes. If there is no effect of culture dimensionality (2D vs 3D) on diseased gene expression profiles (diseased vs healthy), then all genes' log2FCs obtained for the 2D (diseased vs healthy, in 2D) and 3D (diseased vs healthy, in 3D) comparisons should lie on the line of identity (x = y). In the attached scatter plot, a point (gene) falls outside the dashed lines when the difference for that gene's log2FC between the 2D and 3D comparisons is greater than 1. However, I see two challenges with this approach: 1) The difference between a log2FC of 3 and 2 (3 - 2 = 1) is equivalent to a difference of FC 4 and 9. The same log2FC difference between a log2FC of 4 and 3 (4 - 3 = 1) is equivalent to a difference of FC 9 and 16, which is a greater difference than the first scenario. Thus, drawing lines to indicate when a log2FC difference is >1 between comparisons is not capturing the true differences due to the exponential nature of the data. 2) A log2FC difference of 1 is arbitrary and not statistically-driven.
Relatedly, in the case the genes' log2FCs are moving in the same direction (negative or positive), I have attempted the same approach except with plotting q-values (FDR), see https://ibb.co/Ch6nJPd. There are cases where there are a set of genes with q < 0.05 in 2D, but not 3D and vise-versa. Additionally, cases were q < 0.05 in both 2D and 3D. However, the challenge lies in the cases where a gene in both 2D and 3D with q < 0.05, but the disparity between the values (say q = 0.01 in 2D, but q = 0.000001 in 3D, e.g. blue point in graph) is an interesting biological change impacted by culture dimensionality I'd like to statistically capture.
Ultimately, I am trying to discern the impacts 3D culture has on disease phenotypes when compared to 2D, while normalizing to the healthy controls.
I have spent far too long on this problem and decided to take a shot here since I am getting desperate for a solution. Thank you for your time in advance!
Hi @rpolicastro,
Thank you for the response. I will give this a try. Did you have a suggest R package in mind for doing this?
In your example, is there a way to determine that
20 - 10 = 10
represents a change that is statistically higher than15 - 10 = 5
? This is at the core of my question.Both edgeR and DEseq2 allow these regression models. Both will return the Log2FC and p-values for the interaction term comparisons.
The edgeR documentation section 3.3.4 is similar to what you want to do.
Also the other person that responded (Gordon Smyth) is the legend behind edgeR, limma, and a few other high profile softwares. Definitely take his advice into consideration.