Statistically comparing two separate DEG lists
2
2
Entering edit mode
6 months ago
Nick F ▴ 20

I have a bulk RNA sequencing dataset. I have grown diseased and healthy cells in the lab and put them in two different culture dimensionalities, 2D and 3D culture. I then have four groups: 2D healthy, 3D healthy, 2D diseased, 3D diseased. I am trying to answer the following question:

For a set of genes (~100 genes) previously identified to regulate a disease phenotype of interest, how does culture dimensionality impact their expression profiles?

To start, I made two differentially expressed gene (DEG) lists by comparing 2D diseased vs 2D healthy (refer to as "2D contrast"), and 3D diseased vs 3D healthy ("refer to as 3D contrast"). The reason why I made these within-culture-dimensionality contrasts is to account/normalize for the non-disease-related effects between 2D and 3D culture, subsequently narrowing the focus to the disease-related differences.

I began with a scatter plot (https://ibb.co/HHFR0V6) of the log2FCs (with associated log2FC standard deviations) for the ~100 genes. If there is no effect of culture dimensionality (2D vs 3D) on diseased gene expression profiles (diseased vs healthy), then all genes' log2FCs obtained for the 2D (diseased vs healthy, in 2D) and 3D (diseased vs healthy, in 3D) comparisons should lie on the line of identity (x = y). In the attached scatter plot, a point (gene) falls outside the dashed lines when the difference for that gene's log2FC between the 2D and 3D comparisons is greater than 1. However, I see two challenges with this approach: 1) The difference between a log2FC of 3 and 2 (3 - 2 = 1) is equivalent to a difference of FC 4 and 9. The same log2FC difference between a log2FC of 4 and 3 (4 - 3 = 1) is equivalent to a difference of FC 9 and 16, which is a greater difference than the first scenario. Thus, drawing lines to indicate when a log2FC difference is >1 between comparisons is not capturing the true differences due to the exponential nature of the data. 2) A log2FC difference of 1 is arbitrary and not statistically-driven.

Relatedly, in the case the genes' log2FCs are moving in the same direction (negative or positive), I have attempted the same approach except with plotting q-values (FDR), see https://ibb.co/Ch6nJPd. There are cases where there are a set of genes with q < 0.05 in 2D, but not 3D and vise-versa. Additionally, cases were q < 0.05 in both 2D and 3D. However, the challenge lies in the cases where a gene in both 2D and 3D with q < 0.05, but the disparity between the values (say q = 0.01 in 2D, but q = 0.000001 in 3D, e.g. blue point in graph) is an interesting biological change impacted by culture dimensionality I'd like to statistically capture.

Ultimately, I am trying to discern the impacts 3D culture has on disease phenotypes when compared to 2D, while normalizing to the healthy controls.

I have spent far too long on this problem and decided to take a shot here since I am getting desperate for a solution. Thank you for your time in advance!

RNA-Seq statistics • 339 views
2
Entering edit mode
6 months ago

Another potential approach is to use an interaction term in your regression model. Your experimental design is roughly the following.

           status culture
sample_1  disease      2D
sample_2  disease      2D
sample_3  disease      2D
sample_4  healthy      2D
sample_5  healthy      2D
sample_6  healthy      2D
sample_7  disease      3D
sample_8  disease      3D
sample_9  disease      3D
sample_10 healthy      3D
sample_11 healthy      3D
sample_12 healthy      3D


Including an interaction term in your model, such as ~ status + culture + status:culture allows more flexibility in contrasts. It's a bit confusing to understand at first, but to put it simply you can directly test the effect of culture dimension on the comparison of disease vs healthy.

Let's take the average expression values of hypothetical gene 1 as an example.

• mean 2D healthy = 10
• mean 2D disease = 15
• mean 3D healthy = 10
• mean 3D disease = 20

For 2D culture there is mean expression increase of 15 - 10 = 5, and for 3D culture there is a mean expression increase of 20 - 10 = 10. Clearly the gene expression increase is almost double in in diseases vs healthy when in 3D versus 2D culture.

Testing the interaction term is equivalent to doing (20 - 10) - (15 - 10) = 5, indicating that when going from 2D culture to 3D culture, the average expression increase when comparing disease to healthy increases by 5.

0
Entering edit mode

Hi @rpolicastro,

Thank you for the response. I will give this a try. Did you have a suggest R package in mind for doing this?

In your example, is there a way to determine that 20 - 10 = 10 represents a change that is statistically higher than 15 - 10 = 5? This is at the core of my question.

1
Entering edit mode

Both edgeR and DEseq2 allow these regression models. Both will return the Log2FC and p-values for the interaction term comparisons.

The edgeR documentation section 3.3.4 is similar to what you want to do.

Also the other person that responded (Gordon Smyth) is the legend behind edgeR, limma, and a few other high profile softwares. Definitely take his advice into consideration.

1
Entering edit mode
6 months ago
Gordon Smyth ★ 2.5k

Comparing a previous set of genes to your current DE list is called "gene set testing". There are a number of methods and tools to do that. We implemented a number of gene set test methods in the limma package.

0
Entering edit mode

Hi Gordon Smyth,

Thank you for the reply. To give more background, the ~100 genes were derived from using gene set testing. Currently, I am digging deeper into why this pathway has much lower p values in my 3D cultures than 2D based from the 100 genes expression profiles and subsequently nominate follow-up hypotheses around it.