Question

How should the correlation of module eigengenes with categorical external traits be assessed in WGCNA?

22 months ago

BioNovice247 ▴ 20

Hi all. Sorry in advance for any gross mistakes. I'm a novice in this field as is clear from my username.

In most applications of Weighted Gene Co-expression Network Analysis, I see that the authors assess the correlation of module eigengenes (which are numeric variables) with various categorical variables (such as disease status). Check the plot in the following link as an example:

https://www.ncbi.nlm.nih.gov/core/lw/2.0/html/tileshop_pmc/tileshop_pmc_inline.html?title=Click%20on%20image%20to%20zoom&p=PMC3&id=7367932_OTT-13-6805-g0004.jpg

However, I'm not sure how this correlation is assessed. I have searched on various forums and could not come up with a standard approach. My questions:

I have seen people recommending linear regression for this purpose, with the eigengene expression as the dependent variable and the categorical variable as the independent variable, and then using the square root of the R-squared as a measure of association similar to the Pearson correlation coefficient. Is this method acceptable? If so, how do we determine whether the correlation is positive or negative?

I have seen others saying it is possible to use logistic regression (with the categorical variable as the dependent variable and the eigengene as the independent variable) for this purpose. If this is possible, where do we get correlation coefficients from?

I have also seen people saying it is OK to numerically code the categorical variable (e.g., treated = 1 and non-treated = 0) and then use the Pearson correlation. I suspect that this is the method used in the papers I see every day (am I right?). But is this statistically sound? As far as I know, the Pearson correlation determines if a variable increases or decreases when the other variable increases or decreases. However, in this case, 0 and 1 only code for categories and do not represent an increase or decrease.

Is there any other standard approach used for this purpose that I'm not aware of? I have seen people recommending other approaches for assessing correlation of categorical and continuous variables (e.g., point-biserial correlation) but I doubt these are the methods used in the WGCNA literature.

Thanks in advance for your time and advice

RNA-seq Correlation WGCNA • 1.4k views

updated 22 months ago by peter.langfelder ▴ 80 • written 22 months ago by BioNovice247 ▴ 20

score 4 · Accepted Answer · 2022-06-14

They are all valid approaches, and some of them are special cases of others. You can correlate a binary variable with a continuous variable; the resulting test is equivalent to a pooled-variance t-test. If you run a univariate linear regression model with the eigengene as the dependent variable, the R squared equals the squared correlation, so you can take the square root and determine the sign from the sign of the regression coefficient. The advantage of linear models is that you can add covariates. Logistic regression swaps the dependent and independent variables. It does not lead to a natural R squared measure, and I am not sure which way would be best to define one.
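The equivalences described above can be checked numerically. The following is a minimal sketch in plain Python (standard library only), not WGCNA code: the eigengene values and the 0/1 trait coding are made up for illustration, and the helper names `pearson`, `ols_slope_r2`, and `pooled_t` are hypothetical. It assumes a binary trait coded 0/1 and uses the standard identity t = r * sqrt((n - 2) / (1 - r^2)) to compare the correlation with the pooled-variance t statistic.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def ols_slope_r2(x, y):
    """Univariate least squares of y on x: returns (slope, R squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, 1 - ss_res / ss_tot

def pooled_t(group1, group0):
    """Two-sample t statistic with pooled variance (group1 minus group0)."""
    n1, n0 = len(group1), len(group0)
    m1, m0 = sum(group1) / n1, sum(group0) / n0
    ss1 = sum((v - m1) ** 2 for v in group1)
    ss0 = sum((v - m0) ** 2 for v in group0)
    pooled_var = (ss1 + ss0) / (n1 + n0 - 2)
    return (m1 - m0) / math.sqrt(pooled_var * (1 / n1 + 1 / n0))

# Made-up example: 4 untreated (0) and 4 treated (1) samples.
trait = [0, 0, 0, 0, 1, 1, 1, 1]
eigengene = [-0.35, -0.20, -0.41, -0.10, 0.22, 0.31, 0.15, 0.38]
n = len(trait)

# 1) Pearson correlation with the 0/1-coded trait (point-biserial correlation).
r = pearson(trait, eigengene)

# 2) Regression of the eigengene on the trait: sqrt(R^2), signed by the slope.
slope, r2 = ols_slope_r2(trait, eigengene)
signed_root = math.copysign(math.sqrt(r2), slope)

# 3) t statistic implied by r versus the pooled-variance t statistic.
t_from_r = r * math.sqrt((n - 2) / (1 - r ** 2))
t_pooled = pooled_t(
    [e for e, t in zip(eigengene, trait) if t == 1],
    [e for e, t in zip(eigengene, trait) if t == 0],
)

print(f"r = {r:.6f}, signed sqrt(R^2) = {signed_root:.6f}")
print(f"t from r = {t_from_r:.6f}, pooled t = {t_pooled:.6f}")
```

The two printed pairs should agree up to floating-point error. A positive value here means the eigengene is higher on average in the group coded 1.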

If I were to calculate the correlation, a simple correlation of the eigengene with the outcome would do.
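As a rough illustration of that simple correlation applied to several modules at once, here is a sketch in plain Python. The module names, eigengene values, and sample labels are all made up, and `pearson` is a hypothetical helper rather than a WGCNA function. It also shows that the choice of which group is coded 1 only flips the sign of each correlation, so the sign is interpreted relative to the coding.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Made-up eigengenes for three hypothetical modules across six samples.
eigengenes = {
    "MEblue":  [-0.4, -0.3, -0.5, 0.3, 0.5, 0.4],
    "MEbrown": [0.5, 0.2, 0.4, -0.3, -0.5, -0.3],
    "MEgreen": [0.1, -0.2, 0.3, -0.1, 0.2, -0.3],
}
status = ["control", "control", "control", "disease", "disease", "disease"]

# Code the binary trait numerically; the choice of reference group is arbitrary.
coded = [1 if s == "disease" else 0 for s in status]
flipped = [1 - c for c in coded]  # control = 1, disease = 0

# One correlation per module: a single column of a module-trait table.
table = {name: pearson(coded, values) for name, values in eigengenes.items()}
table_flipped = {name: pearson(flipped, values) for name, values in eigengenes.items()}

for name in eigengenes:
    print(f"{name}: r = {table[name]:+.3f} (disease = 1), "
          f"r = {table_flipped[name]:+.3f} (control = 1)")
```

A trait with more than two categories has no single 0/1 coding, so this sketch covers the binary case only.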