WGCNA modules and categorical traits relationship
3
3
Entering edit mode
6.3 years ago
BrunoGiotti ▴ 120

Hi there,

I have looked a bit around for this question but i haven't managed to fully understand the answers. I have done a WGCNA analysis on my data which ended up identifying several modules of co-expressed genes. Now, I would like to calculate significance of the correlation between the eigengenes and the trait data to further narrow down on what is more interesting. My traits data however has two covariates: time points, which is divided in 3, 7 and 35 days and virus strains, which are divided in 5 groups (control included). As far as i understand it shouldn't be allowed to replace integers for each strain group in order to calculate some kind of correlation, or is it? What other statistical analysis could i perform?

WGCNA • 17k views
ADD COMMENT
0
Entering edit mode

trait dataI will like to know if I can use my binary trait data directly or I have to use the "binarizeCategoricalVariable" function. Thank you

ADD REPLY
0
Entering edit mode

That decision is for you to make.

ADD REPLY
0
Entering edit mode

I will wait for response of people with similar experience

ADD REPLY
10
Entering edit mode
6.3 years ago

If these are the only traits in which you're interested, then you can either correlate the module values to these (with the traits encoded numerically as 1, 2, 3, etc), or, better, build a multinomial logistic regression model with the module values as x and time/strain as y ( glm(time/strain ~ Module1); glm(time/strain ~ Module2); et cetera)

You may also have to consider dividing up the analysis into multiple analyses, and contrast/compare the results manually. For example, running WGCNA separately for the different time-points may be an idea, and then building the regression model predicting for virus strain each time.

ADD COMMENT
1
Entering edit mode

Thanks a million for your answer Kevin, that is what i wanted to know. I'll try both methods you suggested! I already separated my data by another factor (tissue) as this was the main source of variance, interesting point to further divide the data too, ill dig into it!

Cheers!

ADD REPLY
0
Entering edit mode

Hi Kevin,

I have TPM RNA-seq file for 53 human stem cell samples control vs lead (pb) treatment in days 1-26 plus to day 0 just in control. For correlating modules to traits in WGCNA, I put 0 in control and 1 in lead treatment samples as screenshot. For example in day1 I put 0 in control and 1 in treatment. Am I correct please? However I don't know what to put for day0

The aim would be time series study

Thank you for any suggestion

ADD REPLY
1
Entering edit mode

Hello again Superstar, As I understand, you have samples that have been treated with and without lead, the metal (Pb)? Moreover, you have looked at these samples over a time-course of 0-26 days?

It is correct to encode these are 0 (control) and 1 (treatment). For Day0, although the treatment may have no effect, you should still use 0 and 1. Otherwise, you can choose to not include these.

ADD REPLY
1
Entering edit mode

Thanks a lot, you are all right about my experiments; Cells have been harvested prior to treatment (day 0) and daily, from day 1 to 26, after lead exposure and cells without treatment. Thus, I have Control_day1 to Control_day26 and Lead metal_day1 to Lead metal_day26 while I have just Control_day0, totally 53 samples. As I don't have Lead_day0, do you suggest to leave this column all with zero?

Thank you for you pateince

ADD REPLY
1
Entering edit mode

Well, the 53 day0 samples are just the 'Baseline' samples, in that case.

The way in which you encode these should reflect how you want to use them in your statistical comparisons. I presume that most of your comparisons will be:

  • Lead day1 vs Control day1
  • Lead day2 vs Control day2
  • Lead day3 vs Control day3
  • et cetera

The control day0 samples, therefore, have no immediate use in these types of pairwise comparisons; however, they represent the fundamental baseline state of the cell-type.

Edit: if you encode the day0 samples as all zero, then they will neither have utility in module comparisons because you cannot correlate something to a vector of zeros.

ADD REPLY
1
Entering edit mode

Thanks a lot Kevin, a nice weekend ahead

ADD REPLY
0
Entering edit mode

Please excuse me, today I re-read your comment; actually I don't have 53 day0 samples, instead I have totally 53 samples: Control day1 to Control day26 + Lead day1 to Lead day26 + Control day0 = 53 samples

I have put 0 for Controls and 1 for Lead but I don't know either a 0 or 1 should put for Control day0

Thank you

ADD REPLY
1
Entering edit mode

Perhaps not completely accurate (if Pb has any effect on measurements by its presence) but you could use Control day 0 for both (making an even 54 pairs).

ADD REPLY
0
Entering edit mode

Thanks a lot, as always this is not my own data and a pre-existed data set for another application in which PI has asked me to find genes related to each developmental stage in stem cells by WGCNA and time series analysis. As I don't have quantitative trait file for Lead treatment I have to make a binary trait file to relate the modules to each day. However thank you for paying attention.

ADD REPLY
0
Entering edit mode

Thanks for helping genomax. I was traveling overnight back to Europe

ADD REPLY
0
Entering edit mode

Sorry,

For correlating a binary trait file to Module eigengenes, I use Pearson correlation like

 `moduleTraitCor = cor(MsE, datTraits, use= "p")`

as I use for quantitative traits, do you think is correct?

ADD REPLY
1
Entering edit mode

Yes, that should be fine, if MsE contains your module values and datTraits contains your clinical variables / traits.

Here is an example that I did last year using my own CorLevelPlot code:

h

ADD REPLY
1
Entering edit mode

Thanks a lot Kevin,

ADD REPLY
0
Entering edit mode

Excuse me for too much questioning,

I read that EBSeq can manage differential expression without replicates. I need differentially expressed genes to obtain principal component for WGCNA. what made me confused is: I have control cell line day1 to day 26 and Lead(pb) treatment day1 to day26. The experimental design aims to get the impact of Lead(pb) on developmental process . I don't know whether I can consider days as replicates or in fact each day is a distinct condition?

Thank you for your time

ADD REPLY
1
Entering edit mode

I think that each day is a distinct 'condition'. The interest is in finding out the expression patterns that have changed on each day.

DESeq2 neither requires replicates due to the face that a 'pseudo-reference' is used for the purposes of data normalisation (specifically the size factor calculation).

ADD REPLY
1
Entering edit mode

Thank you very much for help

ADD REPLY
0
Entering edit mode

Excuse me Kevin,

I was given a trait file contains both quantitative and categorical traits for WGCNA as this figure

Gender column is (1=male, 2=female) and Biopsy_Taken column is post or pre training

I could not figure out how manage these columns for WGCNA, I then changed these columns so as this figure

where WGCNA gave me this heatmap

As you are considering, female vs male or pre-training vs post training show the same correlation only positively or negatively

If you were me how you relate these traits to your principal components please?

These are two other changes I did and then plotted

I think as genes expression has been measured in pre-training vs post-training, might be no need to include pre or post training in trait file

ADD REPLY
0
Entering edit mode

Hello Fereshteh, you have technically already done this in the best way. For these 'binary' traits, such as male / female, case / control, pass / fail, et cetera, it is best to encode them simply as 0 and 1. If you find a statistically significant positive correlation, it immediately indicates that there is a relationship between the module and the binary trait. You do not have to split the binary traits into 2 further traits.

Looking at your figure, I can say the following:

  • Gender has a statistically significant influence on 4 different modules (at 5% alpha , i.e., p<0.05)
  • Biopsy site has a statistically significant influence on 2 modules

The direction of the correlation is not of immediate interest. For binary traits, we just want to see if there are any statistically significant ones.

This way of working with binary traits follows the recommendation of the chief WGCNA developer.

ADD REPLY
0
Entering edit mode

Thank you, I will keep on then

ADD REPLY
0
Entering edit mode

Hello Kevin,

Could you please explain what is meaning of the negative correlation in the module-trait heatmap? I knew the direction of the correlations, probably, was not important from your above comments. I am just curious about it, and my traits are not binary. I got total 105 samples from 5 different tissues, 7 time points and with 3 replicates. Just a guess, does it mean genes in these modules are underexpressed regarding this particular trait?

And this is my module-trait relationship.

Thank you in advance

ADD REPLY
0
Entering edit mode

For calculating the correlation p-value, the WGCNA tutorial uses Student's t-test corPvalueStudent(). Can it also be used while dealing with categorical variables?

ADD REPLY
0
Entering edit mode

I think that anything that's logical can be used in relation to WGCNA output.

ADD REPLY
0
Entering edit mode

Can you help me with some references/tutorials as to why and how to use?

ADD REPLY
1
Entering edit mode

Here is one of my own studies: https://pubmed.ncbi.nlm.nih.gov/29908154/

There really are no rules with WGCNA.

ADD REPLY
0
Entering edit mode

Trait data Hello, similar to the above query, I have put my trait data values as binary. I will like to know if I have to use the data directly as "0" and "1" or do I have to change the data to binary format using the "binarizeCategoricalVariable" function.

ADD REPLY
0
Entering edit mode

Hello A,

Please can you let me know if you used the above data directly, or did you have to use the "binarizeCategoricalVariable" function? I have indicated my trait data with "0" and "1". I just want to be sure if I am doing the right thing.

ADD REPLY
1
Entering edit mode

I have never even heard of the binarizeCategoricalVariable() function. I just encode my categorical variables myself using factor() and relevel().

ADD REPLY
0
Entering edit mode

Thank you very much for your response. Can you please tell me how to use this with respect to my data? I have my data in the same format as posted by "A" above

ADD REPLY
0
Entering edit mode

What have you already tried? Please feel free to open a new question specifically about binarizeCategoricalVariable(), if you wish, but please also provide as much information as possible.

ADD REPLY
1
Entering edit mode

I actually did that manually, but factors() would be better. binarizeCategoricalVariable() is another option with slightly different way of analysis as described by the WGCNA developers here.

ADD REPLY
0
Entering edit mode

Can you take a look at this from Arindam, kaybio?

ADD REPLY
0
Entering edit mode

Hi Kevin,

How would you reccomend we build the logistic regression model. I am relatively new to R and having a lot of trouble trying to download nnet and follow this tutorial linked below for building a multinomial logistic regression model.

ADD REPLY
0
Entering edit mode

I cannot see what data that you have in front of you; however, generally, it just involves regressing the module eigenvalues to whatever traits that you have, e.g.:

glm(CaseControl ~ module1)
glm(CaseControl ~ module2)
... ...
glm(CaseControl ~ moduleX)
ADD REPLY
0
Entering edit mode

i asked other WGCNA question , so here the caseControl is one of the data in the dataframe inside each module ..if i understand how the basic glm runs...

ADD REPLY
0
Entering edit mode
17 months ago

Hi all,

I have a question about the module-trait correlation heatmap.


The trait is kind of exposure time, from the heatmap, there's one "lightblue" module indicated high correlation (red) with correlation coefficient 0.89, very small p value.After getting this heatmap, I output my "lightblue" module and input it to the cytoscape to visualize the network. My question is that this module is only for the 2h exposure network or for all the time points network together? My goal is to compare the 2h exposure to baseline. I only have 6 samples for each time point, it is not reasonable (too small) to sperate the time point for the heatmap and network. Any suggestions would be appreciated.

ADD COMMENT

Login before adding your answer.

Traffic: 3395 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6