Question: WGCNA modules and categorical traits relationship
BrunoGiotti110 wrote:

Hi there,

I have looked around for an answer to this question but have not managed to fully understand the answers I found. I have run a WGCNA analysis on my data, which identified several modules of co-expressed genes. Now I would like to calculate the significance of the correlation between the module eigengenes and the trait data, to narrow down which modules are most interesting. My trait data, however, has two covariates: time point (3, 7 and 35 days) and virus strain (5 groups, control included). As far as I understand, it is not valid to substitute arbitrary integers for the strain groups in order to calculate some kind of correlation, or is it? What other statistical analysis could I perform?

wgcna • 6.2k views
ADD COMMENTlink modified 19 days ago by kaybio0 • written 2.5 years ago by BrunoGiotti110

Trait data: I would like to know whether I can use my binary trait data directly or whether I have to use the "binarizeCategoricalVariable" function. Thank you.

ADD REPLYlink modified 19 days ago • written 19 days ago by kaybio0

That decision is for you to make.

ADD REPLYlink written 19 days ago by Kevin Blighe61k

I will wait for a response from people with similar experience.

ADD REPLYlink written 19 days ago by kaybio0
Kevin Blighe61k wrote:

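If these are the only traits in which you're interested, then you can either correlate the module values to these (with the traits encoded numerically as 1, 2, 3, etc.), or, better, build a multinomial logistic regression model with the module values as x and time or strain as y, one model per module (strain ~ Module1; strain ~ Module2; et cetera). Note that glm() with family = binomial only handles a two-level outcome; for an outcome with more than two levels, a function such as multinom() from the nnet package is needed.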

You may also have to consider dividing up the analysis into multiple analyses, and contrast/compare the results manually. For example, running WGCNA separately for the different time-points may be an idea, and then building the regression model predicting for virus strain each time.
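As an illustration of the encoding question raised above, here is a minimal base-R sketch that avoids arbitrary integers by giving each strain its own 0/1 indicator column and correlating a module eigengene with each one. Everything in it is assumed for illustration: the strain labels, the group size of 6, and the simulated eigengene `ME1` are hypothetical and not taken from the original data.

```r
set.seed(1)

# Hypothetical design: 5 strain groups (control included), 6 samples each
strain <- factor(rep(c("control", "strainA", "strainB", "strainC", "strainD"),
                     each = 6))

# Simulated module eigengene, shifted upwards in strainB only
ME1 <- rnorm(length(strain)) + ifelse(strain == "strainB", 3, 0)

# One 0/1 indicator column per strain, instead of arbitrary integers 1..5
indicators <- model.matrix(~ 0 + strain)
colnames(indicators) <- levels(strain)

# Pearson correlation of the eigengene with each indicator, plus p-values
r <- cor(ME1, indicators, use = "p")
p <- apply(indicators, 2, function(x) cor.test(ME1, x)$p.value)

print(round(rbind(r = r[1, ], p = p), 3))
```

Each column then answers a "this strain versus all other samples" question, which is not the same as a comparison against control only; that difference should be kept in mind when reading the resulting heatmap.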

ADD COMMENTlink modified 19 months ago • written 2.5 years ago by Kevin Blighe61k

Thanks a million for your answer Kevin, that is what I wanted to know. I'll try both methods you suggested! I already separated my data by another factor (tissue), as this was the main source of variance. Dividing the data further is an interesting point too, I'll dig into it!

Cheers!

ADD REPLYlink written 2.5 years ago by BrunoGiotti110

Hi Kevin,

I have a TPM RNA-seq file for 53 human stem cell samples, control vs lead (Pb) treatment, on days 1-26, plus day 0 in control only. For correlating modules to traits in WGCNA, I put 0 for control and 1 for lead-treated samples, as in the screenshot. For example, for day 1 I put 0 for control and 1 for treatment. Is this correct, please? However, I don't know what to put for day 0.

The aim is a time-series study.

Thank you for any suggestion

ADD REPLYlink modified 5 weeks ago by RamRS27k • written 2.3 years ago by A3.8k

Hello again Superstar, As I understand, you have samples that have been treated with and without lead, the metal (Pb)? Moreover, you have looked at these samples over a time-course of 0-26 days?

It is correct to encode these as 0 (control) and 1 (treatment). For Day0, although the treatment may have no effect, you should still use 0 and 1. Otherwise, you can choose not to include these samples.

ADD REPLYlink written 2.3 years ago by Kevin Blighe61k

Thanks a lot, you are right about my experiment. Cells were harvested prior to treatment (day 0) and then daily, from day 1 to day 26, both after lead exposure and without treatment. Thus, I have Control_day1 to Control_day26 and Lead metal_day1 to Lead metal_day26, but only Control_day0, for 53 samples in total. As I don't have Lead_day0, do you suggest leaving this column as all zeros?

Thank you for your patience

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by A3.8k

Well, the 53 day0 samples are just the 'Baseline' samples, in that case.

The way in which you encode these should reflect how you want to use them in your statistical comparisons. I presume that most of your comparisons will be:

  • Lead day1 vs Control day1
  • Lead day2 vs Control day2
  • Lead day3 vs Control day3
  • et cetera

The control day0 samples, therefore, have no immediate use in these types of pairwise comparisons; however, they represent the fundamental baseline state of the cell-type.

Edit: if you encode the day0 samples as all zero, then they will not be of use in module comparisons either, because you cannot correlate something with a vector of zeros.
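The point about a vector of zeros can be checked directly in base R. This is a minimal sketch on simulated values; the eigengene `ME` and both trait columns are hypothetical.

```r
set.seed(42)

ME <- rnorm(10)                        # simulated module eigengene, 10 samples
all_zero  <- rep(0, 10)                # trait column coded 0 for every sample
treatment <- rep(c(0, 1), each = 5)    # trait column with both levels present

# A constant column has zero variance, so Pearson correlation is undefined:
# cor() returns NA (and warns that the standard deviation is zero)
r_zero <- suppressWarnings(cor(ME, all_zero))

# A column containing both 0 and 1 gives an ordinary correlation
r_trt <- cor(ME, treatment)

print(c(all_zero = r_zero, treatment = r_trt))
```

The same applies to any trait column in which every sample carries the same value, whatever that value is.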

ADD REPLYlink modified 19 months ago • written 2.3 years ago by Kevin Blighe61k

Thanks a lot Kevin, a nice weekend ahead

ADD REPLYlink written 2.3 years ago by A3.8k

Please excuse me, today I re-read your comment; actually I don't have 53 day0 samples. Instead, I have 53 samples in total: Control day1 to Control day26 + Lead day1 to Lead day26 + Control day0 = 53 samples

I have put 0 for Controls and 1 for Lead, but I don't know whether I should put 0 or 1 for Control day0

Thank you

ADD REPLYlink written 2.3 years ago by A3.8k

Perhaps not completely accurate (if Pb has any effect on measurements by its presence), but you could use Control day 0 for both groups (making an even 54 samples, i.e., 27 pairs).

ADD REPLYlink written 2.3 years ago by genomax85k

Thanks a lot. As always, this is not my own data but a pre-existing data set from another application, for which my PI has asked me to find genes related to each developmental stage in stem cells using WGCNA and time-series analysis. As I don't have a quantitative trait file for the lead treatment, I have to make a binary trait file to relate the modules to each day. Thank you for paying attention, however.

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by A3.8k

Thanks for helping genomax. I was traveling overnight back to Europe

ADD REPLYlink written 2.3 years ago by Kevin Blighe61k

Sorry,

For correlating a binary trait file to module eigengenes, I use Pearson correlation, like

 `moduleTraitCor = cor(MsE, datTraits, use= "p")`

as I do for quantitative traits. Do you think this is correct?

ADD REPLYlink written 2.3 years ago by A3.8k

Yes, that should be fine, if MsE contains your module values and datTraits contains your clinical variables / traits.

Here is an example that I did last year using my own CorLevelPlot code:

[link or image not preserved]

ADD REPLYlink modified 19 months ago • written 2.3 years ago by Kevin Blighe61k

Thanks a lot Kevin,

ADD REPLYlink written 2.3 years ago by A3.8k

Excuse me for asking so many questions,

I read that EBSeq can handle differential expression without replicates. I need differentially expressed genes to obtain the principal components for WGCNA. What confuses me is this: I have the control cell line from day 1 to day 26 and the lead (Pb) treatment from day 1 to day 26. The experimental design aims to capture the impact of lead (Pb) on the developmental process. I don't know whether I can consider the days as replicates or whether each day is in fact a distinct condition.

Thank you for your time

ADD REPLYlink modified 2.3 years ago • written 2.3 years ago by A3.8k

I think that each day is a distinct 'condition'. The interest is in finding out the expression patterns that have changed on each day.

DESeq2 does not require replicates either, due to the fact that a 'pseudo-reference' is used for the purposes of data normalisation (specifically the size factor calculation).

ADD REPLYlink written 2.3 years ago by Kevin Blighe61k

Thank you very much for help

ADD REPLYlink written 2.3 years ago by A3.8k

Excuse me Kevin,

I was given a trait file that contains both quantitative and categorical traits for WGCNA, as in this figure

The Gender column is coded 1 = male, 2 = female, and the Biopsy_Taken column is post- or pre-training

I could not figure out how to handle these columns for WGCNA, so I changed them as in this figure

for which WGCNA gave me this heatmap

As you can see, female vs male and pre-training vs post-training show the same correlation, only with opposite sign

If you were me, how would you relate these traits to your principal components, please?

These are two other changes I made and then plotted

I think that, as gene expression has been measured pre-training vs post-training, there might be no need to include pre- or post-training in the trait file

ADD REPLYlink modified 5 weeks ago by RamRS27k • written 2.3 years ago by A3.8k

Hello Fereshteh, you have technically already done this in the best way. For these 'binary' traits, such as male / female, case / control, pass / fail, et cetera, it is best to encode them simply as 0 and 1. If you find a statistically significant positive correlation, it immediately indicates that there is a relationship between the module and the binary trait. You do not have to split the binary traits into 2 further traits.

Looking at your figure, I can say the following:

  • Gender has a statistically significant influence on 4 different modules (at 5% alpha, i.e., p<0.05)
  • Biopsy site has a statistically significant influence on 2 modules

The direction of the correlation is not of immediate interest. For binary traits, we just want to see if there are any statistically significant ones.

This way of working with binary traits follows the recommendation of the chief WGCNA developer.
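To illustrate why splitting a binary trait into two columns adds nothing, here is a small base-R sketch on simulated values. The eigengene `ME` and the `female` / `male` columns are hypothetical.

```r
set.seed(3)

ME <- rnorm(12)                      # simulated module eigengene, 12 samples
female <- rep(c(0, 1), times = 6)    # single 0/1 column for the binary trait
male   <- 1 - female                 # the redundant complementary column

r_female <- cor(ME, female)
r_male   <- cor(ME, male)

# The two correlations are mirror images: same magnitude, opposite sign,
# and therefore the same p-value
print(c(female = r_female, male = r_male))
```

Because the second column is just one minus the first, a single 0/1 column carries all of the information, which matches the mirrored pattern described for the heatmap above.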

ADD REPLYlink written 2.3 years ago by Kevin Blighe61k

Thank you, I will keep on then

ADD REPLYlink written 2.3 years ago by A3.8k

Hello Kevin,

Could you please explain the meaning of a negative correlation in the module-trait heatmap? I understand from your comments above that the direction of the correlation is probably not important. I am just curious about it, and my traits are not binary. I have 105 samples in total, from 5 different tissues and 7 time points, with 3 replicates. Just a guess: does it mean that genes in these modules are underexpressed with respect to this particular trait?

And this is my module-trait relationship.

Thank you in advance

ADD REPLYlink modified 5 weeks ago by RamRS27k • written 12 months ago by linyao0

For calculating the correlation p-value, the WGCNA tutorial uses Student's t-test corPvalueStudent(). Can it also be used while dealing with categorical variables?

ADD REPLYlink written 5 weeks ago by Arindam Ghosh280

I think that anything that's logical can be used in relation to WGCNA output.
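For a binary (two-level) trait coded 0/1, the Student p-value of a Pearson correlation is the same as the p-value of an equal-variance two-sample t-test between the two groups. The base-R sketch below checks this on simulated data; it computes the Student p-value by hand from r and the sample size rather than calling the WGCNA function, and the eigengene `ME` and `trait` are hypothetical.

```r
set.seed(7)

n <- 20
trait <- rep(c(0, 1), each = n / 2)    # binary trait coded 0/1
ME <- rnorm(n) + 0.8 * trait           # simulated module eigengene

r <- cor(ME, trait)

# Student t-statistic for a Pearson correlation, n - 2 degrees of freedom
t_stat <- sqrt(n - 2) * r / sqrt(1 - r^2)
p_cor  <- 2 * pt(abs(t_stat), df = n - 2, lower.tail = FALSE)

# Equal-variance two-sample t-test on the same grouping
p_ttest <- t.test(ME ~ factor(trait), var.equal = TRUE)$p.value

print(c(p_from_correlation = p_cor, p_from_t_test = p_ttest))
```

This equivalence holds for two-level traits only. A categorical trait with three or more unordered levels coded as 1, 2, 3 does not have this interpretation, so indicator columns or a different test would be needed there.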

ADD REPLYlink written 5 weeks ago by Kevin Blighe61k

Can you point me to some references or tutorials on why and how to use it?

ADD REPLYlink written 4 weeks ago by Arindam Ghosh280

Here is one of my own studies: https://pubmed.ncbi.nlm.nih.gov/29908154/

There really are no rules with WGCNA.

ADD REPLYlink modified 27 days ago • written 29 days ago by Kevin Blighe61k

Trait data: Hello, similar to the above query, I have coded my trait data values as binary. I would like to know whether I can use the data directly as "0" and "1" or whether I have to convert the data to binary format using the "binarizeCategoricalVariable" function.

ADD REPLYlink written 19 days ago by kaybio0

Hello A,

Please can you let me know if you used the above data directly, or did you have to use the "binarizeCategoricalVariable" function? I have indicated my trait data with "0" and "1". I just want to be sure if I am doing the right thing.

ADD REPLYlink written 19 days ago by kaybio0

I have never even heard of the binarizeCategoricalVariable() function. I just encode my categorical variables myself using factor() and relevel().

ADD REPLYlink written 19 days ago by Kevin Blighe61k

Thank you very much for your response. Can you please tell me how to use this with respect to my data? I have my data in the same format as posted by "A" above

ADD REPLYlink written 19 days ago by kaybio0

What have you already tried? Please feel free to open a new question specifically about binarizeCategoricalVariable(), if you wish, but please also provide as much information as possible.

ADD REPLYlink written 17 days ago by Kevin Blighe61k

I actually did that manually, but factor() would be better. binarizeCategoricalVariable() is another option, with a slightly different way of analysis, as described by the WGCNA developers here.

ADD REPLYlink written 17 days ago by Arindam Ghosh280

Can you take a look at this from Arindam, kaybio?

ADD REPLYlink written 17 days ago by Kevin Blighe61k

Hi Kevin,

How would you recommend we build the logistic regression model? I am relatively new to R and am having a lot of trouble trying to download nnet and follow the tutorial linked below for building a multinomial logistic regression model.

ADD REPLYlink written 11 months ago by ninabhatia30

I cannot see what data you have in front of you; however, generally, it just involves regressing whatever traits you have on the module eigengene values, e.g.:

glm(CaseControl ~ module1, family = binomial)
glm(CaseControl ~ module2, family = binomial)
... ...
glm(CaseControl ~ moduleX, family = binomial)
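
As a runnable sketch of the per-module regressions above, assuming a two-level CaseControl outcome fitted as a logistic regression, the following loops over the module columns and collects one p-value per module. The data frame `dat`, its column names, and the simulated values are all hypothetical.

```r
set.seed(11)

n <- 40
dat <- data.frame(
  CaseControl = factor(rep(c("control", "case"), each = n / 2),
                       levels = c("control", "case"))
)
# Simulated module eigengenes: module1 is shifted in cases, module2 is noise
dat$module1 <- rnorm(n) + 1.0 * (dat$CaseControl == "case")
dat$module2 <- rnorm(n)

modules <- c("module1", "module2")

# Fit one logistic regression per module and extract the Wald p-value
# of the module term (second row of the coefficient table)
pvals <- sapply(modules, function(m) {
  fit <- glm(reformulate(m, response = "CaseControl"),
             data = dat, family = binomial)
  coef(summary(fit))[2, "Pr(>|z|)"]
})

print(round(pvals, 4))
```

With many modules, the resulting p-values would normally be adjusted for multiple testing, for example with p.adjust().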
ADD REPLYlink written 11 months ago by Kevin Blighe61k

I asked another WGCNA question. So here, CaseControl is one of the columns in the data frame alongside each module, if I understand how the basic glm runs...

ADD REPLYlink written 7 months ago by krushnach80810