Question: How to deal with the NAs in WGCNA trait file?
BioLite20 wrote (13 months ago):

Hi, heroes,

I want to create a WGCNA trait file from microarray clinical information. This clinical information includes numerical, nominal, and ordinal categorical variables; the awkward thing is that they all contain NAs.

I know that using numbers to represent the former two kinds of variable is a good method, but how should we treat these NAs? Also, should different kinds of NAs be treated differently?

Sorry for my lack of experience in this field. Any suggestions would be appreciated!

BTW, my clinical information table looks like this: [image: all non-tumor groups showing NAs]

Kevin Blighe53k wrote (13 months ago):

For each variable, it is important to understand what NA actually means. Was the value below the detection limit? Did the patient never show up for the test? Did the test fail QC?

Some strategies that have been used for different types of NAs in continuous data:

  • impute them as 0
  • replace them with half the lowest value
  • replace with the median (if univariate testing)

You can also model the data and impute the values as model predictions.
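
As a minimal sketch of the three simple strategies listed above, in base R (the `weight` vector is made-up illustrative data, not taken from the poster's table, and which strategy is appropriate depends on why the values are missing):

```r
# Hypothetical continuous trait with missing values (illustrative data only)
weight <- c(21.5, NA, 30.2, 25.0, NA, 28.1)

# Strategy 1: impute NAs as 0
imp_zero <- ifelse(is.na(weight), 0, weight)

# Strategy 2: replace NAs with half the lowest observed value
imp_halfmin <- ifelse(is.na(weight), min(weight, na.rm = TRUE) / 2, weight)

# Strategy 3: replace NAs with the median of the observed values
imp_median <- ifelse(is.na(weight), median(weight, na.rm = TRUE), weight)
```

Each imputed vector keeps the original length and order, so it can be put back into the trait table in place of the original column.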

In reality, you may not have much choice but to eliminate the samples with NAs. Looking at your data, for example, what can you realistically do with those samples that are NA across all of your variables? That is a situation where, perhaps, you should bite the bullet and accept that the data are too poor to use, rather than trying to salvage them.
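
A rough base R sketch of dropping such samples, assuming a small made-up trait table (the `traits` data frame and its column names are hypothetical):

```r
# Hypothetical trait table; S2 and S4 are NA for every variable
traits <- data.frame(
  age   = c(54, NA, 61, NA),
  stage = c(2, NA, NA, NA),
  row.names = c("S1", "S2", "S3", "S4")
)

# Flag samples that have no observed value at all
all_na <- rowSums(!is.na(traits)) == 0

# Drop only the samples that are NA across all variables
traits_kept <- traits[!all_na, , drop = FALSE]

# Stricter alternative: keep only samples with no NA in any variable
traits_complete <- traits[complete.cases(traits), , drop = FALSE]
```

The first filter keeps partially observed samples such as S3; the second keeps only fully observed ones, which costs more samples.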

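If you are using these in regression against your WGCNA modules, for example, then, depending on the function and its NA handling, either an error will be thrown or the samples with NA will be dropped. If you just correlate them, then the samples with NA can be excluded automatically, but only if you ask for it; with the base R default, "everything", any NA simply produces an NA correlation. This is controlled via the use argument passed to cor():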

use
an optional character string giving a method for computing covariances in the presence of missing values. This must be (an abbreviation of) one of the strings "everything", "all.obs", "complete.obs", "na.or.complete", or "pairwise.complete.obs".
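
To illustrate, here is a small sketch with base R's `cor()` on made-up vectors (`module` and `trait` are hypothetical stand-ins for a module eigengene and a clinical trait). Other packages may supply their own `cor()` with different defaults, so it is worth checking which one is in use:

```r
# Hypothetical module eigengene and trait values; the trait has one NA
module <- c(0.10, 0.40, 0.35, 0.80, 0.75)
trait  <- c(1.0,  NA,   2.0,  4.0,  3.5)

# Base R default (use = "everything"): the NA propagates to the result
r_default <- cor(module, trait)

# Drop the incomplete sample before computing the correlation
r_complete <- cor(module, trait, use = "complete.obs")
r_pairwise <- cor(module, trait, use = "pairwise.complete.obs")

# use = "all.obs" refuses to run when values are missing
r_strict <- tryCatch(
  cor(module, trait, use = "all.obs"),
  error = function(e) "error"
)
```

For two plain vectors, "complete.obs" and "pairwise.complete.obs" give the same value; they differ only when correlating matrices, where pairwise deletion is applied per pair of columns.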

Kevin

modified 4 months ago • written 13 months ago by Kevin Blighe53k

Yes, Kevin, thanks for your help. I will check my data again and hope to get a decent result. Thanks again!

written 13 months ago by BioLite20

But, dear Kevin, please forgive my carelessness first. I noticed that the WGCNA tutorial includes a "ClinicalTraits.csv" file, and that this table contains many NAs. I also noticed that the tutorial code does not remove or otherwise treat these NAs. The tutorial uses the trait file directly, as in the following code:

    # Requires the WGCNA package; datExpr and sampleTree2 come from earlier tutorial steps
    traitData = read.csv("ClinicalTraits.csv");
    # Remove columns that are not needed
    allTraits = traitData[, -c(31, 16)];
    allTraits = allTraits[, c(2, 11:36)];
    # Match the trait rows to the samples in the expression data
    femaleSamples = rownames(datExpr);
    traitRows = match(femaleSamples, allTraits$Mice);
    datTraits = allTraits[traitRows, -1];
    rownames(datTraits) = allTraits[traitRows, 1];
    # Convert traits to a color representation: white means low, red means high, grey means missing entry
    traitColors = numbers2colors(datTraits, signed = FALSE);
    # Plot the sample dendrogram and the colors underneath.
    plotDendroAndColors(sampleTree2, traitColors,
                        groupLabels = names(datTraits),
                        main = "Sample dendrogram and trait heatmap")

It looks like the NAs didn't affect the WGCNA analysis. What did I miss? Please tell me more. Many thanks!

modified 13 months ago • written 13 months ago by BioLite20

The numbers2colors() function can tolerate NAs, but it just colours them the default 'grey':

Missing values are allowed and will be assigned the color given in naColor

[from: https://www.rdocumentation.org/packages/WGCNA/versions/1.66/topics/numbers2colors]

written 13 months ago by Kevin Blighe53k

Thanks for your patience! Have a great day!

written 13 months ago by BioLite20

Dear Dr. Blighe, in the preprocessing step I ran the goodSamplesGenes() function on myExprdata (36 × 19179) with the arguments below:

gsg = goodSamplesGenes(myExprdata,verbose = 3)

Through that process, 146 genes were removed and I was left with 19033 genes. After that, I checked gsg$allOK and got TRUE. But when I then checked myExprdata, I still found 71 missing values. So I ran goodSamplesGenes() on myExprdata again, this time with the arguments below:

gsg = goodSamplesGenes(myExprdata,
                       weights = NULL,
                       minFraction = 1/2,
                       minNSamples = 36,
                       minNGenes = 19000,
                       tol = NULL,
                       minRelativeWeight = 0.1,
                       verbose = 3, indent = 0)

This time, however, 178 genes were removed, 32 more than in the previous attempt, and I now have 19001 genes without any missing values. I would like your comments on my decision. Was it the right one, or should I have done something else? I would appreciate it if you could share your thoughts.

modified 9 weeks ago • written 9 weeks ago by modarzi110

There does not seem to be any problem with your decision.
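
For anyone who wants to double-check the result, a base R sketch along these lines can count the remaining missing values and keep only fully observed genes. It uses a small made-up matrix (`expr`, samples in rows and genes in columns) and is not what goodSamplesGenes() does internally; it is only an independent check under those assumptions:

```r
# Hypothetical expression matrix: 4 samples (rows) x 5 genes (columns)
set.seed(1)
expr <- matrix(rnorm(4 * 5), nrow = 4,
               dimnames = list(paste0("S", 1:4), paste0("G", 1:5)))
expr[2, 3] <- NA
expr[4, 5] <- NA

# Total number of missing entries
n_missing <- sum(is.na(expr))

# Number of missing entries per gene
na_per_gene <- colSums(is.na(expr))

# Keep only genes observed in every sample
expr_complete <- expr[, na_per_gene == 0, drop = FALSE]
```

Requiring a gene to be present in all samples is the strictest possible filter; whether losing those genes is acceptable depends on how many are removed and which ones they are.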

written 9 weeks ago by Kevin Blighe53k