Why are all the networks i get from WGCNA look like this?
1
0
Entering edit mode
3 months ago
Faith ▴ 40

Hello,

I have been working on WGCNA, and after following the tutorial, and normalizing using VAT with no filteration except for what is recommended by the old FAQ. Ended up with 18K genes

All of my modules are essentially 1 gene in the middle and 1000+ genes connected to it, could it be that I'm doing something wrong?

What i'm trying to do is see if there are certain genes regulating a group of other genes, I sort of expected alot of interconnected nodes, not a big network connected only to 1 gene.

enter image description here

Here is the code that I used to create the modules

    powers = c(c(1:20), seq(from = 22, to=30, by=2))
    sft = pickSoftThreshold(normalized_reads_transposed, powerVector = powers, verbose = 5, networkType = 'signed') 
    power=sft$powerEstimate #22

    holder_corr<- cor #We will replace cor function with WCGNA cor function so this will hold the original
cor<- WGCNA::cor

    flower_network<-blockwiseModules(normalized_reads_transposed,
                     power= 22, #Based on soft threshold function above
                     TOMType = 'signed', #signs of network used enforce adjacency signed networks
                     mergeCutHeight = '0.25', #Merge thresholds
                     numericLabels = F, #Colors
                     maxBlockSize = 20000, #clustering block size, data hads 18K genes so i want them in one block for 1 

    clustering
                     verbose = 3, #info output
                     corType = 'pearson', #Pearson
                     networkType = 'signed',
                     saveTOMs = T,
                     saveTOMFileBase = 'TOM',
                     randomSeed = 1234) #signed network for biological regulation
WGCNA R • 1.4k views
ADD COMMENT
0
Entering edit mode

That's how a scale free topology network should look like.

However, it is a little bit strange that every module has a scale free topology because RNA-seq data are not topically scale free. Perhaps you should provide the chunck of code used to export the wgcna modules to cytoscape and how many samples did you use for the analysis

ADD REPLY
0
Entering edit mode

Thanks for getting back to me Andres, here is the code I used to export the network. If you think this chunk of code could be useful let me know and I will upload the rest of it.

####Exporting full edge and node file####
modules = c(unique(module_colors));
# Select module probes
probes = colnames(normalized_reads_transposed)
inModule = is.finite(match(module_colors, modules));
modProbes = probes[inModule];
modGenes = annot$Gene_names[match(modProbes, annot$Gene_ID)];
# Select the corresponding Topological Overlap
modTOM = TOM_matrix[inModule, inModule];
dimnames(modTOM) = list(modProbes, modProbes)
# Export the network into edge and node list files Cytoscape can read
cyt = exportNetworkToCytoscape(modTOM,
                               edgeFile = paste("C:/Users/s/Desktop/metabolite regulation/WGCNA/raw_data/signed_CytoscapeInput-edges-all", ".txt", sep=""),
                               nodeFile = paste("C:/Users/s/Desktop/metabolite regulation/WGCNA/raw_data/signed_CytoscapeInput-nodes-all", ".txt", sep=""),
                               weighted = TRUE,
                               threshold = 0,
                               nodeNames = modProbes,
                               altNodeNames = modGenes,
                               nodeAttr = module_colors[inModule])

####Cytoscape reading files####
cytoscapePing () # make sure cytoscape is open


edge <- read.delim("cytoscape files/signed_CytoscapeInput-edges-brown.txt")
colnames(edge)
colnames(edge) <- c("source", "target","weight","direction","fromAltName","toAltName")

node <- read.delim("cytoscape files/signed_CytoscapeInput-nodes-brown.txt")
colnames(node)  
colnames(node) <- c("id","altName","node_attributes") 

createNetworkFromDataFrames(node,edge[1:4000,], title="signed_network_weighted_magenta", collection="Arabidopsis transcript reads")
setVisualStyle('Marquee')
ADD REPLY
0
Entering edit mode

how did you get TOM_matrix and module_colors? Please, show me the code.

Also, can you show me the Scale Free Topology plot and the Mean Connectivity plot, and tell me on how many samples did you run WGCNA?

ADD REPLY
0
Entering edit mode

Okay in order:

  1. Plot for mean connectivity a scale free Topology: enter image description here

I picked the 22 according to the function i mentioned earlier

sft = pickSoftThreshold(normalized_reads_transposed, powerVector = powers, verbose = 5, networkType = 'signed') 
power=sft$powerEstimate #22
  1. How many samples 24 samples, as I have 8 timepoints (growth stages) each one has 3 biological replicates.

  2. How did I obtain TOM_matrix and module colors

     holder_corr<- cor #We will replace cor function with WCGNA cor function so this will hold the original
     cor<- WGCNA::cor
    
     flower_network<-blockwiseModules(normalized_reads_transposed,
                      power= 22, #Based on soft threshold function above
                      TOMType = 'signed', #signs of network used enforce adjacency signed networks
                      mergeCutHeight = '0.25', #Merge thresholds
                      numericLabels = F, #Colors
                      maxBlockSize = 20000, #clustering block size, data hads 18K genes so i want them in one block for 1 clustering
                      verbose = 3, #info output
                      corType = 'pearson', #Pearson
                      networkType = 'signed',
                      saveTOMs = T,
                      saveTOMFileBase = 'TOM',
                      randomSeed = 1234) #signed network for biological regulation
    
     cor <- holder_corr
    
     load("TOM-block.1.RData") #loads the saved TOM
    

For the chunk above I saved TOM matrix using saveTOM = T, and then loaded it using RData file that the function outputs, using load(etc..). That's how I obtained the TOM matrix, what was odd about it is every two similar correlations say me with myself, the usual correlation value should be 1 as it is from TOMfromExpression function gives out, but the TOM i loaded from blockmodules has these values set to 0. I think because it views them as distances? A distance between me and myself is 0? I'm not sure.

####exporting MEs####
MES<- flower_network$MEs #Module eigengenes
module_labels<- flower_network$colors
module_colors<- labels2colors(module_labels)
tree<- flower_network$dendrograms[[1]]
matrix_TOM<- as.matrix(tom_from_onestep) #The one loaded using RData

Obtaining the colors was using the $colors in flower_network (blockwisemodule output) and then I transferred the labels into colors. That's how I obtained the module_colors.

ADD REPLY
0
Entering edit mode

Obtaining the colors was using the $colors in flower_network (blockwisemodule output) and then I transferred the labels into colors. That's how I obtained the module_colors.

Your modules are already labeled as colors so you do not need

module_colors<- labels2colors(module_labels)

By the way, I don't think this is the problem. Would you mind sharing the normalized_reads_transposed file?

ADD REPLY
0
Entering edit mode

Ofcourse, I have the normalized reads file is the file that i normalized the raw data using vst, and there is another one that I had pre-normalized for me on log2 scale.

I will link both in a google drive below: https://drive.google.com/drive/folders/1fde_vSnInrISZc-TUm5YsJBk2A1P0C2w?usp=sharing

DEG: normalized log2 differentially expressed genes

Normalized reads: VST normalized genes.

Both come from the same rawfile just a different normalization attempt based on suggestions I got from biostar members that VST is preferred for WGCNA.

Note: Original rawcount was 29K~

Here is what I did to obtain the genes you see in normalized reads

####Genes quality check ####

gg<- goodSamplesGenes(transposed_raws_modified) #allOK = False #Left to default values 
table(gg$goodGenes) #2177 outliers
table(gg$goodSamples) #All good


#Let's transpose the dataframe back real quick so we can filter it out
raws_modified<- raws_modified[gg$goodGenes==T,] #Let's filter out the outliers we should have 26319 left
dim(raws_modified) #26319 24 perfect

rm(transposed_raws_modified) #for clarify in dataframes available and to avoid confusion


############
####Euclidean clustering ####
###########
sampleTree = hclust(dist(t(raws_modified), method = 'euclidean'), method = "average")

# plot sample tree
plot(sampleTree, main = "double checking if there are still outliers", sub="", xlab="",
     cex.lab = 1.5,cex.axis = 1.5, cex.main = 2)
rect.hclust(sampleTree , k = 7, border = 2:6)



#Based on clustering there are 3 samples that might be outliers, will be removed
#buds_G1, buds_F1, anthesis_E1
raws_modified<- raws_modified[,-c(19,16,13)] #removed possible outlier columns


##############
####DeSeq filteration and VST normalization####
##############
col_data<- as.data.frame(colnames(raws_modified))
colnames(col_data)<- 'column_data'
dds<- DESeqDataSetFromMatrix(countData = raws_modified, colData =col_data, design= ~1) #to do vst only

#WCGNA recommended this filteration step
#remove genes with counts <15 in 75% of samples
l<- length(colnames(raws_modified)) #lenght of samples
perc<- round(l*0.75) #Get 75% of them and round
dds75<- dds[rowSums(counts(dds)>=15)>=perc,] #7728 removed - 
dim(dds75) #21 samples, 18591

normalized_dds<- vst(dds75) #Normalize with varianace stabailization (vst)

normalized_reads_transposed<- assay(normalized_dds)%>%t() #transposed normalized reads

normalized_reads<- assay(normalized_dds) #untransposed normalized reads
dim(normalized_reads) #18591 x 21 
ADD REPLY
0
Entering edit mode

Thanks for the file.

I will try to replicate the analysis during weekend.

ADD REPLY
0
Entering edit mode

Thank you Andres.

ADD REPLY
3
Entering edit mode
12 weeks ago

Hi Faith,

The problem is edge[1:4000,] in combination with threshold = 0 in exportNetworkToCytoscape

By default every gene (node) in WGCNA is connected. If threshold = 0, every node in the edge file will have 18590 connections. With edge[1:4000,] in

createNetworkFromDataFrames(node,edge[1:4000,], title="signed_network_weighted_magenta", collection="Arabidopsis transcript reads")

You are selecting the first 4000 edges of the first node you have in the edge data.frame.

    > head(edge[1:4000,], 10)
        fromNode    toNode      weight  direction fromAltName toAltName
    1  AT1G01290 AT1G01350 0.003476798 undirected          NA        NA
    2  AT1G01290 AT1G01390 0.012741588 undirected          NA        NA
    3  AT1G01290 AT1G01420 0.003299341 undirected          NA        NA
    4  AT1G01290 AT1G01520 0.016798608 undirected          NA        NA
    5  AT1G01290 AT1G01540 0.044370724 undirected          NA        NA
    6  AT1G01290 AT1G01750 0.035806801 undirected          NA        NA
    7  AT1G01290 AT1G02080 0.012436343 undirected          NA        NA
    8  AT1G01290 AT1G03365 0.015908483 undirected          NA        NA
    9  AT1G01290 AT1G03430 0.011323847 undirected          NA        NA
    10 AT1G01290 AT1G03440 0.058240634 undirected          NA        NA
... -- omitted 3990 rows

This is why you always see a single node connected to multiple nodes.

In conclusion, set a threshold in exportNetworkToCytoscape and do not subset the edge dataframe file like you did in createNetworkFromDataFrames

ADD COMMENT
0
Entering edit mode

Hi Andres, I don't even know how to thank you for all the effort you put into helping me figure this out. I have a few more questions that I hope you could help me or provide insights on.

  1. You saw the two normalized files, for WGCNA would you recommend a DEG log2 genes, or VST filtered genes? I did a heatmap between modules and samples, the log2 followed sensible patterns of development while VST was more colorful (higher positive and negative values, but not following patterns of biological development across modules).

  2. If I have another data for say 100 genes that I want to investigate (growth genes maybe) and I want to see what other genes are regulating them, what would you suggest to do? I know I don't have traits to filter with, but what I did was filter out the 100 genes (X) I want from the edge files and see which genes are connected to them (Y). I feel like this way kinda takes out the bigger picture as some genes (Z) that might be connected to the genes (Y) connected to my 100 genes (X) are taken out, so I don't know what might be regulating the genes (Y) that regulate my 100 genes (X) and so on. Tried to label them with X Y Z so you don't get confused.

  3. Is there a better workflow than WGCNA to achieve my 2nd point?

  4. Why are they undirected despite me choosing "signed"?

  5. Is there a way to choose the threshold value? any mathematical way or function that can estimate a proper threshold or is it arbitrary?

I appreciate your help Andres, and sorry if my questions are too much.

ADD REPLY
0
Entering edit mode

If you have an answer to just 1 of my questions below I had be very grateful.

ADD REPLY
0
Entering edit mode

You saw the two normalized files, for WGCNA would you recommend a DEG log2 genes, or VST filtered genes?

Never use DEG with WGCNA

Why are they undirected despite me choosing "signed"?

The term "signed" does not have anything to do with the direction of the connection. Please, read the WGCNA manuscript: https://bmcbioinformatics.biomedcentral.com/articles/10.1186/1471-2105-9-559

Is there a way to choose the threshold value? any mathematical way or function that can estimate a proper threshold or is it arbitrary?

It is arbitrary, but you should focus on the genes with the highest intramodular connectivity

ADD REPLY
0
Entering edit mode

That was very informative. Thank you Andres.

If you could share with me your thoughts on how I filtered my data to obtain the vst dataset from rawcounts, that would be great. If there is anything wrong or missing, could you please let me know? I haven't worked on RNA data before so it's all new to me.

gg<- goodSamplesGenes(transposed_raws_modified) #Removed outlier genes from this function
sampleTree = hclust(dist(t(raws_modified), method = 'euclidean'), method = "average") #Removed 3 sample outliers here.
dds<- DESeqDataSetFromMatrix(countData = raws_modified, colData =col_data, design= ~1) #for vst and colData are just column names.
#remove genes with counts <15 in 75% of samples
l<- length(colnames(raws_modified)) #lenght of samples
perc<- round(l*0.75) #Get 75% of them and round
dds75<- dds[rowSums(counts(dds)>=15)>=perc,]
normalized_dds<- vst(dds75) #apply vst (This is my VST dataset here)

As for intramodular connectivity, I will read the paper again this week. But from my understanding, It's preferred that I build my network using high intramodular values, not highest weight values. Is that correct? So I will use the output of the function you mentioned instead of the output of TOM?

ADD REPLY
1
Entering edit mode

Here is how I tipically proceed:

  1. Filter out low count genes
  2. Apply the variance stabilizing transformation
  3. Run sampleTree or a PCA on vst transformed data for the identification of outliers
  4. remove the outliers in case you have any, and normalize again the data
  5. Run WGCA

So I will use the output of the function you mentioned instead of the output of TOM?

No. The TOM is the network. You can use the intramodular connectivity output to identify the nodes in your module with the highest intramodular connectivity

ADD REPLY
0
Entering edit mode

Hi Andres... I would like your help again.

I got the intramodular connectivity and understood what each column means. I still can't figure out the best way to filter my genes? should I merge the intramodular output with each node file I have and choose the highest 10 intramodular genes? or how do people filter if merging after importing the node file isn't smart?

ADD REPLY

Login before adding your answer.

Traffic: 2130 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6