Question: The RNA-Seq data input for WGCNA in terms of gene co-expression network construction
1
gravatar for wangdp123
17 months ago by
wangdp123140
Oxford
wangdp123140 wrote:

Hi there,

I am learning to use RNA-Seq data matrix with WGCNA to build up gene-co-expression network.

I know WGCNA was designed for the microarray data but after appropriate handling, it is also applicable to the RNA-Seq data.

Regarding the format of the input data, I have several concerns:

  1. The matrix should be FPKM values for each gene and each sample. Is there any need to normalize there FPKM values across samples before input ? (For example, quantile normalization or TMM-normalization)
  2. Should the matrix include all the genes or only differentially-expressed-genes?
  3. Should the genes with any zero FPKM values be removed before input?
  4. Should the FPKM values be transformed into log2(FPKM+1) scale before processing?

Thanks a lot,

Regards,

Tom

network wgcna • 3.8k views
ADD COMMENTlink modified 17 months ago by Kevin Blighe41k • written 17 months ago by wangdp123140
4
gravatar for Kevin Blighe
17 months ago by
Kevin Blighe41k
London, England
Kevin Blighe41k wrote:

Hey Tom,

As you've mentioned, WGCNA can indeed be used for RNA-seq data. They use WGCNA for all types of data in the lab where I was based in Boston, which is where the developer of WGCNA happened to also do postdoctoral work. He even states it on his website:

Can WGCNA be used to analyze RNA-Seq data?

Yes. As far as WGCNA is concerned, working with (properly normalized) RNA-seq data isn't really any different from working with (properly normalized) microarray data.

We suggest removing features whose counts are consistently low (for example, removing all features that have a count of less than say 10 in more than 90% of the samples) because such low-expressed features tend to reflect noise and correlations based on counts that are mostly zero aren't really meaningful. The actual thresholds should be based on experimental design, sequencing depth and sample counts.

We then recommend a variance-stabilizing transformation. For example, package DESeq2 implements the function varianceStabilizingTransformation which we have found useful, but one could also start with normalized counts (or RPKM/FPKM data) and log-transform them using log2(x+1). For highly expressed features, the differences between full variance stabilization and a simple log transformation are small.

Whether one uses RPKM, FPKM, or simply normalized counts doesn't make a whole lot of difference for WGCNA analysis as long as all samples were processed the same way. These normalization methods make a big difference if one wants to compare expression of gene A to expression of gene B; but WGCNA calculates correlations for which gene-wise scaling factors make no difference. (Sample-wise scaling factors of course do, so samples do need to be normalized.)

If data come from different batches, we recommend to check for batch effects and, if needed, adjust for them. We use ComBat for batch effect removal but other methods should also work.

Finally, we usually check quantile scatterplots to make sure there are no systematic shifts between samples; if sample quantiles show correlations (which they usually do), quantile normalization can be used to remove this effect.

In summary, according to the developer, WGCNA is fine with any normalised data because it's fundamentally based on correlation, which it in part uses to identify the modules. Even though he states this, from my experience, results will differ based on different input distributions (e.g. FPKM or TMM-normalsied counts).


Regarding your specific queries:

The matrix should be FPKM values for each gene and each sample. Is there any need to normalize there FPKM values across samples before input ? (For example, quantile normalization or TMM-normalization)

If you have FPKM values, these are already normalised and I would use these if that's all that you've got. You may also consider transforming these to Z-scale via zFPKM package prior to using WGCNA.

Should the matrix include all the genes or only differentially-expressed-genes?

That depends on what you are hoping to achieve by doing WGCNA. Generally, it's conducted as an unsupervised analysis, i.e., nothing is filtered. If you do WGCNA on a DEG-filtered dataset, you'll require a very good excuse for doing this (which is perfectly fine...you'll just have to justify it).

Should the genes with any zero FPKM values be removed before input?

WGCNA can deal with missing values. However read what they say on this here (see point 4).

Should the FPKM values be transformed into log2(FPKM+1) scale before processing?

If you want, sure. Logging FPKM data feels uncomfortable to me, though.

Good luck

Kevin

ADD COMMENTlink modified 10 weeks ago • written 17 months ago by Kevin Blighe41k

Hi Kevin,

Your detailed answer is highly appreciated.

I have a further question.

In fact, I have about 40000 genes with FPKMs for the network construction, and I try to increase maxBlockSize = 40000, in which way the calculation is running extremely slow.

In this case, do you think I need to filter out some of the genes to decrease the number of total genes for this analysis. If yes, what kind of criteria should I apply to remove the genes? (FPKM>? in ? samples)

Many thanks,

Tom

ADD REPLYlink written 17 months ago by wangdp123140

Hi Tom,

I think that 40,000 will eventually run to completion. However, you could remove transcripts/genes with low FPKM values, for example, those with FPKM<10

Take look here also (another Biostars thread with a similar issue): WGCNA maxBlockSize limit

Kevin

ADD REPLYlink written 17 months ago by Kevin Blighe41k

Hi Kevin,

Very useful answer thanks!

Just a question about using this on a set of data, If I have a series of 44 samples with two levels within those 44 samples and the aim is to compare the two levels, would it be advisable to use all 44 samples in the one analysis or to separate the two and run a separate analysis on each? In each level the samples are several taken at different points in time.

Many Thanks

ADD REPLYlink modified 11 months ago • written 11 months ago by MarjoryMollusc40

With WGCNA, what some do is to analyse them all together and to then correlate your categorical variable (that encodes the two levels) to each resulting module's values. They would have to be encoded numerically. I somewhat question that approach and would also produce a separate network for the two levels, and then compare the resulting modules and the genes within them.

You may want to read through this: Gene co-expression analysis for functional classification and gene-disease predictions. Via my own network approach ( Network plot from expression data in R using igraph ), I always produce 2 networks (e.g. for case control) and then compare the genes across each based on metrics like:

  • hub score
  • betweenness centrality
  • closeness centrality
  • vertex degree
  • community structure
ADD REPLYlink modified 10 weeks ago • written 11 months ago by Kevin Blighe41k

Those are such useful links! Thanks so much for that. Just another quick (and probably stupid/obvious question), how do you deal with duplicate genes in the count matrices? Do you just collapse the counts? Or something else? If so, how would you advise collapsing them?

The Ensembl symbols for the genes are all different, but the gene symbols themselves are sometimes duplicated. Possible doing the WGCNA with the ensembl could work and then rename them later?

ADD REPLYlink modified 11 months ago • written 11 months ago by MarjoryMollusc40
2

If you're working with human, mouse or zebrafish, any duplicated gene names with multiple Ensembl gene IDs are likely to be down to patches and haplotypes, which are alternative versions of genomic regions. We have a help video explaining these. Many people choose to ignore the patches and haplotypes, and if you're using the reference files from Ensembl, you can get these by picking the ones named "Primary assembly".

If you're working with something other than human, mouse and zebrafish and you see duplicated names, it's an error, email us helpdesk@ensembl.org, and we will do our best to fix it.

ADD REPLYlink written 11 months ago by Emily_Ensembl18k
1

No problem. On the other question, if you ask 2 people, you'll get 2 different answers! Some people do prefer to just keep the ensembl IDs as they are, do the entire analysis, and then only report HUGO / HGNC gene names in the final results/figures. Others go ahead and convert ensembl to HGNC from the very start and then take the mean value from those and use that (i.e. the mean) for analysis. Of course, if one is interested in a particular isoform of a gene, then the strategy again has to be different. So, no real answer. Let me see if I can get some other ideas from others...

ADD REPLYlink modified 11 months ago • written 11 months ago by Kevin Blighe41k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 647 users visited in the last hour