Hey Tom,
As you've mentioned, WGCNA can indeed be used for RNA-seq data. They use WGCNA for all types of data in the lab where I was based in Boston. The developer even states it on his website:
Can WGCNA be used to analyze RNA-Seq data?
Yes. As far as WGCNA is concerned, working with (properly normalized) RNA-seq data isn't really any different from working with (properly normalized) microarray data.
We suggest removing features whose counts are consistently low (for
example, removing all features that have a count of less than say 10
in more than 90% of the samples) because such low-expressed features
tend to reflect noise and correlations based on counts that are mostly
zero aren't really meaningful. The actual thresholds should be based
on experimental design, sequencing depth and sample counts.
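As an illustration of the low-count filter described above (drop features with a count below 10 in more than 90% of samples), here is a minimal sketch. The thread is about R, but this is written in Python/numpy purely for illustration; the matrix, thresholds and shapes are all made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical count matrix: 200 genes (rows) x 20 samples (columns).
counts = rng.negative_binomial(5, 0.1, size=(200, 20))
counts[:50] = rng.poisson(1, size=(50, 20))  # 50 deliberately low-count genes

# Drop features with a count below 10 in more than 90% of the samples.
min_count, max_low_fraction = 10, 0.90
low_fraction = (counts < min_count).mean(axis=1)  # per-gene fraction of low samples
kept = counts[low_fraction <= max_low_fraction]

print(kept.shape[0], "genes kept of", counts.shape[0])
```

As the FAQ says, the actual thresholds should be tuned to your design, depth and sample counts; the 10-in-90% rule is just a starting point.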
We then recommend a variance-stabilizing transformation. For example,
package DESeq2 implements the function
varianceStabilizingTransformation which we have found useful, but one
could also start with normalized counts (or RPKM/FPKM data) and
log-transform them using log2(x+1). For highly expressed features, the
differences between full variance stabilization and a simple log
transformation are small.
Whether one uses RPKM, FPKM, or simply normalized counts doesn't make
a whole lot of difference for WGCNA analysis as long as all samples
were processed the same way. These normalization methods make a big
difference if one wants to compare expression of gene A to expression
of gene B; but WGCNA calculates correlations for which gene-wise
scaling factors make no difference. (Sample-wise scaling factors of
course do, so samples do need to be normalized.)
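The point about gene-wise scaling factors can be demonstrated directly: Pearson correlation is invariant to multiplying one variable by a constant. A toy sketch (hypothetical values, Python/numpy for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Two hypothetical gene-expression vectors across 30 samples.
gene_a = rng.normal(loc=10, scale=2, size=30)
gene_b = 0.5 * gene_a + rng.normal(scale=1, size=30)

r_raw = np.corrcoef(gene_a, gene_b)[0, 1]

# Applying a gene-wise scaling factor (e.g. dividing by gene length,
# as RPKM/FPKM do) leaves the correlation unchanged.
r_scaled = np.corrcoef(gene_a / 2500.0, gene_b)[0, 1]

print(r_raw, r_scaled)  # identical up to floating-point error
```

Sample-wise factors, by contrast, rescale values differently across the observations being correlated, which is why samples still need normalising.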
If data come from different batches, we recommend checking for batch
effects and, if needed, adjusting for them. We use ComBat for batch-effect
removal, but other methods should also work.
Finally, we usually check quantile scatterplots to make sure there are
no systematic shifts between samples; if sample quantiles show
correlations (which they usually do), quantile normalization can be
used to remove this effect.
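The quantile-normalisation step mentioned at the end can be sketched as follows. This is a deliberately simple Python/numpy version (ties are broken arbitrarily) on made-up data; in R one would typically use normalize.quantiles from the preprocessCore Bioconductor package:

```python
import numpy as np

def quantile_normalize(mat):
    """Give every column (sample) the same distribution: each value is
    replaced by the mean of the values sharing its rank across samples.
    (Simplistic version: ties are broken arbitrarily.)"""
    ranks = mat.argsort(axis=0).argsort(axis=0)      # rank of each value per column
    mean_sorted = np.sort(mat, axis=0).mean(axis=1)  # reference distribution
    return mean_sorted[ranks]

rng = np.random.default_rng(2)
# Hypothetical expression matrix: 1000 genes x 4 samples, one sample shifted.
mat = rng.lognormal(mean=2.0, sigma=1.0, size=(1000, 4))
mat[:, 3] *= 1.8  # a systematic shift in sample 4

norm = quantile_normalize(mat)
```

After this step, every sample has exactly the same value distribution, which removes the systematic shifts the quantile scatterplots are meant to detect.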
In summary, according to the developer, WGCNA works with any properly normalised data because it is fundamentally based on correlation, which it uses (in part) to identify modules. That said, in my experience, results will differ depending on the input distribution (e.g. FPKM versus TMM-normalised counts).
Regarding your specific queries:
The matrix should be FPKM values for each gene and each sample. Is
there any need to normalize these FPKM values across samples before
input? (For example, quantile normalization or TMM normalization)
If you have FPKM values, these are already normalised, and I would use them if that's all you've got. You might also consider transforming them to the Z scale via the zFPKM package prior to using WGCNA.
Should the matrix include all the genes or only
differentially-expressed-genes?
That depends on what you are hoping to achieve with WGCNA. Generally, it's conducted as an unsupervised analysis, i.e., nothing is filtered. If you do WGCNA on a DEG-filtered dataset, you'll need a very good justification for doing so (which is perfectly fine... you'll just have to state it).
Should the genes with any zero FPKM values be removed before input?
WGCNA can deal with missing values. However, read what they say about this here (see point 4 in the 'Runtime errors' section).
Should the FPKM values be transformed into log2(FPKM+1) scale before
processing?
If you want, sure. I'm not entirely comfortable with log-transforming FPKM data, though.
Good luck
Kevin
Hi Kevin,
Your detailed answer is highly appreciated.
I have a further question.
In fact, I have about 40,000 genes with FPKM values for the network construction. I tried increasing maxBlockSize to 40000, but then the calculation runs extremely slowly.
In this case, do you think I need to filter out some genes to reduce the total number for this analysis? If so, what criteria should I apply to remove genes? (FPKM > ? in ? samples)
Many thanks,
Tom
Hi Tom,
I think that 40,000 will eventually run to completion. However, you could remove transcripts/genes with low FPKM values, for example those with FPKM < 10.
Take a look here too (another Biostars thread with a similar issue): WGCNA maxBlockSize limit
Kevin
Hi Kevin,
Very useful answer thanks!
Just a question about using this on a dataset: if I have a series of 44 samples with two levels within them, and the aim is to compare the two levels, would it be advisable to use all 44 samples in one analysis, or to separate the two and run a separate analysis on each? Within each level, the samples were taken at several different points in time.
Many Thanks
With WGCNA, what some do is analyse them all together and then correlate your categorical variable (encoding the two levels) with each resulting module's values; the variable would have to be encoded numerically. I somewhat question that approach and would also produce a separate network for each of the two levels, and then compare the resulting modules and the genes within them.
You may want to read through this: Gene co-expression analysis for functional classification and gene-disease predictions. Via my own network approach (Network plot from expression data in R using igraph), I always produce 2 networks (e.g. for case vs control) and then compare the genes across them based on various metrics.
Those are such useful links! Thanks so much for that. Just another quick (and probably stupid/obvious question), how do you deal with duplicate genes in the count matrices? Do you just collapse the counts? Or something else? If so, how would you advise collapsing them?
The Ensembl IDs for the genes are all different, but the gene symbols themselves are sometimes duplicated. Possibly doing the WGCNA with the Ensembl IDs could work, and then renaming them later?
No problem. On the other question, if you ask 2 people, you'll get 2 different answers! Some people prefer to keep the Ensembl IDs as they are, do the entire analysis, and then only report HUGO/HGNC gene names in the final results/figures. Others convert Ensembl to HGNC from the very start, take the mean value across duplicates, and use that (i.e. the mean) for the analysis. Of course, if one is interested in a particular isoform of a gene, then the strategy again has to be different. So, no real answer. Let me see if I can get some other ideas from others...
If you're working with human, mouse or zebrafish, any duplicated gene names with multiple Ensembl gene IDs are likely to be down to patches and haplotypes, which are alternative versions of genomic regions. We have a help video explaining these. Many people choose to ignore the patches and haplotypes, and if you're using the reference files from Ensembl, you can get these by picking the ones named "Primary assembly".
If you're working with something other than human, mouse and zebrafish and you see duplicated names, it's an error, email us helpdesk@ensembl.org, and we will do our best to fix it.
Thanks, Kevin, for your always helpful answers. I've got CPM (counts per million) values for a few samples (3 treatment and 3 control) obtained from an RNA-seq analysis. I'm going to do a co-expression analysis, but the data cannot be normalised because RNA-seq data follow a negative binomial distribution, yes? However, could you please let me know how to check whether the data are normalised? Also, please kindly suggest a normalisation method to apply before the co-expression analysis. Many thanks in advance.
Hey, from where did you obtain the CPM data? CPM is, technically, already normalised. You could increment (+1) the data and then transform via log [base 2]. Check the distribution via hist() (in R).
My friend produced them. hist() did not show a normal distribution for the CPM data. I transformed them to log2(CPM+1), but they are still not normally distributed. Here is the plot after transformation. Could you please tell me your suggestion?
Do you know how s/he (it?) produced them?
As far as I know, the reads were mapped to hg38 via the ENCODE RNA-seq pipeline, and the DESeq2 package was used for the differential expression analysis. Kevin, I evaluated normality on just the DEGs, not all genes. Now I've found that log2(CPM+1), not CPM, of all genes has a normal distribution. However, I'm going to do the co-expression analysis on just the DEGs, so is that possible in this situation (non-normal distribution)?
I see - thank you for explaining. If your metric for the determination of relationships between genes is based on correlation, as is usually the case in co-expression networks, then, technically speaking, your data does not have to follow a normal distribution. However, something like WGCNA is usually performed on the entire dataset. The interpretation of the network changes if you just use the DEGs.
Do you even need to do co-expression analysis? What is your goal in doing it?
Hi Kevin, thank you for your comment. Actually, I'm going to create two co-expression networks, one for cases and one for controls, separately, to find differentially co-expressed modules between the two networks and, eventually, to determine the possible response mechanism to the disease. That's why I wanted to use the DEGs as input for the co-expression analysis, calculating Pearson correlation coefficients. Could you please share your thoughts on this? Is it an appropriate and powerful method, given that I'm using DEGs (which do not follow a normal distribution even after log2(CPM+1)) and Pearson correlation coefficients, rather than all genes and WGCNA? Also, my sample size is small (3 cases and 3 controls), and I read somewhere that WGCNA is not recommended when the sample size is less than 15.
Ah, that is indeed quite low (sample number). The idea of using WGCNA, though, would be to include all samples and then to see whether case/control status correlates with any of the identified modules.
Performing a network analysis separately for cases and controls is not unusual, but you may also try a very simplistic approach, like here: Network plot from expression data in R using igraph
I would also use Spearman correlation, given your low sample number and non-normal distribution. Pearson correlation is a parametric test.
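To illustrate why a rank-based statistic helps with skewed, small-n data: Spearman works on ranks, so a single extreme value does not dominate the coefficient. A toy sketch with made-up numbers (Python/scipy purely for illustration; in R you'd pass method="spearman" to cor()):

```python
import numpy as np
from scipy import stats

# Hypothetical expression of two genes across 6 samples; the relationship
# is perfectly monotonic, but one sample has an extreme value for gene x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 100.0])
y = np.array([2.0, 4.0, 5.0, 7.0, 8.0, 9.0])

pearson_r, _ = stats.pearsonr(x, y)    # dragged down by the outlier
spearman_r, _ = stats.spearmanr(x, y)  # rank-based: sees a perfect monotonic trend

print(pearson_r, spearman_r)
```

With only 6 samples, a single aberrant measurement can swing Pearson substantially, whereas Spearman only cares about the ordering.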
Kevin, many thanks for your explanation. Sorry, just two other questions: 1) what's the rationale behind using all samples for the co-expression analysis, and how can we then compare the important modules between case and control? 2) if, after log2(x+1) transformation, some samples follow a normal distribution and others don't, could you please let me know what we should do in this situation (any tool or solution)?
Similar idea as doing PCA on all samples. If we do PCA, we may notice separation of cases and controls along PC1; so, PC1 would be statistically significantly correlated with case/control status. In WGCNA, I believe, PCA is performed separately on each module, which produces an eigengene (the module's first principal component) for each module. If we find that, for example, the blue module correlates with case/control status, then we can explore the genes in the blue module further.
You should also check a PCA bi-plot to see if there are sample outliers.
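The idea of taking a principal component per module and correlating it with the trait (WGCNA calls this first PC the "module eigengene") can be sketched numerically. Toy data, Python/numpy for illustration only; the shapes, shift and encoding are all invented:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical module: 20 samples (10 controls = 0, 10 cases = 1) x 50 genes,
# with cases shifted upwards so the module tracks disease status.
group = np.repeat([0, 1], 10)
expr = rng.normal(size=(20, 50)) + group[:, None] * 2.0

# First principal component of the (centred) module expression.
centred = expr - expr.mean(axis=0)
u, s, vt = np.linalg.svd(centred, full_matrices=False)
eigengene = u[:, 0] * s[0]

# Correlate the eigengene with the numerically-encoded trait.
r = np.corrcoef(eigengene, group)[0, 1]
print(abs(r))  # close to 1: the module separates cases from controls
```

A module whose eigengene correlates strongly with the trait is the one whose member genes you would then explore further.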
Thank you very much for all your help.
Hello seta. It just occurred to me... if you want to verify 'normality' for each gene, you can use the Shapiro-Wilk normality test.
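A minimal sketch of that per-gene normality check, on made-up data (shown in Python/scipy for illustration; the base-R equivalent is shapiro.test()):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical expression of two genes across 50 samples.
normal_gene = rng.normal(loc=8, scale=1, size=50)
skewed_gene = rng.lognormal(mean=1, sigma=1, size=50)  # clearly non-normal

_, p_normal = stats.shapiro(normal_gene)
_, p_skewed = stats.shapiro(skewed_gene)

# A small p-value rejects the null hypothesis of normality.
print(p_normal, p_skewed)
```

If you test many genes this way, remember that some p-values below 0.05 are expected by chance, so interpret them in aggregate rather than gene by gene.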
Kevin, Many thanks for sharing it.
Hi Kevin,
I'm dealing with something similar, and wanted to check some things:
I have a lot of public RNA-seq datasets from different experiments and genotypes sharing similar conditions/tissues. I want to build co-expression networks from them.
I did all of the processing to get the raw counts for every sample. I removed the samples with low counts (I rejected samples with fewer than 3 million reads in total) and the lowly expressed genes (I only kept genes with more than 2 CPM in at least 80% of samples). I ended up with a pretty decent matrix of >1200 samples x 15k genes, showing a nice bell curve when plus-one-log-transformed.
Now, the next steps are to normalize the counts and then try to deal with batch effects.
I read that, instead of log(), a better way of normalising is to use a variance-stabilizing transformation (VST), and here is where my doubts start:
Should the VST be applied to the raw counts, or to the CPMs?
Should the VST be applied to all the (unfiltered) samples together, or per batch, or at least per tissue?
Either way, if at some point later I reject samples as outliers, I will have to repeat the VST without the rejected samples, right?
Thanks in advance!
Hey, as you are referring to VST, you must be using DESeq2?
The idea would be to normalise the raw counts (all samples combined) using the standard DESeq() function, and then to apply the vst() function to transform the normalised counts to variance-stabilised expression levels. All samples should be used in both cases.
If using edgeR, then, after normalisation, the transformation would be done as log2(CPM + 0.25).
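Numerically, that edgeR-style transformation amounts to something like the following simplified sketch (made-up counts, Python/numpy for illustration; edgeR's actual cpm(y, log=TRUE, prior.count=0.25) also adjusts the prior count for library size, so use that function in practice):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical raw count matrix: 500 genes x 6 samples.
counts = rng.poisson(lam=40, size=(500, 6)).astype(float)

lib_sizes = counts.sum(axis=0)   # per-sample library sizes
cpm = counts / lib_sizes * 1e6   # counts per million
log_cpm = np.log2(cpm + 0.25)    # log2(CPM + 0.25); prior avoids log2(0)

print(log_cpm.shape)
```

The +0.25 prior count keeps zeros finite on the log scale while barely affecting well-expressed genes.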
Hi! Thanks for the quick reply!!!
I'm using the varianceStabilizingTransformation() function from DESeq2, but I'm applying it to my raw counts matrix without using the DESeq() function. I'm doing this because I wasn't sure what model I should input to the function, nor how exactly that could affect the outcome.
Anyway, if I filter out some samples, should the transformation be repeated?
Thank you so much!
In this case, DESeq2 will detect that no normalisation has been performed and will do it for you automatically.
Yes, it is my belief that you should re-normalise after having removed outliers.