The RNA-Seq data input for WGCNA in terms of gene co-expression network construction
6.3 years ago
wangdp123 ▴ 340

Hi there,

I am learning to use an RNA-Seq data matrix with WGCNA to build a gene co-expression network.

I know WGCNA was designed for microarray data, but, with appropriate handling, it is also applicable to RNA-Seq data.

Regarding the format of the input data, I have several concerns:

  1. The matrix should be FPKM values for each gene and each sample. Is there any need to normalize these FPKM values across samples before input? (For example, quantile normalization or TMM normalization.)
  2. Should the matrix include all genes, or only differentially expressed genes?
  3. Should the genes with any zero FPKM values be removed before input?
  4. Should the FPKM values be transformed into log2(FPKM+1) scale before processing?

Thanks a lot,

Regards,

Tom

Tags: WGCNA • network
6.3 years ago

Hey Tom,

As you've mentioned, WGCNA can indeed be used for RNA-seq data. They use WGCNA for all types of data in the lab where I was based in Boston. The developer even states as much on his website:

Can WGCNA be used to analyze RNA-Seq data?

Yes. As far as WGCNA is concerned, working with (properly normalized) RNA-seq data isn't really any different from working with (properly normalized) microarray data.

We suggest removing features whose counts are consistently low (for example, removing all features that have a count of less than say 10 in more than 90% of the samples) because such low-expressed features tend to reflect noise and correlations based on counts that are mostly zero aren't really meaningful. The actual thresholds should be based on experimental design, sequencing depth and sample counts.

We then recommend a variance-stabilizing transformation. For example, package DESeq2 implements the function varianceStabilizingTransformation which we have found useful, but one could also start with normalized counts (or RPKM/FPKM data) and log-transform them using log2(x+1). For highly expressed features, the differences between full variance stabilization and a simple log transformation are small.

Whether one uses RPKM, FPKM, or simply normalized counts doesn't make a whole lot of difference for WGCNA analysis as long as all samples were processed the same way. These normalization methods make a big difference if one wants to compare expression of gene A to expression of gene B; but WGCNA calculates correlations for which gene-wise scaling factors make no difference. (Sample-wise scaling factors of course do, so samples do need to be normalized.)

If data come from different batches, we recommend checking for batch effects and, if needed, adjusting for them. We use ComBat for batch effect removal but other methods should also work.

Finally, we usually check quantile scatterplots to make sure there are no systematic shifts between samples; if sample quantiles show correlations (which they usually do), quantile normalization can be used to remove this effect.

In summary, according to the developer, WGCNA is fine with any properly normalised data because it is fundamentally based on correlation, which it uses (in part) to identify modules. Even so, in my experience, results will differ depending on the input distribution (e.g. FPKM versus TMM-normalised counts).
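As a concrete illustration of those quoted recommendations, a minimal preprocessing sketch in R might look like this (`counts`, `batch` and all thresholds are illustrative assumptions, not code from the developer):

```r
library(DESeq2)
library(sva)

# 'counts': raw count matrix, genes x samples (illustrative)
# 'batch' : vector of batch labels, one per sample (illustrative)

# 1. Remove features with a count below 10 in more than 90% of samples
keep   <- rowSums(counts >= 10) >= ceiling(0.1 * ncol(counts))
counts <- counts[keep, ]

# 2. Variance-stabilising transformation (or log2(x + 1) on normalised values)
dds  <- DESeqDataSetFromMatrix(countData = counts,
                               colData   = data.frame(sample    = colnames(counts),
                                                      row.names = colnames(counts)),
                               design    = ~ 1)
vsd  <- varianceStabilizingTransformation(dds, blind = TRUE)
expr <- assay(vsd)               # genes x samples, log2-like scale

# 3. Adjust for batch effects with ComBat (only if batches are present)
expr <- ComBat(dat = expr, batch = batch)

# (optionally, quantile-normalise with preprocessCore::normalize.quantiles()
#  if sample quantiles show systematic shifts)

# 4. WGCNA expects samples in rows and genes in columns
datExpr <- t(expr)
```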


Regarding your specific queries:

The matrix should be FPKM values for each gene and each sample. Is there any need to normalize these FPKM values across samples before input? (For example, quantile normalization or TMM normalization.)

If you have FPKM values, these are already normalised, and I would use them if that's all you've got. You may also consider transforming them to the Z-scale via the zFPKM package prior to using WGCNA.
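If you go the zFPKM route, a sketch (assuming a data.frame `fpkm_df` of FPKM values, genes in rows; the -3 cutoff follows the zFPKM authors' suggestion and is only a starting point):

```r
library(zFPKM)

# 'fpkm_df': data.frame of FPKM values, genes in rows, samples in columns (illustrative)
z <- zFPKM(fpkm_df)

# keep genes whose mean zFPKM exceeds -3, a commonly cited "actively expressed" cutoff
active      <- rowMeans(z) > -3
fpkm_active <- fpkm_df[active, ]
```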

Should the matrix include all genes, or only differentially expressed genes?

That depends on what you are hoping to achieve with WGCNA. Generally, it's conducted as an unsupervised analysis, i.e., nothing is filtered. If you run WGCNA on a DEG-filtered dataset, you'll need a very good reason for doing so (which is perfectly fine... you'll just have to justify it).

Should the genes with any zero FPKM values be removed before input?

WGCNA can deal with missing values. However, read what the developers say about this here (see point 4 in the section "Runtime errors").

Should the FPKM values be transformed into log2(FPKM+1) scale before processing?

If you want, sure. Logging FPKM data feels uncomfortable to me, though.

Good luck

Kevin


Hi Kevin,

Your detailed answer is highly appreciated.

I have a further question.

In fact, I have about 40,000 genes with FPKM values for the network construction, and when I increase maxBlockSize to 40000, the calculation runs extremely slowly.

In this case, do you think I need to filter out some genes to reduce the total number for this analysis? If so, what criteria should I apply to remove them? (FPKM > ? in ? samples)

Many thanks,

Tom


Hi Tom,

I think that 40,000 will eventually run to completion. However, you could remove transcripts/genes with low FPKM values, for example those with FPKM < 10.

Take a look here too (another Biostars thread with a similar issue): WGCNA maxBlockSize limit

Kevin


Hi Kevin,

Very useful answer thanks!

Just a question about using this on a set of data: if I have a series of 44 samples with two levels within them, and the aim is to compare the two levels, would it be advisable to use all 44 samples in one analysis, or to separate the two levels and run a separate analysis on each? Within each level, several samples were taken at different points in time.

Many Thanks


With WGCNA, what some do is analyse all samples together and then correlate the categorical variable (encoding the two levels) with each resulting module's eigengene values; the variable would have to be encoded numerically. I somewhat question that approach and would also produce a separate network for each of the two levels, and then compare the resulting modules and the genes within them.

You may want to read through this: Gene co-expression analysis for functional classification and gene-disease predictions. Via my own network approach (Network plot from expression data in R using igraph), I always produce two networks (e.g. case and control) and then compare the genes across them based on metrics like:

  • hub score
  • betweenness centrality
  • closeness centrality
  • vertex degree
  • community structure

Those are such useful links! Thanks so much. Just another quick (and probably stupid/obvious) question: how do you deal with duplicate genes in the count matrices? Do you just collapse the counts, or something else? If so, how would you advise collapsing them?

The Ensembl IDs for the genes are all different, but the gene symbols themselves are sometimes duplicated. Possibly doing the WGCNA with the Ensembl IDs could work, and then renaming them later?


No problem. On the other question: if you ask two people, you'll get two different answers! Some prefer to keep the Ensembl IDs as they are, do the entire analysis, and then only report HUGO/HGNC gene names in the final results/figures. Others convert Ensembl IDs to HGNC symbols from the very start, take the mean value across duplicates, and use that mean for the analysis. Of course, if one is interested in a particular isoform of a gene, then the strategy again has to be different. So, no single answer. Let me see if I can get some other ideas from others...


If you're working with human, mouse or zebrafish, any duplicated gene names with multiple Ensembl gene IDs are likely to be down to patches and haplotypes, which are alternative versions of genomic regions. We have a help video explaining these. Many people choose to ignore the patches and haplotypes, and if you're using the reference files from Ensembl, you can get these by picking the ones named "Primary assembly".

If you're working with something other than human, mouse or zebrafish and you see duplicated names, it's an error; email us at helpdesk@ensembl.org and we will do our best to fix it.


Thanks, Kevin, for your always helpful answers. I've got CPM (counts per million) values for a few samples (3 treatment and 3 control) obtained by RNA-seq analysis. I'm going to do co-expression analysis, but the data cannot be normally distributed, as RNA-seq counts follow a negative binomial distribution, yes? However, could you please let me know how to check whether the data are normally distributed? Also, please kindly suggest a normalization method to apply before the co-expression analysis. Many thanks in advance.


Hey, from where did you obtain the CPM data? CPM values are, technically, already normalised. You could increment the data (+1) and then transform via log base 2. Check the distribution via hist() in R.
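A sketch of that check, assuming the CPM values are in a matrix called `cpm_mat` (an illustrative name):

```r
# 'cpm_mat': CPM matrix, genes x samples (illustrative)
log_cpm <- log2(cpm_mat + 1)

# inspect the distribution; a roughly bell-shaped histogram is what you hope to see
hist(log_cpm, breaks = 100, main = "log2(CPM + 1)", xlab = "expression")
```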


My friend produced them. hist() did not show a normal distribution for the CPM data. I transformed them to log2(CPM+1), but they are still not normally distributed; here is the plot after transformation. Could you please give me your suggestion?


Do you know how s/he (it?) produced them?


As far as I know, the reads were mapped to hg38 via the ENCODE RNA-seq pipeline, and the DESeq2 package was used for differential expression analysis. Kevin, I had evaluated normality on just the DEGs, not all genes. Now I find that the log2(CPM+1) values of all genes, but not the raw CPM values, follow a normal distribution. However, I'm going to do co-expression analysis on just the DEGs, so is that possible in this situation (a non-normal distribution)?


I see - thank you for explaining. If your metric for determining relationships between genes is based on correlation, as is usually the case in co-expression networks, then, technically speaking, your data do not have to follow a normal distribution. However, something like WGCNA is usually performed on the entire dataset; the interpretation of the network changes if you use just the DEGs.

Do you even need to do co-expression analysis? What is your goal in doing it?


Hi Kevin, thank you for your comment. Actually, I'm going to create two co-expression networks, one for cases and one for controls, to find differentially co-expressed modules between the two networks and, eventually, to determine the possible response mechanism to the disease. That's why I wanted to use the DEGs as input for the co-expression analysis, calculating Pearson correlation coefficients. Could you please share your thoughts on this? Is it an appropriate and powerful method, in terms of both using DEGs (which were not normally distributed, even after log2(CPM+1) transformation) and Pearson correlation coefficients, rather than using all genes and WGCNA? My sample size is also small (3 cases and 3 controls), and I read somewhere that WGCNA is not recommended when the sample size is less than 15.


Ah, that is indeed quite a low sample number. The idea of using WGCNA, though, would be to include all samples and then see whether case|control status correlates with any of the identified modules.

Performing a network analysis separately for cases and controls is not unusual, but you may try a very simplistic approach, like here: Network plot from expression data in R using igraph

Given your low sample number and non-normal distribution, I would also use Spearman correlation; Pearson correlation is a parametric test.
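In base R, switching correlation methods is a one-argument change (`datExpr` is an illustrative samples-by-genes matrix, not a name from this thread):

```r
# 'datExpr': samples in rows, genes in columns (illustrative)
cor_pearson  <- cor(datExpr, method = "pearson")   # parametric
cor_spearman <- cor(datExpr, method = "spearman")  # rank-based, more robust here
```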


Kevin, many thanks for your explanation. Sorry, just two other questions: 1) what is the reasoning behind using all samples for co-expression analysis, and how can we compare the important modules between case and control in that setting? 2) Suppose that, after log2(x+1) transformation, some samples follow a normal distribution and others don't. Could you please let me know what to do in this situation? Any tool, any solution?


1) what is the reasoning behind using all samples for co-expression analysis, and how can we compare the important modules between case and control in that setting?

It's a similar idea to doing PCA on all samples. If we do PCA, we may notice separation of cases and controls along PC1; PC1 would then be statistically significantly correlated with case|control status. In WGCNA, I believe, PCA is performed separately on each module, which produces an eigengene (the first principal component) for each module. If we find that, for example, the blue module correlates with case|control status, then we can further explore the genes in the blue module.
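In WGCNA terms, that module-trait check might be sketched as follows (`datExpr`, `moduleColors` and the numeric `status` vector are illustrative names, not code from this thread):

```r
library(WGCNA)

# 'datExpr'     : samples x genes expression matrix (illustrative)
# 'moduleColors': module assignment per gene, from the network construction (illustrative)
# 'status'      : numeric case|control encoding, e.g. 0/1, one value per sample (illustrative)
MEs <- moduleEigengenes(datExpr, colors = moduleColors)$eigengenes

# correlate each module eigengene with case|control status
moduleTraitCor  <- cor(MEs, status, use = "p")
moduleTraitPval <- corPvalueStudent(moduleTraitCor, nSamples = nrow(datExpr))
```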

2) Suppose that, after log2(x+1) transformation, some samples follow a normal distribution and others don't. Could you please let me know what to do in this situation? Any tool, any solution?

You should check a PCA bi-plot to see if there are sample outliers.


Thank you very much for all your help.


Hello seta. It just occurred to me: if you want to verify 'normality' for each gene, you can use the Shapiro-Wilk normality test.
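A gene-wise sketch (`expr` is an illustrative genes-by-samples matrix; note that shapiro.test() requires between 3 and 5000 observations per gene):

```r
# one Shapiro-Wilk p-value per gene (rows of 'expr'; illustrative name)
shapiro_p <- apply(expr, 1, function(g) shapiro.test(g)$p.value)

# genes for which normality is rejected at alpha = 0.05
non_normal <- names(shapiro_p)[shapiro_p < 0.05]
```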


Kevin, Many thanks for sharing it.


Hi Kevin,

I'm dealing with something similar, and wanted to check some things:

I have a lot of public RNA-seqs from different experiments and genotypes sharing similar conditions/tissues. I want to build co-expressions networks out of them.

I did all the processing to get the raw counts for every sample. I removed the samples with low counts (rejecting samples with fewer than 3 million reads in total) and the lowly expressed genes (keeping only genes with more than 2 CPM in at least 80% of samples). I ended up with a pretty decent matrix of >1200 samples x 15k genes, showing a nice bell curve when plus-one-log-transformed.

Now, the next steps are to normalize the counts and then try to deal with batch effects.

I read that, instead of log(), a better way of normalizing is to use a variance-stabilizing transformation, and here my doubts start:

  • Should the VST be applied to the raw counts, or to the CPMs?

  • Should the VST be applied to all the (non-filtered) samples together, or per batch, or at least per tissue?

  • Either way, if at some later point I reject samples as outliers, I will have to repeat the VST without the rejected samples, right?

Thanks in advance!


Hey, as you are referring to VST, you must be using DESeq2?

The idea would be to normalise the raw counts (all samples combined) using the standard DESeq() function, and then to apply the vst() function to transform the normalised counts to variance-stabilised expression levels. All samples should be used in both cases.
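Assuming the raw count matrix is called `raw_counts` with sample annotations in `sample_info` (illustrative names), that workflow might be sketched as:

```r
library(DESeq2)

# 'raw_counts' : integer count matrix, genes x samples (illustrative)
# 'sample_info': data.frame of sample annotations, rows matching columns of 'raw_counts'
dds <- DESeqDataSetFromMatrix(countData = raw_counts,
                              colData   = sample_info,
                              design    = ~ 1)
dds <- DESeq(dds)             # estimates size factors and dispersions (normalisation)
vsd <- vst(dds, blind = TRUE)
mat <- assay(vsd)             # variance-stabilised expression, genes x samples
```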

If using edgeR, then, after normalisation, the transformation would be done as log2 CPM with a small prior count (0.25):

cpm(x, normalized.lib.sizes=TRUE, log=TRUE, prior.count=0.25)

Hi! Thanks for the quick reply!!!

I'm using the varianceStabilizingTransformation() function from DESeq2, but I'm applying it to my raw counts matrix without using the DESeq() function. I'm doing this because I wasn't sure what model I should pass to the function, or how exactly that could affect the outcome.

Anyway, if I filter out some samples, should the transformation be repeated?

Thank you so much!


I'm applying it to my raw counts matrix without using the DESeq() function.

In this case, DESeq2 will detect that no normalisation has been performed and will do it for you automatically.

Anyway, if I filter out some samples, should the transformation be repeated?

Yes, it is my belief that you should re-normalise after having removed outliers.
