As you've mentioned, WGCNA can indeed be used for RNA-seq data. WGCNA was used for all types of data in the Boston lab where I was based, which happens to be where the developer of WGCNA also did his postdoctoral work. He even addresses this directly in the package FAQ on his website:
Can WGCNA be used to analyze RNA-Seq data?
Yes. As far as WGCNA is concerned, working with (properly normalized)
RNA-seq data isn't really any different from working with (properly
normalized) microarray data.
We suggest removing features whose counts are consistently low (for
example, removing all features that have a count of less than say 10
in more than 90% of the samples) because such low-expressed features
tend to reflect noise and correlations based on counts that are mostly
zero aren't really meaningful. The actual thresholds should be based
on experimental design, sequencing depth and sample counts.
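The low-count filter the FAQ describes can be sketched numerically. This is an illustration in Python/NumPy (not WGCNA code); the thresholds of 10 counts and 90% of samples follow the FAQ's example and should be tuned to your own design:

```python
import numpy as np

# Toy count matrix: 200 features (rows) x 20 samples (columns).
rng = np.random.default_rng(0)
counts = rng.negative_binomial(n=2, p=0.1, size=(200, 20))
counts[:50] = rng.poisson(0.2, size=(50, 20))  # 50 mostly-zero features

# Drop features with a count below 10 in more than 90% of samples.
low = (counts < 10).mean(axis=1) > 0.9
filtered = counts[~low]
print(counts.shape, filtered.shape)  # fewer rows after filtering
```

The mostly-zero rows are removed because, as the FAQ notes, correlations computed on near-constant vectors are not meaningful.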
We then recommend a variance-stabilizing transformation. For example,
package DESeq2 implements the function
varianceStabilizingTransformation which we have found useful, but one
could also start with normalized counts (or RPKM/FPKM data) and
log-transform them using log2(x+1). For highly expressed features, the
differences between full variance stabilization and a simple log
transformation are small.
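The simple log2(x + 1) alternative mentioned above is a one-liner. Note this is the fallback transformation, not DESeq2's varianceStabilizingTransformation itself; the input values here are made up for illustration:

```python
import numpy as np

# Normalized counts (or RPKM/FPKM): 2 features x 3 samples, toy values.
norm_counts = np.array([[0.0,  3.0, 1500.0],
                        [10.0, 250.0, 9000.0]])

# log2(x + 1): the +1 keeps zeros defined; for highly expressed features
# log2(x + 1) is nearly log2(x), so it approximates full stabilization.
log_expr = np.log2(norm_counts + 1.0)
print(log_expr.round(2))
```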
Whether one uses RPKM, FPKM, or simply normalized counts doesn't make
a whole lot of difference for WGCNA analysis as long as all samples
were processed the same way. These normalization methods make a big
difference if one wants to compare expression of gene A to expression
of gene B; but WGCNA calculates correlations for which gene-wise
scaling factors make no difference. (Sample-wise scaling factors of
course do, so samples do need to be normalized.)
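The point about gene-wise versus sample-wise scaling can be verified numerically. This small Python/NumPy check (with made-up expression vectors) shows that Pearson correlation ignores a constant factor applied to one gene, but not a factor applied to a subset of samples:

```python
import numpy as np

# Two genes measured across 8 samples (toy data).
gene_a = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
gene_b = np.array([2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0])
r0 = np.corrcoef(gene_a, gene_b)[0, 1]

# Gene-wise scaling (multiply one gene by a constant): r is unchanged.
r_gene_scaled = np.corrcoef(gene_a * 7.3, gene_b)[0, 1]

# Sample-wise scaling (inflate the first 4 samples): r changes,
# which is why samples themselves must be normalized.
skew = np.where(np.arange(8) < 4, 5.0, 1.0)
r_sample_scaled = np.corrcoef(gene_a * skew, gene_b * skew)[0, 1]

print(r0, r_gene_scaled, r_sample_scaled)
```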
If data come from different batches, we recommend to check for batch
effects and, if needed, adjust for them. We use ComBat for batch
effect removal but other methods should also work.
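To make "adjusting for batch" concrete, here is a deliberately simplified location-only adjustment (per-gene mean-centering within each batch) in Python/NumPy. ComBat does considerably more (empirical-Bayes shrinkage of both location and scale), so treat this only as a sketch of the idea:

```python
import numpy as np

# 2 genes x 4 samples; the last two samples carry a +4 batch shift (toy data).
expr = np.array([[5.0, 6.0, 9.0, 10.0],
                 [2.0, 3.0, 6.0,  7.0]])
batch = np.array([0, 0, 1, 1])

adjusted = expr.copy()
for b in np.unique(batch):
    cols = batch == b
    # Remove each batch's per-gene mean...
    adjusted[:, cols] -= adjusted[:, cols].mean(axis=1, keepdims=True)
# ...then restore the overall per-gene mean.
adjusted += expr.mean(axis=1, keepdims=True)
print(adjusted)  # batch means are now equal within each gene
```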
Finally, we usually check quantile scatterplots to make sure there are
no systematic shifts between samples; if sample quantiles show
correlations (which they usually do), quantile normalization can be
used to remove this effect.
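Quantile normalization, mentioned above, forces every sample (column) to share the same distribution. A minimal Python/NumPy sketch (toy matrix; real pipelines would use an R implementation such as limma's normalizeQuantiles):

```python
import numpy as np

# 4 features x 3 samples (toy values).
expr = np.array([[2.0,  4.0, 4.0],
                 [5.0, 14.0, 8.0],
                 [4.0,  8.0, 6.0],
                 [3.0,  8.0, 5.0]])

# Reference distribution: the mean of each sample's sorted values.
order = np.argsort(expr, axis=0)
mean_quantiles = np.sort(expr, axis=0).mean(axis=1)

# Replace each value with the reference value of the same rank.
qn = np.empty_like(expr)
for j in range(expr.shape[1]):
    qn[order[:, j], j] = mean_quantiles
print(qn)  # every column now has an identical set of values
```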
In summary, according to the developer, WGCNA is fine with any properly normalised data because it is fundamentally based on correlation, which it uses (in part) to identify modules. That said, in my experience results will still differ depending on the input distribution (e.g. FPKM versus TMM-normalised counts).
Regarding your specific queries:
The matrix should be FPKM values for each gene and each sample. Is
there any need to normalize these FPKM values across samples before
input? (For example, quantile normalization or TMM normalization)
If you have FPKM values, these are already normalised, and I would use them if that's all you have. You could also consider transforming them to the Z-scale via the zFPKM package before running WGCNA.
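For intuition, here is what a Z-scale transform does. This is not the zFPKM algorithm itself (zFPKM fits a Gaussian to the active-gene peak of each sample's log2(FPKM) density); it is only a crude per-sample z-score of log2(FPKM + 1), in Python/NumPy, to show the general shape of the output:

```python
import numpy as np

# 3 genes x 3 samples of toy FPKM values.
fpkm = np.array([[0.0,   0.5,  1.0],
                 [12.0,  30.0, 25.0],
                 [100.0, 80.0, 150.0]])

log_fpkm = np.log2(fpkm + 1.0)
# Standardize each sample (column): mean 0, standard deviation 1.
z = (log_fpkm - log_fpkm.mean(axis=0)) / log_fpkm.std(axis=0)
print(z.round(2))
```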
Should the matrix include all the genes or only the differentially expressed genes?
That depends on what you are hoping to achieve with WGCNA. Generally, it's conducted as an unsupervised analysis, i.e., nothing is filtered beforehand. If you run WGCNA on a DEG-filtered dataset, you'll need a very good justification (which is perfectly fine... you'll just have to state it).
Should the genes with any zero FPKM values be removed before input?
WGCNA can deal with missing values. However, read what the developers say about this in the FAQ (see point 4).
Should the FPKM values be transformed into log2(FPKM+1) scale before input?
If you want, sure. Log-transforming FPKM data feels uncomfortable to me, though.