Hey,
You could also look here: Network plot from expression data in R using igraph
1) For Co-expression network Should I use all the coding and lncRNAs
as input (OR) only differentially expressed coding and lncRNAs?
You can justify either. If you use all genes, then it's an unbiased / unsupervised approach; if you use the differentially expressed genes, then it's biased / supervised (biased in the sense that your assumption would be that the genes you're using are genuinely differentially expressed - the meaning / interpretation of the network also changes, in this case). You could also do the following: generate separate unsupervised networks for cases and then controls, and then compare these networks.
Fundamentally, what one should keep in mind is that most network analysis approaches are based on correlation. They are by no means reflective of biological pathways, although one could technically 'guide' the construction of a co-expression network by using an already-identified biological pathway.
2) Input should be counts data or normalised counts?
You should use logged [normalised] counts or Z-scores, i.e., preferably a data distribution that generally follows the normal distribution. The standard differential expression analysis tools provide means for transforming data to use downstream: EdgeR, log CPM+1
; DESeq2, VST
or rlog
. If coming from RPKM or FPKM, I would transform these via zFPKM R package.
For further reading: Gene co-expression analysis for functional classification and gene-disease predictions.
Kevin
Thank you Kevin. That Research paper is so helpful. I will use the whole data with Z-score for Co-expression analysis. Can I use "Network plot from expression data in R using igraph" which you gave above?
Just now I saw your answer here [How filter genes to construct co-expression network?] You said for WGCNA, normalised data is fine. No need of any logged or Z-scale.
Yes, it is preferable to use data that follows the normal distribution, i.e., logged or Z-scaled. However, if you cannot obtain that, then normalised RNA-seq counts can be used. You will get different results if you compare the same network that was generated from normalsed counts and logged counts, though.
The default correlation metric, Pearson correlation, is a parametric test that expects data to follow the normal distribution.
Hi Kevin,
A small question. I have RNA-seq data with 17k genes and 421 samples (371 tumor and 50 normals). Among those 17k genes I'm interested in gene FLT3. I would like to check the genes co-expressed with FLT3. and also the pathways enrichment.
So, using the logCPM expression just doing the correlation analysis between FLT3 and other genes, select genes with correlation co-efficient > 0.5 and doing pathway analysis with those genes. Is this the right way for that?
Should I include normal samples for the correlation or should I use only tumors?
Hey, well, if you do this:
...that is not quite network analysis as most think. It's simply correlating your gene of interest to everything else and then doing pathway analysis on the genes that are highly positively/negatively correlated. I'm not sure what it would show. If you do this, then you should at least only pick those correlations that have a statistically significant p-value from
cor.test()
Note that if you do a 'proper' network analysis (for example: WGCNA or Network plot from expression data in R using igraph ), then, in the case of WGCNA, you can identify the module in which FLT3 is located and perform pathway analysis on those genes, or, in the case of the igraph tutorial, you can identify the 1st neighbours of FLT3 and/or the community structure in which it's located, and also do pathway analysis on those.
I would do the networks for tumour and normal separately and then cross-compare results. Should be interesting.
Yes, what you said was right. I came across an R package cemitool [https://bioconductor.org/packages/release/bioc/vignettes/CEMiTool/inst/doc/CEMiTool.html] in which it does filtering for the input before classifying them into modules. My input is 17k genes. And based on filtering step there are only 1600 genes for the analysis. With that filtering step I didn't see my interested gene FLT3 in those 1600 genes. It was filtered out.
PS: Cemitool has some advantages compared to WGCNA
Thanks - that looks pretty good, actually (CEMiTool). I see that it's from a group in Brazil, too. I speak fluent Portuguese.
So, you are telling me that using yours is fine?
1) If yes, please tell me one thing. In this [Network plot from expression data in R using igraph] I see you calculated significant effects first and then based on that you took only those genes for further analysis.
I have logCPM expression data for 17k genes in 371 tumor samples, which is after filtering out all the genes with 0 counts. With this 17k genes logCPM expression data can I go directly to step 2 or do I need to follow anything else in step 1?
2) If I use interested gene vs other genes and do correlation, you said me to consider p.value right. I have used following function to get the correlation coefficient between the genes.
with the above function I see only coefficient values but how to get p.values?
The tutorial that I made is mainly for small datasets. It is not optimised for very large correlation matrices, like yours with 17k genes. WGCNA works with very large data matrices, though. In network analysis, some people use only differentially expressed genes, while others use the entire dataset... the interpretation in each case has to be different.
For your second part, to obtain the p-value associated with the correlation test, just use
cor.test()
in place ofcor()
Thank you. Yes, I used that everything is stored in a list. Any idea how I can make it to a dataframe with all genes in one column and values in one column etc....
A data-frame is technically a list. you may have to paste a sample of the data that you have so that I can see what to do.
Lets say I have data df like below:
I used the function like below:
cor is seen as a list. Instead I want to see the result in dataframe with columns Genes, pvalues, correlation coefficient etc.....
Oh, I see. You should try this:
The outputs here are named lists, i.e., key-value pairs. You can easily cbind them together into a data-frame of results.
Thank you very much Kevin
Hi Kevin,
Have a small question. From the above comment as you said
Lets say FLT3 is identified in the module M4 and there are approx 5000 genes in M4. Does this mean that FLT3 is co-expressed with all the genes in M4 and not with other genes in other modules?
Yes, well, 'co-expressed' based on the method used to define co-expression, which is [here] WGCNA. Remember that WGCNA is fundamentally based on correlation. Also, with WGCNA and other network analysis programs, small changes in the parameters can result in large changes at the end. So, one has to really be careful with the end interpretation.
With your 5000 genes, I would perform GO enrichment and KEGG pathway analysis to see to what they may relate.
Thanks Kevin. Small help. from the Module M4 I want to make a small co-expression network of FLT3 with some important genes I'm interested in. How can I do that?