Question

How to construct Co-expression network between Coding and lncRNAs?

1

Entering edit mode

5.8 years ago

Biologist ▴ 290

Hi,

I have RNA-seq data. I performed differential analysis between Tumor and Normal samples for coding and lncRNAs separately. I have 1300 differentially expressed coding genes and 140 differentially expressed lncRNAs after differential analysis with DESeq2.

I want to identify the function of lncRNAs and for that I want to construct Co-expression network between coding and lncRNAs.

I have seen some packages like WGCNA and Cemitool for Co-expression network. But I have few questions.

1) For Co-expression network Should I use all the coding and lncRNAs as input (OR) only differentially expressed coding and lncRNAs?

2) Input should be counts data or normalised counts?

Any help is appreciated.

RNA-Seq correlation lncrna co-expression network r • 3.7k views

ADD COMMENT • link updated 3.0 years ago by Kevin Blighe 87k • written 5.8 years ago by Biologist ▴ 290

score 4 · Answer 1 · 2018-07-24

4

Entering edit mode

5.8 years ago

Kevin Blighe 87k

Hey,

You could also look here: Network plot from expression data in R using igraph

1) For Co-expression network Should I use all the coding and lncRNAs as input (OR) only differentially expressed coding and lncRNAs?

You can justify either. If you use all genes, then it's an unbiased / unsupervised approach; if you use the differentially expressed genes, then it's biased / supervised (biased in the sense that your assumption would be that the genes you're using are genuinely differentially expressed - the meaning / interpretation of the network also changes, in this case). You could also do the following: generate separate unsupervised networks for cases and then controls, and then compare these networks.

Fundamentally, what one should keep in mind is that most network analysis approaches are based on correlation. They are by no means reflective of biological pathways, although one could technically 'guide' the construction of a co-expression network by using an already-identified biological pathway.

2) Input should be counts data or normalised counts?

You should use logged [normalised] counts or Z-scores, i.e., preferably a data distribution that generally follows the normal distribution. The standard differential expression analysis tools provide means for transforming data to use downstream: EdgeR, log CPM+1; DESeq2, VST or rlog. If coming from RPKM or FPKM, I would transform these via zFPKM R package.

For further reading: Gene co-expression analysis for functional classification and gene-disease predictions.

Kevin

ADD COMMENT • link 3.0 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you Kevin. That Research paper is so helpful. I will use the whole data with Z-score for Co-expression analysis. Can I use "Network plot from expression data in R using igraph" which you gave above?

Just now I saw your answer here [How filter genes to construct co-expression network?] You said for WGCNA, normalised data is fine. No need of any logged or Z-scale.

ADD REPLY • link 5.8 years ago by Biologist ▴ 290

0

Entering edit mode

Yes, it is preferable to use data that follows the normal distribution, i.e., logged or Z-scaled. However, if you cannot obtain that, then normalised RNA-seq counts can be used. You will get different results if you compare the same network that was generated from normalsed counts and logged counts, though.

The default correlation metric, Pearson correlation, is a parametric test that expects data to follow the normal distribution.

ADD REPLY • link 5.3 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi Kevin,

A small question. I have RNA-seq data with 17k genes and 421 samples (371 tumor and 50 normals). Among those 17k genes I'm interested in gene FLT3. I would like to check the genes co-expressed with FLT3. and also the pathways enrichment.

So, using the logCPM expression just doing the correlation analysis between FLT3 and other genes, select genes with correlation co-efficient > 0.5 and doing pathway analysis with those genes. Is this the right way for that?

Should I include normal samples for the correlation or should I use only tumors?

ADD REPLY • link 5.5 years ago by Biologist ▴ 290

1

Entering edit mode

Hey, well, if you do this:

using the logCPM expression just doing the correlation analysis between FLT3 and other genes, select genes with correlation co-efficient > 0.5 and doing pathway analysis with those genes.

...that is not quite network analysis as most think. It's simply correlating your gene of interest to everything else and then doing pathway analysis on the genes that are highly positively/negatively correlated. I'm not sure what it would show. If you do this, then you should at least only pick those correlations that have a statistically significant p-value from cor.test()

Note that if you do a 'proper' network analysis (for example: WGCNA or Network plot from expression data in R using igraph ), then, in the case of WGCNA, you can identify the module in which FLT3 is located and perform pathway analysis on those genes, or, in the case of the igraph tutorial, you can identify the 1st neighbours of FLT3 and/or the community structure in which it's located, and also do pathway analysis on those.

I would do the networks for tumour and normal separately and then cross-compare results. Should be interesting.

ADD REPLY • link 4.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Yes, what you said was right. I came across an R package cemitool [https://bioconductor.org/packages/release/bioc/vignettes/CEMiTool/inst/doc/CEMiTool.html] in which it does filtering for the input before classifying them into modules. My input is 17k genes. And based on filtering step there are only 1600 genes for the analysis. With that filtering step I didn't see my interested gene FLT3 in those 1600 genes. It was filtered out.

PS: Cemitool has some advantages compared to WGCNA

ADD REPLY • link 5.5 years ago by Biologist ▴ 290

0

Entering edit mode

Thanks - that looks pretty good, actually (CEMiTool). I see that it's from a group in Brazil, too. I speak fluent Portuguese.

ADD REPLY • link 4.7 years ago by Kevin Blighe 87k

0

Entering edit mode

So, you are telling me that using yours is fine?

1) If yes, please tell me one thing. In this [Network plot from expression data in R using igraph] I see you calculated significant effects first and then based on that you took only those genes for further analysis.

I have logCPM expression data for 17k genes in 371 tumor samples, which is after filtering out all the genes with 0 counts. With this 17k genes logCPM expression data can I go directly to step 2 or do I need to follow anything else in step 1?

2) If I use interested gene vs other genes and do correlation, you said me to consider p.value right. I have used following function to get the correlation coefficient between the genes.

cor <- apply(logCPM,1,function(x){cor(x,logCPM[i,], method = "spearman")})

with the above function I see only coefficient values but how to get p.values?

ADD REPLY • link 5.5 years ago by Biologist ▴ 290

1

Entering edit mode

The tutorial that I made is mainly for small datasets. It is not optimised for very large correlation matrices, like yours with 17k genes. WGCNA works with very large data matrices, though. In network analysis, some people use only differentially expressed genes, while others use the entire dataset... the interpretation in each case has to be different.

For your second part, to obtain the p-value associated with the correlation test, just use cor.test() in place of cor()

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Thank you. Yes, I used that everything is stored in a list. Any idea how I can make it to a dataframe with all genes in one column and values in one column etc....

ADD REPLY • link 5.5 years ago by Biologist ▴ 290

0

Entering edit mode

A data-frame is technically a list. you may have to paste a sample of the data that you have so that I can see what to do.

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Lets say I have data df like below:

                       Sample1            Sample2           Sample3          Sample4          Sample5
Gene1        17.146506        16.822596        16.781746        16.932891        16.263722
Gene2          6.782761         7.941372         8.520003         8.241359         7.797734
Gene3        16.279996        16.663848        15.908999        14.737590        15.665799
Gene4        15.347626        15.454124        15.211375        15.686339        16.339990
Gene5        15.546598        15.720200        15.331334        15.262918        15.766690

I used the function like below:

cor <- apply(logCPM,1,function(x){cor.test(x,logCPM["Gene2",], method = "spearman")})

cor is seen as a list. Instead I want to see the result in dataframe with columns Genes, pvalues, correlation coefficient etc.....

ADD REPLY • link 5.5 years ago by Biologist ▴ 290

1

Entering edit mode

Oh, I see. You should try this:

logCPM
        Sample1   Sample2   Sample3   Sample4   Sample5
Gene1 17.146506 16.822596 16.781746 16.932891 16.263722
Gene2  6.782761  7.941372  8.520003  8.241359  7.797734
Gene3 16.279996 16.663848 15.908999 14.737590 15.665799
Gene4 15.347626 15.454124 15.211375 15.686339 16.339990
Gene5 15.546598 15.720200 15.331334 15.262918 15.766690

apply(logCPM, 1, function(x) cor.test(x, as.numeric(logCPM["Gene2",]), method="spearman")$p.value)
     Gene1      Gene2      Gene3      Gene4      Gene5 
0.68333333 0.01666667 0.68333333 0.68333333 0.35000000

apply(logCPM, 1, function(x) cor(x, as.numeric(logCPM["Gene2",]), method="spearman"))
Gene1 Gene2 Gene3 Gene4 Gene5 
 -0.3   1.0  -0.3  -0.3  -0.6

The outputs here are named lists, i.e., key-value pairs. You can easily cbind them together into a data-frame of results.

ADD REPLY • link 5.5 years ago by Kevin Blighe 87k

1

Entering edit mode

Thank you very much Kevin

ADD REPLY • link 5.5 years ago by Biologist ▴ 290

0

Entering edit mode

Hi Kevin,

Have a small question. From the above comment as you said

in the case of WGCNA, you can identify the module in which FLT3 is located and perform pathway analysis on those genes,

Lets say FLT3 is identified in the module M4 and there are approx 5000 genes in M4. Does this mean that FLT3 is co-expressed with all the genes in M4 and not with other genes in other modules?

ADD REPLY • link 5.4 years ago by Biologist ▴ 290

1

Entering edit mode

Yes, well, 'co-expressed' based on the method used to define co-expression, which is [here] WGCNA. Remember that WGCNA is fundamentally based on correlation. Also, with WGCNA and other network analysis programs, small changes in the parameters can result in large changes at the end. So, one has to really be careful with the end interpretation.

With your 5000 genes, I would perform GO enrichment and KEGG pathway analysis to see to what they may relate.

ADD REPLY • link 5.4 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks Kevin. Small help. from the Module M4 I want to make a small co-expression network of FLT3 with some important genes I'm interested in. How can I do that?

ADD REPLY • link 5.4 years ago by Biologist ▴ 290