**28k**wrote:

This tutorial will allow you to:

- create graph and tree objects from a data-frame or -matrix of numerical data
- identify community structure in your graph and tree objects
- Perform some basic downstream analyses on the tree objects

For simple testing, we'll use the estrogen microarray data-set in R and follow the tutorial from Rob Gentleman and Wolfgang Huber. This dataset was looking at genes that respond to estrogen stimulation.

**The method can be applied to any type of numerical data, be it microarray, RNA-seq, metabolomics, qPCR, etc.**

# Step 1, Read in, normalise, and identify genes with significant effects

`NB - IF YOU ALREADY HAVE YOUR OWN DATA, PROCEED TO STEP 2`

`[Si ya tiene sus propios datos, se va a Step 2]`

`[Se você já tem seus próprios dados, vá para a Step 2]`

```
library(affy)
library(estrogen)
library(vsn)
library(genefilter)
datadir <- system.file("extdata", package="estrogen")
dir(datadir)
setwd(datadir)
#Read in the phenotype data and the raw expression 'CEL' files
pd <- read.AnnotatedDataFrame("estrogen.txt", header=TRUE, sep="", row.names=1)
a <- ReadAffy(filenames=rownames(pData(pd)), phenoData=pd, verbose=TRUE)
#Normalise the data
x <- expresso(
a,
bg.correct=FALSE,
normalize.method="vsn",
normalize.param=list(subsample=1000),
pmcorrect.method="pmonly",
summary.method="medianpolish"
)
#NB - if the vsn normalisation does not function, use:
#x <- expresso(a, bgcorrect.method="rma", normalize.method="constant", pmcorrect.method="pmonly", summary.method="avgdiff")
#Remove control probes
controlProbeIdx <- grep("^AFFX", rownames(x))
x <- x[-controlProbeIdx,]
#Identify genes of significant effect
lm.coef <- function(y) lm(y ~ estrogen * time.h)$coefficients
eff <- esApply(x, 1, lm.coef)
effectUp <- names(sort(eff[2,], decreasing=TRUE)[1:25])
effectDown <- names(sort(eff[2,], decreasing=FALSE)[1:25])
main.effects <- c(effectUp, effectDown)
#Filter the expression set object to include only genes of significant effect
estrogenMainEffects <- exprs(x)[main.effects,]
```

We now have the top 25 genes showing statistically significant effects in both directions and have filtered our expression object to include only these genes. The object that we'll use for further analysis is *estrogenMainEffects*

```
head(estrogenMainEffects)
low10-1.cel low10-2.cel high10-1.cel high10-2.cel low48-1.cel
910_at 8.869798 8.863362 11.23440 11.12650 8.399458
31798_at 10.217214 10.121984 12.79881 12.33325 10.372707
40117_at 8.916664 8.809936 10.82381 10.72865 8.887119
1884_s_at 8.591428 8.406488 10.16884 10.33448 8.433163
947_at 9.777758 9.813295 11.72156 11.40366 10.106400
38116_at 9.126792 8.957475 11.15331 10.80998 8.152348
low48-2.cel high48-1.cel high48-2.cel
910_at 8.400309 10.65822 10.597463
31798_at 10.112305 13.61006 13.234540
40117_at 8.669938 10.56590 10.000753
1884_s_at 8.446880 10.17888 9.533156
947_at 9.730117 11.86636 11.073517
38116_at 7.870085 10.72301 10.082670
```

# Step 2, Create the graph and tree objects

In this example, we now have our data-frame/-matrix stored as *estrogenMainEffects* and will construct a graph adjacency object from this. *estrogenMainEffects* can be any numerical data-frame/-matrix from you own experiment(s).

```
library(igraph)
#Create a graph adjacency based on correlation distances between genes in pairwise fashion.
g <- graph.adjacency(
as.matrix(as.dist(cor(t(estrogenMainEffects), method="pearson"))),
mode="undirected",
weighted=TRUE,
diag=FALSE
)
#NB - Euclidean distances can also be used instead of correlation, but this changes the tutorial slightly:
#g <- graph.adjacency(
# as.matrix(dist(estrogenMainEffects, method="euclidean")),
# mode="undirected",
# weighted=TRUE,
# diag=FALSE
#)
#Simplfy the adjacency object
g <- simplify(g, remove.multiple=TRUE, remove.loops=TRUE)
#Colour negative correlation edges as blue
E(g)[which(E(g)$weight<0)]$color <- "darkblue"
#Colour positive correlation edges as red
E(g)[which(E(g)$weight>0)]$color <- "darkred"
#Convert edge weights to absolute values
E(g)$weight <- abs(E(g)$weight)
#Change arrow size
#For directed graphs only
#E(g)$arrow.size <- 1.0
#Remove edges below absolute Pearson correlation 0.8
g <- delete_edges(g, E(g)[which(E(g)$weight<0.8)])
#Assign names to the graph vertices (optional)
V(g)$name <- V(g)$name
#Change shape of graph vertices
V(g)$shape <- "sphere"
#Change colour of graph vertices
V(g)$color <- "skyblue"
#Change colour of vertex frames
V(g)$vertex.frame.color <- "white"
#Scale the size of the vertices to be proportional to the level of expression of each gene represented by each vertex
#Multiple scaled vales by a factor of 10
scale01 <- function(x){(x-min(x))/(max(x)-min(x))}
vSizes <- (scale01(apply(estrogenMainEffects, 1, mean)) + 1.0) * 10
#Amplify or decrease the width of the edges
edgeweights <- E(g)$weight * 2.0
#Convert the graph adjacency object into a minimum spanning tree based on Prim's algorithm
mst <- mst(g, algorithm="prim")
#Plot the tree object
plot(
mst,
layout=layout.fruchterman.reingold,
edge.curved=TRUE,
vertex.size=vSizes,
vertex.label.dist=-0.5,
vertex.label.color="black",
asp=FALSE,
vertex.label.cex=0.6,
edge.width=edgeweights,
edge.arrow.mode=0,
main="My first graph"
)
```

Note that layout can be modified to various other values, including:

- layout.fruchterman.reingold
- layout.kamada.kawai
- layout.lgl
- layout.circle
- layout.graphopt

For further information, see Graph layouts

# Step 3, Identify communities in the tree object based on 'edge betweenness'

```
mst.communities <- edge.betweenness.community(mst, directed=FALSE)
mst.clustering <- make_clusters(mst, membership=mst.communities$membership)
V(mst)$color <- mst.communities$membership + 1
par(mfrow=c(1,2))
plot(
mst.clustering, mst,
layout=layout.fruchterman.reingold,
edge.curved=TRUE,
vertex.size=vSizes,
vertex.label.dist=-0.5,
vertex.label.color="black",
asp=FALSE,
vertex.label.cex=0.6,
edge.width=edgeweights,
edge.arrow.mode=0,
main="My first graph"
)
plot(
mst,
layout=layout.fruchterman.reingold,
edge.curved=TRUE,
vertex.size=vSizes,
vertex.label.dist=-0.5,
vertex.label.color="black",
asp=FALSE,
vertex.label.cex=0.6,
edge.width=edgeweights,
edge.arrow.mode=0,
main="My first graph"
)
```

Other community identification methods, including *fastgreedy.community*, can be found at Detecting Community Structure.

# Step 4, Further analyses

```
#Check the vertex degree, i.e., number of connections to each vertex
degree(mst)
#Output information for each community, including vertex-to-community assignments and modularity
commSummary <- data.frame(
mst.communities$names,
mst.communities$membership,
mst.communities$modularity
)
colnames(commSummary) <- c("Gene", "Community", "Modularity")
options(scipen=999)
commSummary
Gene Community Modularity
1 910_at 1 -0.0281615237929416124818
2 31798_at 2 -0.0086268060222194423159
3 40117_at 3 0.0117105668413069411576
4 1884_s_at 4 0.0311655447994580621363
5 38116_at 5 0.0518542468936595488116
6 947_at 6 0.0715756886160198307900
7 41400_at 4 0.0905925646858431743436
....
#Compare community structures using the variance of information (vi) metric (not relevant here)
#Community structures that are identical will have a vi=0
compare(mst.communities, mst.communities, method="vi")
[1] 0
```

**Edit May 16, 2018:**
Other metrics that one can observe include:

- hub score
- closeness centrality
- betweenness centrality

In a case-control study, a useful approach is to generate separate networks for cases and controls and then compare the genes in these based on these metrics.

# Session info (basic)

```
R version 3.2.5 (2016-04-14)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS
```

Suggestion: add

`microarray`

tag? Also include that in the title? Or are you using the microarray data only becuase it is easily available?55kThanks genomax. I've added

microarrayandrna-seq. Technically, the method can be applied to any type of data though. I originally pieced together the pipeline on metabolomics data.28kThanks @ Kevin.

9.0kDear @Kevin Blighe, Hi and thank you. Can I use

igraphfor myTrinityde novo RNA-seq assembly? I have assembled my reads then performed DEG analysis using Trinity. which file I can use as input ofigraph? Thanks again.3.1kDear Farbod,

All that you require is a data-frame or data-matrix of numerical data, preferably binomial distributed, as the method is based on Pearson correlation.

If your data-frame/-matrix is

MyData, then the starting point is:You can also use Euclidean distances instead of correlation, with:

28kThank you

Trinity produces several kind of matrix. for example the first line of "Trans_counts.TMM.EXPR.matrix" is as:

can I use that ?

3.1kIt looks fine.

The code in the tutorial above can be used in very diverse ways and the meaning/significance of the graph will change depending on the data that you provide. If you are supplying it data from Trinity, then I presume that it is

de novotranscriptome assembly transcripts? The graph of these would then show transcripts whose expression is highly positively or inversely correlated.28kYes it is

de novobutPlease tell me what do you mean by "ranscripts whose expression is highly positively or inversely correlated"? are you telling me that the result graph would have some sort of skew or error? or it would be just a great candidate for my data?

3.1kHey Farbod, the tutorial first generates a square correlation matrix of your data. It answers the question: '

Which genes are highly correlated with each other?' The graph object is then created from this correlation matrix. Thus, the edges in your graph will connect genes that are highly positively or inversely correlated. Closely related genes will also be grouped together in the same community.You should probably ask yourself this question: why do you want to perform network analysis on your data and what do you believe it will add to your project?

In the example that I provide in the tutorial, I first filter the example dataset to include

onlygenes that are statistically significantly related to treatment group and time (n=50). The resulting plot, then, shows the potential relationships between just these significant genes.28kThank you Kevin, I got some errors in step1 data normalization

normalization: vsn

PM/MM correction : pmonly

expression values: medianpolish

1.0ktry loading library "vsn" (

`library(vsn)`

)9.0kHi Mike, did you install the

vsnpackage? - Does`library(vsn)`

return an error?If you already have your own data, then you can start from the second step '

Create the graph and tree objects'28kyes I installed vsn and loaded

`library(vsn)`

then I check normalize.AffyBatch.vsn, but I couldnt get it...

normalize.AffyBatch.vsn package:vsn R Documentation

Wrapper for vsn to be used as a normalization method with expresso

Description:

1.0kThe normalisation method is not too important just for the tutorial. Could you try:

28kThank you very much, Kevin

1.0kThen watch out for the control probes step!

Use this instead of the original line:

28kHi again Mike, I now remove those control probes just after normalisation. So, they shouldn't be a problem. I also added a note about VSN possibly not working.

Thanks for your feedback!

28kHi, I have used it on my whole Trinity matrix file just for test and I received the following error:

is it some separator issues?

3.1kYou may have to read the file 'home/trans_counts.TMM.EXPR.matrix' into a data-frame? Something like

28kThank you, there is another "x" and "numeric" issue:

3.1kTry

`as.matrix(as.dist(cor(t(data.matrix(MyData)), method="pearson")))`

28kHi, it seems that it is a problem with huge matrix file so I did it simple and just put 100 lines in the matrix. I want to know by which command I can create the plot? is it just

`plot()`

?(sorry for simple questions)3.1kYes, it will be difficult to produce a network of a large data-matrix because it has to compute a squared distance/correlation matrix. For example, if you tried it on a transcriptome of 20,000 protein-coding genes, you would have to generate a distance matrix of 400,000,000 data points.

You can generate a quick plot with just the base R

`plot()`

function, which will interpret the graph object.`plot.igraph`

should also work. However, between the graph adjacency object and actually plotting the graph, there are many other important steps (as you can see above in the tutorial)28kWhat is the significance of the Modularity column in the final commSummary data frame? To my understanding, modularity is a singular value that speaks to the entire network - why is each node assigned its own value?

0In this context, it relates to the community structure, not the entire graph. Also, based on the fact that I have used edge.betweenness.community() for the purpose of identifying community structure in this tutorial, the modularity score relates to the maximum score achieved for each node when identifying the community structures.

[source: http://igraph.org/r/doc/cluster_edge_betweenness.html]Does that help? - hope so.

28k