Question: How to reduce my dataset in order to plot it?
13 months ago by
vincentpailler140 wrote:

Hello,

I have a dataset of correlations between genes and OTUs. I want to plot these correlations with the igraph library in R in order to see which genes are correlated with which OTU. Then I will extract the different components (each component should represent a genome).

My dataset is very large: I can't keep all the correlations (in the range [-1, 1]), because that would give a huge dataset (817,000 × 817,000 correlations). So I want to select a threshold: is there a good way to choose one? I mean, if I only keep the correlations > 0.9, is it meaningful? That still leaves more than 58 million correlations and creates 9152 components.

Another point: should I only keep the OTU-gene correlations? Is it still meaningful to keep the OTU-OTU and gene-gene correlations? If I only keep the OTU-gene correlations > 0.8, I keep more than 1.1 million correlations. That creates only 89 components.

Thanks

igraph correlation • 420 views
ADD COMMENTlink modified 11 months ago by Biostar ♦♦ 20 • written 13 months ago by vincentpailler140
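
A minimal sketch of the threshold comparison described in the question, assuming the correlations are stored as an edge list with columns var1, var2 and corr (as in the data.frame shown later in the thread). The toy values and the helper name count_components are hypothetical. It only shows how the number of connected components can be tracked across thresholds, with and without OTU-OTU and gene-gene edges; it does not say which threshold is right.

```r
library(igraph)

# Toy edge list standing in for the real correlation table (hypothetical values).
edges <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2", "OTU3", "OTU1", "GeneA"),
  var2 = c("GeneA", "GeneB", "GeneC", "GeneD", "OTU2", "GeneB"),
  corr = c(0.95, 0.85, 0.92, 0.60, 0.88, 0.91)
)

# Hypothetical helper: keep edges above a threshold, optionally OTU-gene only,
# then count the connected components of the resulting graph.
count_components <- function(edges, threshold, otu_gene_only = FALSE) {
  keep <- edges$corr > threshold
  if (otu_gene_only) {
    # An edge is OTU-gene when exactly one endpoint name starts with "OTU".
    is_otu1 <- grepl("^OTU", edges$var1)
    is_otu2 <- grepl("^OTU", edges$var2)
    keep <- keep & (is_otu1 != is_otu2)
  }
  g <- graph_from_data_frame(edges[keep, ], directed = FALSE)
  components(g)$no
}

thresholds <- c(0.7, 0.8, 0.9)
n_all <- sapply(thresholds, count_components, edges = edges)
n_bip <- sapply(thresholds, count_components, edges = edges, otu_gene_only = TRUE)
print(data.frame(threshold = thresholds, all_edges = n_all, otu_gene_only = n_bip))
```

Note that graph_from_data_frame only creates vertices that appear in at least one retained edge, so nodes left isolated by the filter are not counted as components.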

What are you trying to achieve? Even after stringent filtering you will most likely still have too much data for useful visualization. Graph visualizations quickly degenerate into useless hairballs when the number of nodes grows. To get genes correlated with OTUs, you could try a (bi)clustering approach.

ADD REPLYlink written 13 months ago by Jean-Karim Heriche22k

Actually, I don't really need a useful visualization. I only want to get the components: I suppose each of them will be composed of one or more OTUs together with several genes. After that, I will consider these components as genomes.

ADD REPLYlink written 13 months ago by vincentpailler140

So this can be framed as a clustering problem.

ADD REPLYlink written 13 months ago by Jean-Karim Heriche22k

I know what you mean. Another point concerns the components found by igraph. The first one is much bigger than the others (no matter which data.frame of correlations I import): is it meaningful to set this component aside in order to recluster it?

ADD REPLYlink written 13 months ago by vincentpailler140

Yes. Connected components are the first level of structure in the graph but in each one you may have weak connections due to noise so it is common to apply clustering to connected components separately.

ADD REPLYlink written 13 months ago by Jean-Karim Heriche22k

Is there an igraph function that applies clustering to connected components separately?

ADD REPLYlink written 13 months ago by vincentpailler140

Just extract the submatrix corresponding to each connected component and use it as input to your clustering algorithm of choice.

ADD REPLYlink written 13 months ago by Jean-Karim Heriche22k
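
A minimal sketch of clustering each connected component separately, as suggested above. It assumes the same var1/var2/corr edge-list layout, uses hypothetical toy values, and picks cluster_fast_greedy purely as an example; the thread does not name a specific algorithm, and any method that accepts a graph or its matrix could be swapped in.

```r
library(igraph)

# Toy data: two tight groups joined by one weaker edge, plus a separate pair.
edges <- data.frame(
  var1 = c("OTU1", "OTU1", "GeneA", "OTU2", "OTU2", "GeneC", "GeneB", "OTU3"),
  var2 = c("GeneA", "GeneB", "GeneB", "GeneC", "GeneD", "GeneD", "OTU2", "GeneE"),
  corr = c(0.95, 0.95, 0.95, 0.95, 0.95, 0.95, 0.81, 0.90)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# One subgraph per connected component; components with fewer than
# 3 vertices are skipped because there is nothing to split in them.
parts <- decompose(g, min.vertices = 3)

# Cluster each remaining component on its own, using the correlation as edge weight.
memberships <- lapply(parts, function(comp) {
  membership(cluster_fast_greedy(comp, weights = E(comp)$corr))
})

print(sapply(parts, vcount))  # size of each component that was clustered
print(memberships)            # cluster label of every vertex, per component
```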

Actually, I extract the giant component with dg <- decompose.graph(g); l <- get.data.frame(dg[[1]]). It returns the data.frame with the correlations between my OTUs and genes corresponding to the giant component. If I build a graph from this data.frame, will it necessarily give me back the same connected component (the giant component)?

ADD REPLYlink modified 13 months ago • written 13 months ago by vincentpailler140
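
A small check of the round trip asked about above, on hypothetical toy data. The edge list of a connected component contains every vertex of that component, so rebuilding a graph from it should give a single connected graph with the same vertices and edges. The sketch picks the largest component by size instead of assuming it is the first element returned by decompose.graph, since that ordering is not guaranteed to be by size.

```r
library(igraph)

edges <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2", "OTU3"),
  var2 = c("GeneA", "GeneB", "GeneB", "GeneC"),
  corr = c(0.90, 0.85, 0.88, 0.95)
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Select the largest connected component explicitly.
comp <- components(g)
giant_id <- which.max(comp$csize)
giant <- induced_subgraph(g, which(comp$membership == giant_id))

# Export its edges and rebuild a graph from that data.frame.
l <- igraph::as_data_frame(giant, what = "edges")
g2 <- graph_from_data_frame(l, directed = FALSE)

print(is_connected(g2))                          # rebuilt graph is one component
print(setequal(V(g2)$name, V(giant)$name))       # same vertex set
print(ecount(g2) == ecount(giant))               # same number of edges
```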

I found the package biclust, which could be useful for clustering my giant component. The problem is that my data is a data.frame rather than a matrix. Do you know this package and, if so, would it be suitable for my dataset and my analysis?

ADD REPLYlink written 13 months ago by vincentpailler140
myMatrix <- as.matrix(myDataFrame)

I don't know this package, but it looks like a good place to start.

ADD REPLYlink written 13 months ago by Jean-Karim Heriche22k

The dataframe I extract from the giant component looks like this (it is only the head) :

     var1                var2  corr
1 OTU3978 UniRef90_A0A010P3Z8 0.846
2 OTU4011 UniRef90_A0A010P3Z8 0.855
3 OTU4929 UniRef90_A0A010P3Z8 0.829
4 OTU4317 UniRef90_A0A011P550 0.850
5 OTU4816 UniRef90_A0A011P550 0.807
6 OTU3902 UniRef90_A0A011QPQ2 0.836
7 OTU3339 UniRef90_A0A011RKI6 0.835

I used library(reshape2) to put it in matrix form: matrix = acast(df, var1 ~ var2, value.var = "corr"). The rows correspond to var1 and the columns to var2. Is it meaningful to proceed like this, or should I create a matrix with all the OTUXXX and UniRef90_XXX for both rows and columns?

ADD REPLYlink modified 13 months ago • written 13 months ago by vincentpailler140
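
For illustration, a sketch of the reshaping step above applied to the seven rows shown in the post. The object names df and m are placeholders. Pairs that are absent from the edge list come out as NA by default; whether to replace them (for example through acast's fill argument) depends on what the downstream clustering method accepts, which the thread does not settle.

```r
library(reshape2)

# The head of the data.frame shown in the post.
df <- data.frame(
  var1 = c("OTU3978", "OTU4011", "OTU4929", "OTU4317",
           "OTU4816", "OTU3902", "OTU3339"),
  var2 = c("UniRef90_A0A010P3Z8", "UniRef90_A0A010P3Z8", "UniRef90_A0A010P3Z8",
           "UniRef90_A0A011P550", "UniRef90_A0A011P550",
           "UniRef90_A0A011QPQ2", "UniRef90_A0A011RKI6"),
  corr = c(0.846, 0.855, 0.829, 0.850, 0.807, 0.836, 0.835)
)

# Rows are OTUs (var1), columns are UniRef90 clusters (var2).
m <- acast(df, var1 ~ var2, value.var = "corr")

print(dim(m))
print(m["OTU3978", "UniRef90_A0A010P3Z8"])
print(sum(is.na(m)))  # number of OTU/gene pairs with no retained correlation
```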

By the way, when I apply different clustering methods (biclustering, k-means), I get only one cluster, which corresponds to my giant component. I don't know how to apply a "real" clustering to it.

ADD REPLYlink written 13 months ago by vincentpailler140

You have to apply clustering to each connected component separately. You can't get only one cluster with k-means because it will always find the number of clusters provided as input parameter.

ADD REPLYlink written 13 months ago by Jean-Karim Heriche22k
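
A tiny illustration of the k-means point above, using base R's kmeans on made-up two-dimensional data (not the thread's correlation matrix). The number of clusters is the centers argument, so the result contains that many clusters whether or not the data really has that structure.

```r
set.seed(1)

# Made-up data: two well-separated groups of 10 points each.
m <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.1), ncol = 2),
  matrix(rnorm(20, mean = 3, sd = 0.1), ncol = 2)
)

# centers = 2 asks for exactly two clusters; nstart repeats the random start.
km <- kmeans(m, centers = 2, nstart = 10)

print(table(km$cluster))  # number of points assigned to each cluster
```

Because of this, getting back a single cluster from k-means usually means something else happened upstream, for example looking at the connected-component membership rather than at the k-means assignment.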

That's what I meant. I applied clustering to the matrix that corresponds to my giant component, and I get back only one cluster. Maybe I don't understand your point. I can extract each connected component, build the matrix for each of them and then apply clustering. When you say "You have to apply clustering to each connected component separately", that's what I did: I applied clustering to the giant component only, because I don't want to cluster the others (but I could do it).

ADD REPLYlink written 13 months ago by vincentpailler140

Yes this is the matrix you should be using as input. I thought your data was already like this.

should I create a matrix with all the OTUXXX and UniRef90_XXX for both rows and columns?

No, unless you want to consider OTUs and UniRefs as equivalent.

ADD REPLYlink modified 13 months ago • written 13 months ago by Jean-Karim Heriche22k