Question

How to reduce my dataset in order to plot it?

4.9 years ago

pablo ▴ 300

Hello,

I have a dataset with the correlations between genes and OTUs. I want to plot these correlations with the igraph library in R in order to know which genes are correlated with which OTUs. Then I will extract the different connected components (each component should represent a genome).

My dataset is huge: I can't keep all the correlations (in the range [-1, 1]), since that gives 817,000 × 817,000 values. So I want to apply a threshold: is there a principled way to choose one? For example, is it meaningful to keep only the correlations > 0.9? Doing that still leaves more than 58 million correlations and creates 9152 components.

Another question is whether I should keep only the OTU-gene correlations. Is it still meaningful to keep the OTU-OTU and gene-gene correlations? If I keep only the OTU-gene correlations > 0.8, I am left with more than 1.1 million correlations, which creates only 89 components.
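The two filters described above (a correlation cutoff, then OTU-gene pairs only) can be sketched on a toy edge list. This is only an illustration: the node names, the values, the 0.8 cutoff and the helper is_otu() are made up, and it assumes OTU identifiers start with "OTU".

```r
library(igraph)

# Toy edge list; names and values are hypothetical
edges <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2", "OTU3", "geneA", "OTU4"),
  var2 = c("geneA", "geneB", "geneC", "OTU4", "geneB", "geneD"),
  corr = c(0.92, 0.85, 0.95, 0.99, 0.97, 0.40),
  stringsAsFactors = FALSE
)

threshold <- 0.8  # assumed cutoff, not a recommendation

# Hypothetical helper: TRUE for identifiers that look like OTUs
is_otu <- function(x) grepl("^OTU", x)

# Filter 1: keep only strong correlations
strong <- edges[edges$corr > threshold, ]

# Filter 2: keep only pairs linking exactly one OTU and one gene
otu_gene <- strong[xor(is_otu(strong$var1), is_otu(strong$var2)), ]

# Build undirected graphs and count connected components for each choice
g_all <- graph_from_data_frame(strong, directed = FALSE)
g_og  <- graph_from_data_frame(otu_gene, directed = FALSE)

n_all <- components(g_all)$no
n_og  <- components(g_og)$no
```

Comparing n_all and n_og across several cutoffs is one way to see how sensitive the number of components is to the threshold before committing to a value.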

Thanks

correlation igraph • 1.5k views

ADD COMMENT • link updated 4.7 years ago by Biostar 20 • written 4.9 years ago by pablo ▴ 300

What are you trying to achieve? Even after stringent filtering you will most likely still have too much data for useful visualization. Graph visualizations quickly degenerate into useless hairballs when the number of nodes grows. To get genes correlated with OTUs, you could try a (bi)clustering approach.

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

Actually, I don't really need a useful visualization. I only want the components: I expect each one to be composed of one or more OTUs together with several genes. After that, I will treat these components as genomes.

ADD REPLY • link 4.9 years ago by pablo ▴ 300

So this can be framed as a clustering problem.

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

I see what you mean. Another point concerns the components found by igraph. The first one is much bigger than the others (no matter which data.frame of correlations I import): is it meaningful to set this component aside in order to recluster it?

ADD REPLY • link 4.9 years ago by pablo ▴ 300

Yes. Connected components are the first level of structure in the graph, but within each one you may have weak connections due to noise, so it is common to apply clustering to each connected component separately.

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

Is there an igraph function that applies clustering to connected components separately?

ADD REPLY • link 4.9 years ago by pablo ▴ 300

Just extract the submatrix corresponding to each connected component and use it as input to your clustering algorithm of choice.
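A minimal sketch of that extraction, assuming a small symmetric correlation matrix with named rows and columns (cor_mat, the node names and the 0.8 cutoff are invented for illustration):

```r
library(igraph)

# Toy symmetric correlation matrix; names and values are hypothetical
nodes <- c("OTU1", "geneA", "geneB", "OTU2", "geneC")
cor_mat <- matrix(0.1, 5, 5, dimnames = list(nodes, nodes))
cor_mat["OTU1", c("geneA", "geneB")] <- 0.9
cor_mat[c("geneA", "geneB"), "OTU1"] <- 0.9
cor_mat["OTU2", "geneC"] <- 0.95
cor_mat["geneC", "OTU2"] <- 0.95
diag(cor_mat) <- 1

# Threshold into a 0/1 adjacency matrix (assumed cutoff), no self-loops
adj <- (cor_mat > 0.8) * 1
diag(adj) <- 0
g <- graph_from_adjacency_matrix(adj, mode = "undirected")

# Named vector mapping each node to its connected component id
memb <- components(g)$membership

# One correlation submatrix per connected component
sub_mats <- lapply(split(names(memb), memb),
                   function(v) cor_mat[v, v, drop = FALSE])
```

Each element of sub_mats can then be handed to whichever clustering function is chosen, for example with lapply().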

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

Actually, I extract the giant component with dg <- decompose.graph(g); l <- get.data.frame(dg[[1]]). It returns the data.frame of correlations between the OTUs and genes in the giant component. If I build a graph from this data.frame, does it necessarily give me back the same connected component (the giant component)?
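One way to check this on a small example is to pick the largest component explicitly rather than relying on it being first in the list, rebuild a graph from its edge list, and compare. The sketch below uses made-up data and the names decompose() and as_data_frame(), which recent igraph releases use in place of decompose.graph() and get.data.frame().

```r
library(igraph)

# Toy edge list; names and values are hypothetical
edges <- data.frame(
  var1 = c("OTU1", "OTU1", "OTU2", "OTU3"),
  var2 = c("geneA", "geneB", "geneB", "geneC"),
  corr = c(0.85, 0.90, 0.82, 0.88),
  stringsAsFactors = FALSE
)
g <- graph_from_data_frame(edges, directed = FALSE)

# Split into connected components and select the largest by vertex count
dg <- decompose(g)
sizes <- vapply(dg, vcount, numeric(1))
giant <- dg[[which.max(sizes)]]

# Export the edge list of the giant component and rebuild a graph from it
l <- igraph::as_data_frame(giant, what = "edges")
g2 <- graph_from_data_frame(l, directed = FALSE)

# The rebuilt graph has the same nodes and is a single connected component
same_nodes <- setequal(V(g2)$name, V(giant)$name)
still_connected <- is_connected(g2)
```

Because every node of a connected component with more than one node appears in at least one edge, the edge list alone is enough to recover it in this toy case.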

ADD REPLY • link 4.9 years ago by pablo ▴ 300

I found the package biclust, which could be useful for clustering my giant component. The problem is that my data is a data.frame rather than a matrix. Do you know this package and, if so, whether it would suit my dataset and my analysis?

ADD REPLY • link 4.9 years ago by pablo ▴ 300

myMatrix <- as.matrix(myDataFrame)

I don't know this package, but it looks like a good place to start.

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

The data.frame I extract from the giant component looks like this (only the head is shown):

     var1                var2  corr
1 OTU3978 UniRef90_A0A010P3Z8 0.846
2 OTU4011 UniRef90_A0A010P3Z8 0.855
3 OTU4929 UniRef90_A0A010P3Z8 0.829
4 OTU4317 UniRef90_A0A011P550 0.850
5 OTU4816 UniRef90_A0A011P550 0.807
6 OTU3902 UniRef90_A0A011QPQ2 0.836
7 OTU3339 UniRef90_A0A011RKI6 0.835

I used library(reshape2) to turn it into a matrix: matrix=acast(df, var1~var2, value.var="corr"). The rows correspond to var1 and the columns to var2. Is it meaningful to proceed like this, or should I create a matrix with all the OTUXXX and UniRef90_XXX for both rows and columns?
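For reference, the same long-to-wide reshaping can be sketched in base R on the seven rows shown above. OTU-gene pairs that are absent from the data.frame end up as NA; replacing them with 0 afterwards is an assumption made here for illustration, not something the thread prescribes.

```r
# The head of the data.frame shown above
df <- data.frame(
  var1 = c("OTU3978", "OTU4011", "OTU4929", "OTU4317",
           "OTU4816", "OTU3902", "OTU3339"),
  var2 = c("UniRef90_A0A010P3Z8", "UniRef90_A0A010P3Z8", "UniRef90_A0A010P3Z8",
           "UniRef90_A0A011P550", "UniRef90_A0A011P550",
           "UniRef90_A0A011QPQ2", "UniRef90_A0A011RKI6"),
  corr = c(0.846, 0.855, 0.829, 0.850, 0.807, 0.836, 0.835),
  stringsAsFactors = FALSE
)

# Rows are OTUs (var1), columns are genes (var2)
rows <- sort(unique(df$var1))
cols <- sort(unique(df$var2))
m <- matrix(NA_real_, nrow = length(rows), ncol = length(cols),
            dimnames = list(rows, cols))

# Fill each (OTU, gene) cell by name with its correlation
m[cbind(df$var1, df$var2)] <- df$corr

# Assumption: treat pairs below the threshold (absent here) as 0
m0 <- m
m0[is.na(m0)] <- 0
```

Many clustering functions, k-means among them, reject NA values, so some choice like m0 is usually needed before clustering the rectangular matrix.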

ADD REPLY • link 4.9 years ago by pablo ▴ 300

By the way, when I apply different clustering methods (biclustering, k-means), I get only one cluster, which corresponds to my giant component. I don't know how to apply a "real" clustering to it.

ADD REPLY • link 4.9 years ago by pablo ▴ 300

You have to apply clustering to each connected component separately. You can't get only one cluster with k-means because it will always find the number of clusters provided as input parameter.
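The k-means point can be checked with a small sketch on random data standing in for the rows of an OTU × gene matrix (the data, the seed and the choice of 3 centers are arbitrary):

```r
set.seed(42)

# 20 hypothetical observations with 2 features each
m <- matrix(rnorm(40), nrow = 20)

# Ask k-means for 3 clusters
km <- kmeans(m, centers = 3)

# Number of distinct cluster labels actually assigned
n_clusters <- length(unique(km$cluster))
```

Here n_clusters equals the requested number of centers, so a single-cluster result has to come from somewhere else in the pipeline, such as the input passed to the clustering call.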

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k

That's what I meant. I applied clustering to the matrix that corresponds to my giant component, and I got back only one cluster. Maybe I don't understand your point. I can extract each connected component, build the matrix for each of them and then apply clustering. When you say "You have to apply clustering to each connected component separately", that's what I did: I applied clustering to the giant component only, because I don't want to cluster the others (although I could).

ADD REPLY • link 4.9 years ago by pablo ▴ 300

Yes, this is the matrix you should be using as input. I thought your data was already in this form.

"should I create a matrix with all the OTUXXX and UniRef90_XXX for both rows and columns?"

No, unless you want to treat OTUs and UniRefs as equivalent.

ADD REPLY • link 4.9 years ago by Jean-Karim Heriche 27k