Question: What kind of analysis is practically done on GSE data files?
15 months ago, shahdhruv7 wrote:

I have a GSE data file in CSV format containing the fields ID, adj.P.Val, P.Value, t, B, logFC, Gene.symbol, and Gene.title, of which adj.P.Val, P.Value, t, B, and logFC are numeric. What factors do I need to consider if I want to cluster the data on logFC alone using the k-means clustering algorithm? And, first of all, is it feasible to perform clustering on GSE data files? If yes, what should the approach be? If not, what other kinds of analysis can be performed on such datasets?

gene-expression gse • 292 views
modified 15 months ago by Michael Dondrup • written 15 months ago by shahdhruv7

What question are you trying to address with this work? One doesn't just analyse data for the sake of analysing data.

written 15 months ago by Jean-Karim Heriche

Your question is unspecific. If you want to do k-means, please read a tutorial and then ask specific questions. People are typically happy to help debug your code or advise on specific problems, but are reluctant to spoon-feed. So please first invest some effort in getting the background, and then come back with specific questions.

modified 15 months ago • written 15 months ago by ATpoint

15 months ago, Michael Dondrup (Bergen, Norway) wrote:

That looks like the output of a statistical test on RNA-seq, microarray, SWATH, or similar data. You cannot run a meaningful cluster analysis on it because it contains only a single condensed differential expression value per gene. The dataset effectively contains two groups, the "significant" and "non-significant" genes, and these depend on your cutoffs for adj.P.Val (e.g. 0.05) and logFC (e.g. ±1). You can do a few things that are pretty much standard:

  • Get the raw data, pre-process it, and cluster that. This might make sense if there are more than two conditions or samples, perhaps using only the significant genes.
  • Get more contrasts like this one from similar experiments. That means changing your experimental design to accommodate a time series, different stressors, multiple cell lines, you name it.
  • Do an enrichment analysis, e.g. GO enrichment of the significantly differentially expressed genes.
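
As a rough illustration of the cutoff-based split described above, here is a minimal Python sketch using only the standard library. The column names come from the question; the function name `classify_genes`, the default cutoffs, and the sample rows are assumptions made up for this example, and a real file may need extra handling (e.g. missing values).

```python
import csv
import io


def classify_genes(rows, alpha=0.05, min_abs_logfc=1.0):
    """Label each row 'up', 'down', or 'ns' (not significant).

    The cutoffs are the example values from the answer, not universal defaults.
    """
    labels = {}
    for row in rows:
        padj = float(row["adj.P.Val"])
        logfc = float(row["logFC"])
        if padj < alpha and logfc >= min_abs_logfc:
            labels[row["ID"]] = "up"
        elif padj < alpha and logfc <= -min_abs_logfc:
            labels[row["ID"]] = "down"
        else:
            labels[row["ID"]] = "ns"
    return labels


# Made-up rows for illustration only; real input would be the CSV from the question.
SAMPLE_CSV = """ID,adj.P.Val,logFC,Gene.symbol
probe_1,0.001,2.3,GENE_A
probe_2,0.200,1.8,GENE_B
probe_3,0.010,-1.5,GENE_C
probe_4,0.030,0.4,GENE_D
"""

labels = classify_genes(csv.DictReader(io.StringIO(SAMPLE_CSV)))
print(labels)
```

The "up" and "down" IDs are the sort of gene list one would then pass on to an enrichment tool.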

Maybe simply make a heatmap instead of running k-means, because k-means output is not easy to visualize. As others have noted, it is better to think about the experimental design and the experimental question while planning the experiment. If you were simply given that file to toy around with k-means, it is not a good starting point, and you should be able to find a much more suitable multivariate dataset.
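
To see why k-means on logFC alone adds little, consider this small sketch of plain Lloyd's k-means on one-dimensional data, written in pure Python. The function `kmeans_1d`, its deterministic initialisation, and the logFC values are all assumptions for illustration, not taken from any real dataset. In one dimension every cluster is a contiguous interval, so the result amounts to a set of cut points on the logFC axis, much like choosing thresholds by hand.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's algorithm on a list of numbers; returns (centers, labels)."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centers evenly over the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest center.
        labels = [min(range(k), key=lambda c: abs(v - centers[c])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for c in range(k):
            members = [v for v, lab in zip(values, labels) if lab == c]
            new_centers.append(sum(members) / len(members) if members else centers[c])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, labels


# Made-up logFC values, sorted for readability.
logfc = [-2.1, -1.8, -0.2, 0.0, 0.1, 0.3, 1.9, 2.4]
centers, labels = kmeans_1d(logfc, k=3)
print(labels)  # cluster index per value; each cluster covers one stretch of the axis
```

With k = 3 this simply separates strongly negative, near-zero, and strongly positive fold changes, and it does so without using the p-values at all.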

modified 15 months ago • written 15 months ago by Michael Dondrup