I have a GSE data file in CSV format with the fields ID, adj.P.Val, P.Value, t, B, logFC, Gene.symbol, and Gene.title, of which adj.P.Val, P.Value, t, B, and logFC are numeric. What factors do I need to consider if I want to cluster the data on logFC alone using the k-means algorithm? First of all, is it feasible to perform clustering on GSE data files at all? If yes, what should the approach be? If not, what other kinds of analysis can be performed on such datasets?
That looks like the output of a statistical test on RNA-seq, microarray, SWATH, or similar data. You cannot run a meaningful cluster analysis on it, because it contains only a single condensed differential expression value (logFC) per gene or probe. In practice the dataset splits into two groups, the "significant" and the "non-significant" genes, and which gene lands in which group depends on your cutoffs for adj.P.Val (e.g. 0.05) and logFC (e.g. ±1). There are a few things you can do that are pretty much standard:
- Get the raw data, pre-process it, and cluster that. This can make sense if there are more than two conditions or samples, possibly using only the significant genes.
- Get more contrasts like this one from similar experiments. That means changing your experimental design to include a time series, different stressors, multiple cell lines, you name it.
- Do an enrichment analysis, e.g. GO enrichment of the significantly differentially expressed genes.
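The significant/non-significant split described above, which is also the usual input to an enrichment analysis, can be sketched in a few lines of Python. This is only a minimal illustration: the rows in `SAMPLE_CSV`, the gene names, and the helper `split_by_significance` are made up, the cutoffs are just the example values 0.05 and ±1, and a real file may contain missing values that would need handling first.

```python
import csv
import io

# Hypothetical rows using the column layout described in the question.
SAMPLE_CSV = """ID,adj.P.Val,P.Value,t,B,logFC,Gene.symbol,Gene.title
probe_1,0.001,0.00001,8.2,5.1,2.3,GENE_A,hypothetical gene A
probe_2,0.20,0.05,2.1,-3.0,1.5,GENE_B,hypothetical gene B
probe_3,0.01,0.0004,-6.0,2.2,-1.8,GENE_C,hypothetical gene C
probe_4,0.03,0.002,4.0,0.5,0.4,GENE_D,hypothetical gene D
"""


def split_by_significance(handle, p_cutoff=0.05, lfc_cutoff=1.0):
    """Split gene symbols into up-regulated, down-regulated, and the rest.

    A gene counts as significant only if it passes BOTH cutoffs:
    adj.P.Val below p_cutoff and |logFC| at or above lfc_cutoff.
    """
    up, down, rest = [], [], []
    for row in csv.DictReader(handle):
        padj = float(row["adj.P.Val"])
        lfc = float(row["logFC"])
        symbol = row["Gene.symbol"]
        if padj < p_cutoff and lfc >= lfc_cutoff:
            up.append(symbol)
        elif padj < p_cutoff and lfc <= -lfc_cutoff:
            down.append(symbol)
        else:
            rest.append(symbol)
    return up, down, rest


# For a real file, pass open("your_file.csv", newline="") instead.
up, down, rest = split_by_significance(io.StringIO(SAMPLE_CSV))
print("up:", up)
print("down:", down)
print("not significant:", rest)
```

The `up` and `down` lists are what you would hand to a GO enrichment tool. Changing either cutoff moves genes between the groups, which is why the grouping is a choice rather than something found in the data.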
Maybe simply make a heatmap instead of running k-means, because k-means output is not great to visualize. As others have noted, it is better to think about the experimental design and the experimental question while planning the experiment. If you were simply given that file to toy around with k-means, it is not a good starting point, and you should be able to find a much more suitable multivariate dataset.
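To see why k-means on logFC alone adds little: on a single variable, every point goes to its nearest center, so each cluster is just a contiguous interval of values. The result amounts to binning logFC, much like picking cutoffs by hand. Below is a minimal sketch in pure Python. The function `kmeans_1d`, its deterministic initialisation, and the logFC values are made up for illustration, and this is not how any particular library implements k-means.

```python
def kmeans_1d(values, k, iterations=100):
    """Plain Lloyd's algorithm on a list of floats.

    Returns (centers, labels), where labels[i] is the cluster of values[i].
    """
    ordered = sorted(values)
    # Deterministic start: spread the initial centers across the sorted values.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iterations):
        # Assignment step: each value goes to its nearest center.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return centers, labels


# Made-up logFC values: some down-regulated, some unchanged, some up-regulated.
logfc = [-2.4, -2.1, -1.9, -0.3, -0.1, 0.0, 0.2, 0.3, 1.8, 2.0, 2.5]
centers, labels = kmeans_1d(logfc, k=3)

# Report each cluster as the interval of logFC values it covers.
for j in range(3):
    members = [v for v, lab in zip(logfc, labels) if lab == j]
    print(f"cluster {j}: {min(members)} to {max(members)} ({len(members)} genes)")
```

The three clusters come out as "down", "unchanged", and "up", which is the same grouping a logFC cutoff would give you, without taking the adjusted p-value into account at all.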