9.4 years ago by
European Union
Probe sets change between versions of every platform; over time, newer versions improve their genome coverage. In your analysis, keep every probe that passes your quality-control thresholds.
You will face two issues: i) how to get comparable datasets, and ii) how to analyze them.
i) GEO is a gold mine, but pooling different expression datasets is risky in terms of bias. I was impressed by a presentation of S. Bicciato's work on GEO data, and I suggest you look at "Novel definition files for human GeneChips based on GeneAnnot" [PMID:18005434] and "Strategies for comparing gene expression profiles from different microarray platforms: application to a case-control experiment" [PMID:16624241] as a starting point if you want to go this way. Otherwise you could turn to a meta-analysis approach, which avoids the bias of merging data, but I imagine this could open a new thread in the blog...
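To make the meta-analysis idea concrete: instead of merging raw data, you estimate the effect for a gene within each dataset separately and then combine the estimates. Below is a minimal sketch of one common way to do that, fixed-effect inverse-variance pooling. It is only an illustration, not the method of the papers cited above; the function name and the numbers are made up, and the effects are assumed to be log fold changes with their standard errors.

```python
from math import sqrt


def fixed_effect(estimates):
    """Pool (effect, standard_error) pairs with inverse-variance weights."""
    # Precise estimates (small standard error) get a larger weight.
    weights = [1.0 / se ** 2 for _, se in estimates]
    total = sum(weights)
    pooled = sum(w * effect for w, (effect, _) in zip(weights, estimates)) / total
    pooled_se = sqrt(1.0 / total)
    return pooled, pooled_se


# Hypothetical log fold changes for one gene in three independent datasets.
per_dataset = [(1.2, 0.4), (0.8, 0.2), (1.0, 0.5)]
pooled, pooled_se = fixed_effect(per_dataset)
print(f"pooled effect = {pooled:.3f}, standard error = {pooled_se:.3f}")
```

A fixed-effect model assumes all datasets measure the same underlying effect; if the studies are heterogeneous, a random-effects model is usually the safer choice.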
ii) If your analysis is tied to a status variable (e.g. treated vs untreated), it is better to test for differentially expressed probes/genes (see the good documentation that comes with the limma package); you can then present your results as a heatmap (this step involves correlations).
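As a rough illustration of what a per-probe test for differential expression looks like, here is a small sketch that ranks probes by a Welch t-statistic. Note that this is NOT what limma does (limma fits linear models and uses moderated t-statistics); it is a plain stand-in, and the probe names and values are invented.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch t-statistic for two groups with possibly unequal variances."""
    standard_error = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / standard_error


# Hypothetical log2 expression values, three replicates per group.
expression = {
    "probe1": {"treated": [8.1, 8.4, 7.9], "untreated": [5.0, 5.3, 4.8]},
    "probe2": {"treated": [6.0, 6.2, 5.9], "untreated": [6.1, 5.9, 6.0]},
}

t_stats = {
    probe: welch_t(groups["treated"], groups["untreated"])
    for probe, groups in expression.items()
}

# Rank probes by the absolute size of the statistic, strongest first.
ranked = sorted(t_stats, key=lambda probe: abs(t_stats[probe]), reverse=True)
for probe in ranked:
    print(probe, round(t_stats[probe], 2))
```

With thousands of probes you would also need p-values and a multiple-testing correction, which limma handles for you.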
If you just want to build a gene profile across your samples, cor can measure the "distance" between each pair of probes/genes, but you then need to cluster the results in some way. Alternatively, the whole job can be done with a pre-packaged implementation of k-means, hierarchical clustering, or one of the many other clustering methods.
What is the biological question you want to answer? I assume that you want to find coexpressed genes for some reason and in some biological context. What are the reason and the biological context?
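To show how the two steps fit together, here is a minimal sketch in plain Python: Pearson correlation between gene profiles, turned into a distance (1 - r), followed by a naive average-linkage hierarchical clustering. In R the same idea is roughly cor followed by hclust; the function names and the toy data below are made up for illustration, and a real analysis should use a tested library implementation.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_distances(profiles):
    """Map each unordered pair of gene names to the distance 1 - r."""
    return {
        frozenset((g, h)): 1.0 - pearson(profiles[g], profiles[h])
        for g, h in combinations(profiles, 2)
    }


def average_linkage(profiles, n_clusters):
    """Merge the two closest clusters until n_clusters remain."""
    dist = correlation_distances(profiles)
    clusters = [[gene] for gene in profiles]

    def linkage(a, b):
        # Mean pairwise distance between the members of two clusters.
        total = sum(dist[frozenset((g, h))] for g in a for h in b)
        return total / (len(a) * len(b))

    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # safe because i < j
    return clusters


# Hypothetical expression profiles across four samples.
profiles = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [2.1, 3.9, 6.2, 8.0],
    "geneC": [4.0, 3.0, 2.0, 1.0],
    "geneD": [8.1, 6.0, 4.2, 1.9],
}
clusters = average_linkage(profiles, n_clusters=2)
print(clusters)
```

Using 1 - r keeps anti-correlated genes far apart; if you want genes with opposite behaviour to end up in the same cluster, 1 - |r| is a common alternative.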