I would be very grateful for suggestions on how best to tackle this project. I want to find out which SLC transporter transcripts are most highly expressed in the normal human hippocampus. For this, I plan to use publicly available Affymetrix U133 Plus 2.0 microarray data from ArrayExpress/GEO, taking the CEL files or normalized data for the normal/control tissues only. How can I combine data from different experiments/studies (all done on the U133 Plus 2.0 platform) to get the most reliable estimate, and a ranked list, of transcript abundance in normal human hippocampus? Thank you!
The most important step in your analysis is probably to re-run normalization and probe summarization from the CEL files. There are two reasons for this. First, it lets you use a consistent array design description for all arrays during pre-processing. Second, many normalization methods (e.g. quantile normalization) scale each array in the context of all chips in the experiment. If you take arrays out of that context and put them into a new one, the absolute values are no longer comparable. I would therefore recommend collecting all CEL files into a 'virtual experiment' and running normalization and summarization on them together, using the latest array description file (for Affymetrix arrays this is normally the chip definition file, .cdf).
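To illustrate why values normalized in one context cannot simply be moved into another, here is a minimal sketch of quantile normalization in plain Python. The function name `quantile_normalize` and the toy numbers are made up for illustration, ties are not handled, and a real analysis would use an established implementation (for example RMA) on the actual CEL files.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of arrays (equal-length lists, same probe order).

    Each value is replaced by the mean of the values that share its rank
    across all arrays. Ties are ignored in this sketch.
    """
    n = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: mean of the i-th smallest value over all arrays.
    reference = [sum(s[i] for s in sorted_arrays) / len(arrays) for i in range(n)]
    normalized = []
    for a in arrays:
        order = sorted(range(n), key=lambda i: a[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data (hypothetical): the same array x normalized with two different partners.
x = [2.0, 4.0, 6.0]
with_y = quantile_normalize([x, [4.0, 6.0, 8.0]])
with_z = quantile_normalize([x, [10.0, 20.0, 30.0]])

print(with_y[0])  # x as normalized alongside the first partner
print(with_z[0])  # x as normalized alongside the second partner
```

The same raw array ends up with different normalized values depending on which other arrays it was processed with, which is why all hippocampus CEL files should be normalized together in one run.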
Please check the ArrayExpress Atlas: http://www.ebi.ac.uk/gxa/
It is a curated subset of ArrayExpress, containing the studies that the curators consider useful for the kind of comparison you want to do.
If I remember correctly, it also provides data re-normalized with RMA, to make the data as comparable as possible. Even so, you will almost certainly need a statistical modelling approach that includes study as a factor.
If all you need is a rank ordering of SLC transporter transcripts, you could try sorting the expression levels in each array separately and then replacing each expression level with its sort order in the array. Then your "expression level" for each gene would be its median (or mean) rank (i.e. sort order) across all of the hippocampal arrays.
The advantage of this approach is that you need not worry about making all the measurements comparable. If you have enough samples, I bet you'll get virtually the same answer as a re-normalization approach.
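The rank-based idea above can be sketched as follows, using only the Python standard library. The function names and the probe labels (`SLC_a`, `other_1`, ...) are placeholders and the expression values are invented toy numbers; in practice each dictionary would hold all probe sets of one hippocampal array.

```python
from statistics import median


def rank_within_array(expr):
    """Map each probe to its rank within one array (1 = lowest expression).

    Tied values receive the average of the ranks they span.
    """
    ordered = sorted(expr.items(), key=lambda kv: kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = average_rank
        i = j + 1
    return ranks


def median_rank_order(arrays, probes_of_interest):
    """Return (probe, median rank) pairs, highest median rank first."""
    per_array = [rank_within_array(a) for a in arrays]
    med = {p: median(r[p] for r in per_array) for p in probes_of_interest}
    return sorted(med.items(), key=lambda kv: kv[1], reverse=True)


# Toy arrays on deliberately different scales (hypothetical values).
arrays = [
    {"SLC_a": 9.1, "SLC_b": 6.0, "other_1": 7.5, "other_2": 4.2},
    {"SLC_a": 11.0, "SLC_b": 8.3, "other_1": 8.9, "other_2": 5.0},
    {"SLC_a": 7.0, "SLC_b": 7.2, "other_1": 6.1, "other_2": 3.3},
]
ranking = median_rank_order(arrays, ["SLC_a", "SLC_b"])
print(ranking)
```

Ranks are computed over every probe on the array, not just the SLC probes, so each SLC transcript is placed relative to the whole transcriptome before the median is taken across arrays.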
In addition to Michael's and Chris's suggestions, you might also need to run ComBat.R to adjust for batch effects when combining data from different labs. See the answers to this question on combining gene expression from multiple arrays.
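To give a feel for what a batch adjustment does, here is a deliberately simplified sketch in plain Python. It is not ComBat: ComBat uses an empirical Bayes model for both location and scale, whereas this only shifts each study's mean for one gene to the overall mean. The function name `center_by_study`, the study labels, and the values are all hypothetical.

```python
from statistics import mean


def center_by_study(values, studies):
    """Remove per-study mean shifts for a single gene.

    values:  expression of one gene, one entry per sample
    studies: study (batch) label for each sample, same order as values
    Each study is shifted so its mean equals the grand mean across all
    samples. This is a location-only adjustment, not ComBat.
    """
    grand_mean = mean(values)
    by_study = {}
    for v, s in zip(values, studies):
        by_study.setdefault(s, []).append(v)
    study_mean = {s: mean(vs) for s, vs in by_study.items()}
    return [v - study_mean[s] + grand_mean for v, s in zip(values, studies)]


# Toy example: study_2 reads systematically higher than study_1.
values = [8.0, 9.0, 11.0, 12.0]
studies = ["study_1", "study_1", "study_2", "study_2"]
adjusted = center_by_study(values, studies)
print(adjusted)
```

After the adjustment both studies share the same mean for this gene while the within-study differences between samples are preserved. For real data, use the published ComBat implementation rather than this sketch.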