I have a question about mRNA expression datasets that I was hoping someone could answer. I'm interested in obtaining mRNA expression values across multiple cancer cell lines. I was looking at a dataset provided by the Cancer Cell Line Encyclopedia, which is described as "Gene-centric RMA-normalized mRNA expression data (ENTREZG v15 CDF)", with the note that "Either the original Affymetrix U133+2 CDF file or a redefined custom CDF file (ENTREZG - v15) was used for the summarization." In other datasets I've come across, a single gene often has multiple probes, each with a different expression value, whereas with this gene-centric custom CDF file there appears to be one expression value per gene (i.e. no multiple probes).
My question is this: how are these expression values generated with the redefined custom CDF file? Why is there one expression value per gene rather than separate expression values from multiple probes for the same gene? Would data summarized with the redefined custom CDF file be appropriate for downstream analysis such as differential gene expression?
Any help would be greatly appreciated! Thanks!