I have been looking at the relationship between log-transformed estimated RNA-seq counts and microarray (HG-U133a) gene expression. Luckily, TCGA has both kinds of data for several patients. I began by comparing the preprocessed data sets from TCGA, which are already at the gene level. While I was pleased with the results for the most part, I decided to download the microarray CEL files and process them with RMA so that I could look at the probe set level. Interestingly, some probe sets that map to the same gene have very different distributions.
I am curious why this happens. My first thought was that it has to do with the suffixes of the probe set IDs. I found an explanation of the suffixes on Affymetrix's website, but I'm a little confused by it. To be more specific:
"_at = all the probes hit one known transcript.
_a = all probes in the set hit alternate transcripts from the same gene
_s = all probes in the set hit transcripts from different genes
...
For HG-U133, the _a designation was not used; an _s probe set on these arrays means the same as an _a on any of the HG-U133 arrays. "
This quote is from http://www.affymetrix.com/estore/support/help/IVT_glossary/index.affx, and I'm assuming that the mention of HG-U133 includes HG-U133a. The last sentence is the confusing part. Is it saying that an _s probe set on an HG-U133 array means the same as an _a probe set on the arrays that actually use the _a designation?
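To make the suffix question concrete, here is a minimal sketch of how the suffix could be pulled out of a probe set ID so that probe sets can be grouped by type. The function name `probe_set_suffix` and the lookup table are illustrative, not part of any Affymetrix tool, and the table only contains the meanings given in the quote above; suffixes hidden behind the "..." are deliberately left out.

```python
import re

# Meanings copied from the glossary text quoted above. Suffixes elided by
# the "..." in the quote are intentionally not filled in here.
QUOTED_MEANINGS = {
    "": "all the probes hit one known transcript",
    "a": "all probes in the set hit alternate transcripts from the same gene",
    "s": "all probes in the set hit transcripts from different genes",
}

# Optional "_<lowercase letter>" immediately before a trailing "_at".
_SUFFIX_RE = re.compile(r"(?:_([a-z]))?_at$")


def probe_set_suffix(probe_set_id):
    """Return the letter before '_at' ('' if there is none).

    Returns None when the ID does not end in '_at' at all.
    """
    match = _SUFFIX_RE.search(probe_set_id)
    if match is None:
        return None
    return match.group(1) or ""


def describe(probe_set_id):
    """Look up the quoted meaning, or say that the quote does not cover it."""
    suffix = probe_set_suffix(probe_set_id)
    return QUOTED_MEANINGS.get(suffix, "not covered by the quoted text")
```

With this, an ID such as `1007_s_at` is classified as `s`, while an ID ending in plain `_at` gets the empty suffix. How an `s` should be interpreted on HG-U133a is exactly the open question, so the lookup only repeats the quote.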
I suppose my main question is: if I see very different distributions for two probe sets that map to the same gene, what could that mean? If they are measuring the expression of different isoforms/transcripts of the same gene, how can I find out which ones each probe set is measuring?
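As a rough illustration of what "different" could mean here, the sketch below flags pairs of probe sets that map to the same gene but correlate poorly across samples. It is only one possible proxy: a shift in overall level with a high correlation could simply reflect probe affinity, whereas a low correlation is more suggestive of the probe sets measuring different things. The probe set IDs, gene name, expression values, and the 0.5 threshold are all made up for the example.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (nan if undefined)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    denom = sqrt(var_x * var_y)
    return cov / denom if denom > 0 else float("nan")


def discordant_pairs(expr, probe_to_gene, threshold=0.5):
    """List (gene, probe_a, probe_b, r) for same-gene pairs with r < threshold.

    expr maps probe set ID -> expression values in a fixed sample order.
    Pairs with an undefined correlation (nan) are never flagged.
    """
    by_gene = {}
    for probe, gene in probe_to_gene.items():
        if probe in expr:
            by_gene.setdefault(gene, []).append(probe)

    flagged = []
    for gene in sorted(by_gene):
        for a, b in combinations(sorted(by_gene[gene]), 2):
            r = pearson(expr[a], expr[b])
            if r < threshold:
                flagged.append((gene, a, b, r))
    return flagged


# Toy data: hypothetical IDs and RMA-like log2 values for four samples.
toy_expr = {
    "p1_at": [1.0, 2.0, 3.0, 4.0],
    "p2_s_at": [4.0, 3.0, 2.0, 1.0],
    "p3_at": [2.0, 4.0, 6.0, 8.0],
}
toy_map = {"p1_at": "GENE_A", "p2_s_at": "GENE_A", "p3_at": "GENE_A"}
flagged = discordant_pairs(toy_expr, toy_map)
```

In the toy data, `p1_at` and `p3_at` track each other while `p2_s_at` runs in the opposite direction, so the two pairs involving `p2_s_at` are the ones flagged. The same comparison could be run against the log-transformed RNA-seq values for the gene to see which probe set agrees with them.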
Thanks for any insight.