I have RMA (Robust Multi-array Average) scores for the different genes (and their isoforms) on the Affymetrix chip. I want to know which of these genes are "active" (in other words, likely to produce enough protein product to have an effect). I'm not interested in whether they are differentially expressed or X-fold over- or under-expressed. All I want is to classify each gene as likely "on" or "off".
So far I have log-transformed (base 10) the RMA scores and centered them (subtracted the median). I called all genes with a transformed score <0 inactive and all genes with a score >0 active.
Does anyone have a better methodology?
I would suggest the following question instead of the one you're asking:
Can you actually determine if a gene is "active" (i.e. translated into protein) from [gene] expression data?
And I'll point you towards people who have published papers about it:
These are just a few papers that seem critical of such a correlation. That is not to say that there is no good correlation for any gene. But I would be very surprised if you could make a general rule about it without checking every cell type, tissue type and gene to see whether the correlation is acceptable or not.
Now, if you do a PubMed search for the terms "correlation mRNA protein", you will find many papers that check for such correlations, but mostly for specific genes in specific tissues (often for cancer diagnostic purposes).
If you do find papers that state such correlations, genome wide using microarray data, I'd be highly suspicious of that paper.
So, obviously, you cannot set "a" cut-off for determining this. My personal experience tells me that you can have gene transcription with no protein expression following it... Unfortunately, I have not published it yet :(
You're right in thinking that your methodology isn't a very good representation of the system. mRNAs (and their protein products) have a huge dynamic range. Some are going to be expressed constantly at extremely low levels, and at the other extreme, you'll have genes that are highly expressed, but only for a short period of time. Taking the median level as the dividing line between on and off is going to give you huge numbers of false negatives (genes that are actually being transcribed and translated, but that you'll classify as "off").
I'd look at what the background noise level is, then run some stats to determine which probes give you a signal significantly above that level. Any gene meeting that criterion should probably be considered "on". I suspect that may not divide the set as nicely as you'd hope, though.
Maybe if you tell us more about what exactly you're trying to do, we can offer more constructive advice.
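To make the background idea above a bit more concrete, here is a minimal Python sketch. It assumes you can get a set of background intensities on the same log scale as your gene scores (for example from negative-control probes, if your chip has them). The function name call_expressed, the background input and the 0.05 cut-off are all my own assumptions, not part of any Affymetrix or RMA tool. It computes an empirical one-sided p-value per probe set and applies a Benjamini-Hochberg correction.

```python
from bisect import bisect_left


def call_expressed(signal, background, alpha=0.05):
    """Flag probe sets whose signal is significantly above background.

    signal     : list of log-scale intensities, one per probe set
    background : list of log-scale intensities from background probes
                 (hypothetical input; what counts as background depends on the chip)
    alpha      : false discovery rate cut-off (an arbitrary choice)
    """
    bg = sorted(background)
    n_bg = len(bg)
    # Empirical one-sided p-value: share of background values at or above
    # the signal. The +1 terms keep the p-value from being exactly zero.
    pvals = [(n_bg - bisect_left(bg, s) + 1) / (n_bg + 1) for s in signal]

    # Benjamini-Hochberg adjustment, walking from the largest p-value down.
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min

    calls = [q <= alpha for q in adjusted]
    return calls, adjusted


if __name__ == "__main__":
    # Made-up numbers, only to show the call pattern.
    background = [3.8, 4.0, 4.1, 4.3, 3.9, 4.2, 4.0, 4.4, 3.7, 4.1] * 20
    signal = [4.0, 4.2, 9.5, 11.0]
    calls, adjusted = call_expressed(signal, background)
    for s, c, q in zip(signal, calls, adjusted):
        print(s, "on" if c else "off", round(q, 4))
```

With replicate arrays you would probably want a proper test per probe set instead of a single empirical p-value, and the smallest p-value you can get is limited by how many background values you have.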
Sounds like you're trying to find genes which actually switch from on to off (or vice versa) depending on cell type, condition, etc. Not all genes behave this way ... some are graded (like a dimmer switch). There are numerous papers that discuss techniques for finding genes with "bimodal" expression patterns. Since such genes are a mixture of two expression patterns, it is likely that they have an "on" and an "off" state.
This article explains the technique and includes Matlab code that should do the whole thing for you.
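If you don't have Matlab, the general idea can be sketched in plain Python. To be clear, this is not the code from the article, just one common way to look for bimodality: fit a single Gaussian and a two-component Gaussian mixture (via EM) to one gene's values across samples, then compare them with BIC. The function names, the starting values and the min_sd floor are my own assumptions.

```python
import math


def normal_pdf(x, mu, sd):
    """Density of a normal distribution with mean mu and standard deviation sd."""
    z = (x - mu) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(values, n_iter=200, min_sd=0.05):
    """EM fit of a two-component Gaussian mixture to one gene's values.

    Returns (means, sds, weights, log_likelihood). Component 0 is
    initialised on the lower half of the sorted values.
    """
    xs = sorted(values)
    n = len(xs)
    if n < 4:
        raise ValueError("need at least 4 samples")
    half = n // 2
    grand_mean = sum(xs) / n
    grand_sd = max(min_sd, math.sqrt(sum((x - grand_mean) ** 2 for x in xs) / n))
    mu = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    sd = [grand_sd, grand_sd]
    w = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: probability that each value belongs to component 0.
        resp = []
        for x in xs:
            p0 = w[0] * normal_pdf(x, mu[0], sd[0])
            p1 = w[1] * normal_pdf(x, mu[1], sd[1])
            resp.append(p0 / (p0 + p1) if p0 + p1 > 0.0 else 0.5)
        n0 = sum(resp)
        n1 = n - n0
        if n0 < 1e-6 or n1 < 1e-6:
            break  # one component has emptied out; keep the last estimates
        # M-step: update means, standard deviations and weights.
        mu = [sum(r * x for r, x in zip(resp, xs)) / n0,
              sum((1.0 - r) * x for r, x in zip(resp, xs)) / n1]
        var0 = sum(r * (x - mu[0]) ** 2 for r, x in zip(resp, xs)) / n0
        var1 = sum((1.0 - r) * (x - mu[1]) ** 2 for r, x in zip(resp, xs)) / n1
        sd = [max(min_sd, math.sqrt(var0)), max(min_sd, math.sqrt(var1))]
        w = [n0 / n, n1 / n]
    loglik = sum(
        math.log(max(1e-300, w[0] * normal_pdf(x, mu[0], sd[0])
                     + w[1] * normal_pdf(x, mu[1], sd[1])))
        for x in xs
    )
    return mu, sd, w, loglik


def looks_bimodal(values):
    """True if BIC prefers the mixture (5 parameters) over one Gaussian (2)."""
    n = len(values)
    mean = sum(values) / n
    sd = max(0.05, math.sqrt(sum((x - mean) ** 2 for x in values) / n))
    loglik_one = sum(math.log(max(1e-300, normal_pdf(x, mean, sd))) for x in values)
    mu, _, _, loglik_two = fit_two_gaussians(values)
    bic_one = 2 * math.log(n) - 2.0 * loglik_one
    bic_two = 5 * math.log(n) - 2.0 * loglik_two
    return bic_two < bic_one, mu
```

You would call looks_bimodal once per gene with that gene's scores across all your samples. For genes that come out bimodal, the two fitted means give you candidate "off" and "on" levels. It needs a reasonable number of samples per gene to be meaningful, and a gene that is "on" in every one of your samples will look unimodal here.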