I would like to come up with a list of genes that are not expressed in certain cell lines (e.g. HeLa). I suppose it might be possible to come up with a list based on a priori considerations, but I would prefer to use expression data to justify the choice of genes. From what I have read, ~50% of genes are not expressed in any one cell line, so presumably it should be possible to develop a high-confidence set. I am not yet sure just how many genes I need; for now I would say ~100.
One idea I had was to use the Affymetrix expression data from BioGPS (e.g. for HeLa, one of the cell lines in which I am interested). I thought I would sort genes by the mean (or median) of their probe intensities and then take the first X genes as examples of those that are not expressed. One problem I noticed right away while implementing this is that the probe intensity values vary greatly for some genes. It was also pointed out to me that comparing probe intensities within a sample can be problematic (e.g. intensities can vary due to secondary structure), and that comparisons are most informative and reliable between samples.
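To make this first idea concrete, here is a minimal sketch in Python. The gene names and intensities are made up for illustration, and `lowest_expressed` is just a placeholder name of mine, not anything provided by BioGPS. It assumes the data has already been parsed into a mapping from gene to its list of probe intensities for one sample.

```python
from statistics import median


def lowest_expressed(probe_intensities, n):
    """Return the n genes with the lowest median probe intensity.

    probe_intensities maps a gene symbol to the list of intensities of
    all probes annotated to that gene in a single sample.
    """
    gene_medians = {
        gene: median(values)
        for gene, values in probe_intensities.items()
        if values  # skip genes with no probes
    }
    # Sort ascending by median; break ties by gene name so the
    # result is deterministic.
    ranked = sorted(gene_medians, key=lambda g: (gene_medians[g], g))
    return ranked[:n]


# Made-up intensities, for illustration only.
hela = {
    "GENE_A": [12.0, 15.0, 11.0],
    "GENE_B": [850.0, 900.0],
    "GENE_C": [8.0, 300.0, 9.0],  # one probe disagrees strongly
    "GENE_D": [40.0, 45.0],
}
candidates = lowest_expressed(hela, 2)
```

Note that GENE_C ends up in the candidate list even though one of its probes is high, which is exactly the probe-variability problem described above.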
To improve my search for this list of unexpressed genes, I am considering getting data from GEO for untreated samples of my cell line of interest from different experiments (including ones that used different microarray platforms), and then keeping only the genes that are confirmed as unexpressed in multiple samples.
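A rough sketch of how I imagine this multi-sample filter working is below. It assumes each GEO sample has already been reduced to one value per gene, and it ranks genes within each sample rather than comparing raw intensities, since different platforms are on different scales. The function name `consistently_low` and all values are hypothetical.

```python
def consistently_low(samples, n):
    """Return genes ranked among the n lowest in every sample.

    samples is a list of dicts, each mapping gene -> summarized
    intensity for one sample. Ranking is done within each sample,
    so samples on different scales can be combined.
    """
    low_sets = []
    for sample in samples:
        ranked = sorted(sample, key=lambda g: (sample[g], g))
        low_sets.append(set(ranked[:n]))
    if not low_sets:
        return []
    # A gene must be in the low set of every sample; genes missing
    # from any one platform are therefore dropped.
    return sorted(set.intersection(*low_sets))


# Made-up values; the second sample mimics a log-scale platform.
sample_1 = {"GENE_A": 10.0, "GENE_B": 500.0, "GENE_C": 12.0, "GENE_D": 30.0}
sample_2 = {"GENE_A": 3.1, "GENE_B": 9.0, "GENE_C": 7.5, "GENE_D": 3.5}
sample_3 = {"GENE_A": 20.0, "GENE_B": 700.0, "GENE_C": 15.0, "GENE_D": 400.0}

consensus = consistently_low([sample_1, sample_2, sample_3], 2)
```

Requiring a gene to be low in every sample is the strictest choice; requiring it in, say, most samples would be a looser variant of the same idea.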
Another idea was to use the data for all the cell lines covered by BioGPS and come up with a list of genes whose intensity values are lowest in my cell line of interest (ranked, for example, by the difference between the mean or median probe intensity in my cell line of interest and that in the cell line whose mean/median is closest).
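The ranking I have in mind for this last idea could be sketched as follows, again with made-up data and a placeholder function name (`rank_by_gap`). It assumes one summarized value per gene per cell line, and scores each gene by how far the lowest value among the other cell lines sits above the value in the cell line of interest.

```python
def rank_by_gap(expression, target):
    """Rank genes by the gap between target and the nearest other line.

    expression maps cell line -> {gene: summarized intensity}.
    The gap for a gene is (lowest value among the other cell lines)
    minus (value in the target cell line). A large positive gap means
    the gene is clearly lower in the target than anywhere else.
    """
    others = [line for line in expression if line != target]
    gaps = {}
    for gene, value in expression[target].items():
        other_values = [
            expression[line][gene]
            for line in others
            if gene in expression[line]
        ]
        if other_values:
            gaps[gene] = min(other_values) - value
    # Largest gap first; ties broken by gene name.
    return sorted(gaps.items(), key=lambda item: (-item[1], item[0]))


# Made-up values, for illustration only.
expression = {
    "HeLa": {"GENE_A": 10.0, "GENE_B": 500.0, "GENE_C": 12.0},
    "HEK293": {"GENE_A": 300.0, "GENE_B": 480.0, "GENE_C": 15.0},
    "K562": {"GENE_A": 250.0, "GENE_B": 900.0, "GENE_C": 11.0},
}
ranked = rank_by_gap(expression, "HeLa")
```

One caveat with this scoring: a gene that is low in every cell line (like GENE_C here) gets a gap near zero, so this approach favors genes that are specifically low in the cell line of interest rather than genes that are simply unexpressed.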
Is there a better way to do this? I have very little experience working with expression data, so any suggestions are greatly appreciated.
Many thanks in advance for your help,