Question: Determining Which Genes Are Unexpressed In Cell Lines
7.5 years ago by
Andrew W290 wrote:

I would like to come up with a list of genes that are not expressed in certain cell lines (e.g. HeLa). I suppose it might be possible to come up with a list based on a priori considerations, but I would prefer to use expression data to justify the choice of genes. From what I have read, ~50% of genes are not expressed in any one cell line, so presumably it should be possible to develop a high-confidence set (I am not yet sure just how many genes; for now I would say ~100).

One idea I'd had was to use the Affymetrix expression data from BioGPS (e.g. for HeLa, one of the cell lines in which I am interested). I thought to sort genes by the mean (or median) of their probe intensities, and then take the first X genes as examples of those which are not expressed. One problem I noticed right away while implementing this is that the probe intensity values vary greatly for some genes. It was pointed out to me that comparing probe intensities within a sample can be problematic (e.g. intensities can vary due to secondary structure), and that comparisons are most informative and reliable between samples.

In order to improve my search for this list of unexpressed genes, I am considering getting data from GEO for untreated samples from different experiments (also using different microarray platforms) that used my cell line of interest, and then finding a list of genes that are confirmed as being unexpressed in multiple samples.

Another idea was to use the data for all the cell lines covered by BioGPS and come up with a list of genes whose intensity values are lowest in my cell line of interest (ranked, for example, by difference between mean or median of probe intensities in cell line of interest and that of cell line whose mean/median is closest).
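To make the ranking idea above concrete, here is a minimal sketch. It assumes the data has already been loaded into a nested dictionary of probe intensities per gene and cell line; the function name, the data layout, and the toy values are all hypothetical and are not taken from BioGPS. It keeps only genes whose median is lowest in the cell line of interest and ranks them by the gap to the nearest other cell line.

```python
from statistics import median

def rank_lowest_in_target(expr, target):
    """Rank genes whose median intensity is lowest in `target`.

    expr: {gene: {cell_line: [probe intensities]}}  (hypothetical layout)
    Returns (gene, gap) pairs, largest gap first, where gap is the distance
    between the target median and the nearest other cell line's median.
    """
    ranked = []
    for gene, by_line in expr.items():
        others = [median(v) for line, v in by_line.items() if line != target]
        if target not in by_line or not others:
            continue
        target_level = median(by_line[target])
        nearest = min(others)  # closest cell line when the target is lowest
        if target_level < nearest:
            ranked.append((gene, nearest - target_level))
    ranked.sort(key=lambda item: item[1], reverse=True)
    return ranked

# Toy values, invented purely for illustration.
expr = {
    "GENE_A": {"HeLa": [5, 6, 7], "K562": [50, 60, 70], "HepG2": [40, 45, 50]},
    "GENE_B": {"HeLa": [20, 22, 24], "K562": [25, 26, 27], "HepG2": [90, 95, 100]},
    "GENE_C": {"HeLa": [80, 85, 90], "K562": [10, 12, 14], "HepG2": [30, 35, 40]},
}
ranking = rank_lowest_in_target(expr, "HeLa")
```

Note that a large gap only shows the gene is lower in this cell line than elsewhere; it does not by itself show the gene is below the detection limit.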

Is there a better way to do this? I have very little experience working with expression data, so any suggestions are greatly appreciated.

Many thanks in advance for your help,


gene data • 3.6k views
modified 7.5 years ago by David Quigley11k • written 7.5 years ago by Andrew W290

Thank you, everyone, for all the suggestions. I only stumbled across BioStar recently, and I'm very impressed. I've had questions in the past which went unasked, as I didn't think they were appropriate for Bioconductor or Bioperl. I'm very happy to have found this forum!

I will update with a comment when I have implemented a solution.

Thank you again for your help,


written 7.5 years ago by Andrew W290
7.5 years ago by
San Francisco
David Quigley11k wrote:

My usual heuristic for setting a threshold for expressed genes is to look at expression levels of genes on the Y chromosome for samples derived from females. Since these are guaranteed to be bogus signals, anything expressed at a comparable level is background. Works only if you have cell lines derived from females, but with HeLa you're in luck.
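A rough sketch of this heuristic, assuming one summarized intensity per gene for a female-derived sample. The function names, the choice of the 95th percentile of the Y-linked signals as the cutoff, and the toy intensities are my assumptions for illustration, not part of the heuristic as described.

```python
def y_chromosome_threshold(intensity, y_genes, quantile=0.95):
    """Background cutoff taken from Y-linked genes in a female-derived sample.

    intensity: {gene: summarized intensity}
    y_genes:   set of Y-linked gene symbols
    quantile:  which point of the Y-linked distribution to use (assumption)
    """
    background = sorted(intensity[g] for g in y_genes if g in intensity)
    if not background:
        raise ValueError("no Y-linked genes found in the sample")
    index = min(len(background) - 1, int(quantile * len(background)))
    return background[index]

def unexpressed_genes(intensity, y_genes, quantile=0.95):
    """Genes (excluding the Y-linked ones) at or below the background cutoff."""
    cutoff = y_chromosome_threshold(intensity, y_genes, quantile)
    return sorted(
        g for g, value in intensity.items()
        if value <= cutoff and g not in y_genes
    )

# Toy intensities, invented for illustration only.
y_genes = {"DDX3Y", "KDM5D", "UTY"}
intensity = {
    "DDX3Y": 30, "KDM5D": 42, "UTY": 35,
    "GENE_A": 9000, "GENE_B": 8500,
    "GENE_C": 28, "GENE_D": 40, "GENE_E": 55,
}
cutoff = y_chromosome_threshold(intensity, y_genes)
below_background = unexpressed_genes(intensity, y_genes)
```

Using a high quantile rather than the maximum is meant to soften the effect of a single cross-hybridizing Y-linked probeset; the right choice probably depends on the platform.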

Remember that what you're really determining is "genes expressed below the threshold of detection for a microarray". More sensitive methods such as qPCR may identify transcripts even when a microarray cannot do so.

written 7.5 years ago by David Quigley11k

David Quigley I am applying this strategy to filter out "un-expressed" genes in my dataset. Do you know if there is any research paper out there implementing it?

written 4.6 years ago by komal.rathi3.4k
7.5 years ago by
Boston, MA USA
Larry_Parnell16k wrote:

A similar idea was explored in a 2010 paper that described "housekeeping genes" not expressed in a given mouse tissue, defined as genes expressed in all tissues examined but one. See the paper by Thorrez, Schuit, et al 2010 Genome Res 21(1):95-105. I recall that nearly half of the ~1050 such genes they identified were not expressed in testis. Very different basic metabolism going on there.

They used gene expression data in ways that are similar to your plan. Rather than hypothesize about an approach, I suggest the methods described in that paper.

written 7.5 years ago by Larry_Parnell16k

Thank you for suggesting this paper. I'm not sure that I will implement this approach, as I don't mind if the genes in my list are unexpressed in several cell lines. Using the Gene Expression Barcode or MAS5 calls will hopefully be sufficient.

written 7.5 years ago by Andrew W290

Sure. I just think it's good to see what other approaches are out there, to see what has worked.

written 7.5 years ago by Larry_Parnell16k
7.5 years ago by
National Institutes of Health, Bethesda, MD
Sean Davis25k wrote:

You might take a look at the paper "The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes", which probably meets your needs pretty directly. There is an online tool (linked below) to which you can upload your own samples and process them to generate their gene expression barcodes:

modified 7.5 years ago • written 7.5 years ago by Sean Davis25k

Thank you for pointing me to this resource. One thing I notice is that their estimate for % of genes unexpressed is higher than the figure (50%) I've most often heard cited. From their paper:

"in our estimate of the human transcriptome, most genes were primarily off, and a small proportion primarily on, across cell types: 76% of genes were off in at least 80% of tissues"

written 7.5 years ago by Andrew W290

I checked with one GEO sample and, if I've done things properly, only ~12% of the probes indicate expression. It's a binary call, so I don't see an easy way to reduce the list to a smaller number of high-confidence genes.

written 7.5 years ago by Andrew W290

I think what I'll do for now is use the MAS5 p-values for absent calls to get a high-confidence list (I only need ~100 genes, so I can be picky). Using multiple samples and, where possible, applying thresholds based on Y-chr genes, and using Gene Expression Barcode for confirmation, should give me a good list.
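The p-value part of this plan might look roughly like the sketch below. It assumes the MAS5 detection p-values have already been exported per probeset and per sample (for MAS5 detection calls, a large p-value means weak evidence of presence). The function name, the data layout, and the 0.06 cutoff are assumptions; 0.06 is the commonly cited default above which a probeset is called Absent, but check the settings your software actually used.

```python
def high_confidence_absent(detection_p, n_genes=100, absent_above=0.06):
    """Pick probesets that look absent in every sample, most convincing first.

    detection_p:  {probeset: [detection p-value in each sample]}
    absent_above: p-value above which a call counts as Absent (assumption)
    """
    candidates = []
    for probeset, pvals in detection_p.items():
        if not pvals:
            continue
        # The smallest p-value is the sample with the strongest hint of presence.
        weakest = min(pvals)
        if weakest > absent_above:
            candidates.append((probeset, weakest))
    candidates.sort(key=lambda item: item[1], reverse=True)
    return candidates[:n_genes]

# Toy p-values, invented for illustration only.
detection_p = {
    "ps1": [0.9, 0.8, 0.95],
    "ps2": [0.5, 0.03, 0.7],
    "ps3": [0.2, 0.3, 0.25],
    "ps4": [0.001, 0.002, 0.001],
}
shortlist = high_confidence_absent(detection_p, n_genes=2)
```

Ranking by the worst-case sample is deliberately conservative: one sample with a low p-value is enough to drop a probeset, which seems reasonable when only ~100 genes are needed.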

written 7.5 years ago by Andrew W290
7.5 years ago by
Gareth Palidwor1.6k wrote:

Affy probeset intensities aren't a great indicator of whether the associated genes are expressed or not.

Because affy probesets consist of multiple probes, there are various techniques that use this information to give an estimate of whether the given probeset is hybridizing.

I've used the Present/Marginal/Absent calls (based on p-value) from MAS5-style expression analysis extensively with good results; these are based on the variability of the probes that constitute the probeset (check the affy stats docs for details).

The newer exon and gene chips don't have paired mismatch probes in the probesets, so there is another algorithm called "Detection Above Background" (DABG), which estimates a p-value for probeset hybridization by comparing the member probes against global background probes with equivalent GC content. I've used this data a bit, but I haven't done any detailed review of its effectiveness and haven't really read up on it much, so I can't vouch for it.
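To illustrate the general idea (this is a simplified sketch, not Affymetrix's implementation): each probe gets an empirical p-value from the background probes with the same GC count, and the per-probe p-values are then pooled into one probeset p-value. Pooling with Fisher's method is my assumption here, as are the function names and the toy numbers; Fisher's method also treats the probes as independent, which real probes in one probeset are not.

```python
import math

def probe_p_value(intensity, background):
    """Empirical p-value: share of GC-matched background probes at least as bright.

    The +1 terms keep the p-value strictly above zero.
    """
    at_least = sum(1 for b in background if b >= intensity)
    return (at_least + 1) / (len(background) + 1)

def fisher_combine(p_values):
    """Combine p-values with Fisher's method (chi-square with 2k degrees of freedom)."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the chi-square statistic
    # Closed-form survival function for an even number of degrees of freedom.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total

def probeset_p_value(probes, background_by_gc):
    """probes: list of (intensity, gc_count); background_by_gc: {gc_count: [intensities]}."""
    return fisher_combine(
        [probe_p_value(value, background_by_gc[gc]) for value, gc in probes]
    )

# Toy intensities, invented for illustration only.
background_by_gc = {10: [20, 25, 30, 35, 40], 12: [30, 35, 40, 45, 50]}
p_bright = probeset_p_value([(500, 10), (600, 12)], background_by_gc)
p_dim = probeset_p_value([(22, 10), (33, 12)], background_by_gc)
```

With these toy numbers the bright probeset gets a much smaller p-value than the dim one, which is the behaviour one would want from a detection call.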

written 7.5 years ago by Gareth Palidwor1.6k

According to Affymetrix, DABG should not be used at the gene level:

In one of their whitepapers, they suggest identifying expressed genes by looking at the DABG p-values of the gene's exons and defining the gene as expressed if a certain percentage of its exons are expressed.
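A minimal sketch of that exon-based rule, assuming the exon-level DABG p-values are already available. The comment above does not give the actual numbers, so the 0.05 p-value cutoff and the 50% exon fraction below are placeholders, and the function name is hypothetical.

```python
def gene_is_expressed(exon_p_values, p_cutoff=0.05, min_fraction=0.5):
    """Call a gene expressed if enough of its exons are detected above background.

    exon_p_values: DABG p-values, one per exon-level probeset of the gene
    p_cutoff:      an exon counts as detected when p < p_cutoff (placeholder)
    min_fraction:  fraction of exons that must be detected (placeholder)
    """
    if not exon_p_values:
        return False
    detected = sum(1 for p in exon_p_values if p < p_cutoff)
    return detected / len(exon_p_values) >= min_fraction

# Toy p-values, invented for illustration only.
mostly_detected = gene_is_expressed([0.001, 0.02, 0.3, 0.004])
mostly_background = gene_is_expressed([0.4, 0.6, 0.01, 0.9])
```

For a list of unexpressed genes the rule can be used the other way round, keeping genes for which the function returns False in every sample.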

written 6.7 years ago by Pascal250
7.5 years ago by
United Kingdom
Duff660 wrote:

In the past I've used MAS5 absent calls to filter out 'unexpressed' genes. If you took a bunch of HeLa cell array studies, ran MAS5, got the 'absent' called genes and then overlapped the results from doing this a few times I think you might get a reasonable list.
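The overlap step could be sketched as follows, assuming the Present/Marginal/Absent calls for each sample have been exported as "P", "M" or "A" per probeset. The function name and data layout are hypothetical, and restricting to probesets seen in every sample is a simplification that only makes sense when the samples share a platform.

```python
from collections import Counter

def consistently_absent(calls_by_sample, min_fraction=1.0):
    """Probesets called Absent in at least `min_fraction` of the samples.

    calls_by_sample: {sample: {probeset: "P" | "M" | "A"}}
    Only probesets measured in every sample are considered.
    """
    n_samples = len(calls_by_sample)
    seen = Counter()
    absent = Counter()
    for calls in calls_by_sample.values():
        for probeset, call in calls.items():
            seen[probeset] += 1
            if call == "A":
                absent[probeset] += 1
    return sorted(
        p for p, count in seen.items()
        if count == n_samples and absent[p] / n_samples >= min_fraction
    )

# Toy calls, invented for illustration only.
calls_by_sample = {
    "study1": {"ps1": "A", "ps2": "A", "ps3": "A", "ps4": "P"},
    "study2": {"ps1": "A", "ps2": "P", "ps3": "A", "ps4": "P"},
    "study3": {"ps1": "A", "ps2": "A", "ps4": "P"},
}
strict = consistently_absent(calls_by_sample)
relaxed = consistently_absent(calls_by_sample, min_fraction=0.6)
```

Requiring absence in every sample gives the smallest and most conservative list; relaxing `min_fraction` trades confidence for more candidates.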

written 7.5 years ago by Duff660
7.5 years ago by
South San Francisco, CA
Ying W3.9k wrote:

Here are two more papers that discuss the identification of housekeeping genes:

Human housekeeping genes are compact.

Exploring the use of internal and external controls for assessing microarray technical performance

written 7.5 years ago by Ying W3.9k