Determining Which Genes Are Unexpressed In Cell Lines
12.6 years ago
Andrew W ▴ 290

I would like to come up with a list of genes that are not expressed in certain cell lines (e.g. HeLa). I suppose it might be possible to come up with a list based on a priori considerations, but I would prefer to use expression data to justify the choice of genes. From what I have read, ~50% of genes in any one cell line are not expressed, so presumably it should be possible to develop a high-confidence set (just how many genes, I am not yet sure; for now I would say ~100 genes).

One idea I'd had was to use the Affymetrix expression data from BioGPS (e.g. for HeLa, one of the cell lines in which I am interested). I thought to sort genes by the mean (or median) of their probe intensities, and then take the first X genes as examples of those which are not expressed. One problem I noticed right away while implementing this is that the probe intensity values vary greatly for some genes. It was pointed out to me that comparing probe intensities within a sample can be problematic (e.g. intensities can vary due to secondary structure), and that comparisons are most informative and reliable between samples.

In order to improve my search for this list of unexpressed genes, I am considering getting data from GEO for untreated samples from different experiments (also using different microarray platforms) that used my cell line of interest, and then finding a list of genes that are confirmed as being unexpressed in multiple samples.

Another idea was to use the data for all the cell lines covered by BioGPS and come up with a list of genes whose intensity values are lowest in my cell line of interest (ranked, for example, by the difference between the mean or median probe intensity in my cell line of interest and that of the cell line whose mean/median is closest).
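A minimal sketch of the ranking described in the previous paragraph, assuming a hypothetical nested dict of log2 probe intensities per gene and cell line. The data layout, the function name `rank_by_gap`, and all numbers are invented for illustration and are not BioGPS data.

```python
from statistics import median

def rank_by_gap(expr, target):
    """Rank genes by how far the target cell line sits below the closest other line.

    expr: {gene: {cell_line: [probe intensities]}} (hypothetical layout).
    Returns (gene, gap) pairs, largest gap first, where
    gap = median of the closest other cell line - median of the target line.
    """
    ranked = []
    for gene, by_line in expr.items():
        target_med = median(by_line[target])
        others = [median(v) for line, v in by_line.items() if line != target]
        # "Closest" means smallest absolute difference from the target median.
        closest = min(others, key=lambda m: abs(m - target_med))
        ranked.append((gene, closest - target_med))
    ranked.sort(key=lambda pair: pair[1], reverse=True)
    return ranked

# Toy log2 intensities, invented for illustration only.
toy = {
    "GENE_A": {"HeLa": [3.1, 3.0, 3.3], "K562": [9.8, 10.1, 9.9], "HepG2": [8.7, 9.0, 8.8]},
    "GENE_B": {"HeLa": [7.9, 8.2, 8.0], "K562": [8.1, 8.0, 8.3], "HepG2": [7.5, 7.6, 7.4]},
}
ranking = rank_by_gap(toy, "HeLa")
```

A large gap only says the gene is lower in the target line than elsewhere; it does not by itself show that the gene is unexpressed, so a detection call or background threshold would still be needed.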

Is there a better way to do this? I have very little experience working with expression data, so any suggestions are greatly appreciated.

Many thanks in advance for your help,

Andrew

gene data • 5.6k views
ADD COMMENT
0

Thank you, everyone, for all the suggestions. I only stumbled across BioStar recently, and I'm very impressed. I've had questions in the past which went unasked, as I didn't think they were appropriate for Bioconductor or Bioperl. I'm very happy to have found this forum!

I will update with a comment when I have implemented a solution.

Thank you again for your help,

Andrew

ADD REPLY
2
12.6 years ago

My usual heuristic for setting a threshold for expressed genes is to look at expression levels of genes on the Y chromosome for samples derived from females. Since these are guaranteed to be bogus signals, anything expressed at a comparable level is background. Works only if you have cell lines derived from females, but with HeLa you're in luck.

Remember that what you're really determining is "genes expressed below the threshold of detection for a microarray". More sensitive methods such as qPCR may identify transcripts even when a microarray cannot do so.
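The Y-chromosome heuristic above could be sketched roughly as follows. This assumes one summarized intensity per gene for a female-derived sample; the 95th-percentile cutoff, the function names, and the toy values are my own assumptions, not part of the original suggestion. RPS4Y1, DDX3Y, KDM5D and UTY are Y-linked genes, while the `GENE_*` names are placeholders.

```python
def y_chr_threshold(intensities, y_genes, quantile=0.95):
    """Background cutoff: a high quantile of Y-linked gene intensities.

    intensities: {gene: summarized log2 intensity} for one female-derived sample.
    """
    background = sorted(intensities[g] for g in y_genes if g in intensities)
    if not background:
        raise ValueError("no Y-linked genes found in the sample")
    # Nearest-rank style quantile; crude but adequate for a sketch.
    idx = min(len(background) - 1, int(quantile * len(background)))
    return background[idx]

def below_background(intensities, y_genes, quantile=0.95):
    """Non-Y genes whose intensity does not exceed the Y-derived cutoff."""
    cutoff = y_chr_threshold(intensities, y_genes, quantile)
    y = set(y_genes)
    return sorted(g for g, v in intensities.items() if g not in y and v <= cutoff)

# Toy values, invented for illustration only.
sample = {
    "RPS4Y1": 3.0, "DDX3Y": 3.4, "KDM5D": 2.8, "UTY": 3.9,
    "GENE_1": 3.2, "GENE_2": 9.5, "GENE_3": 3.8, "GENE_4": 6.1,
}
y_linked = ["RPS4Y1", "DDX3Y", "KDM5D", "UTY"]
cutoff = y_chr_threshold(sample, y_linked)
unexpressed = below_background(sample, y_linked)
```

With only a handful of Y-linked genes the quantile is effectively the maximum, so on real arrays it may be worth checking how sensitive the resulting list is to the chosen quantile.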

ADD COMMENT
0

David Quigley, I am applying this strategy to filter out "unexpressed" genes in my dataset. Do you know of any research paper that implements it?

ADD REPLY
1
12.6 years ago

A similar idea was explored in a 2010 paper that described "housekeeping genes" not expressed in a given mouse tissue, defined as genes expressed in all tissues examined but one. See the paper by Thorrez, Schuit, et al. 2010 Genome Res 21(1):95-105. I recall that nearly half of the ~1050 such genes they identified were not expressed in testis. Very different basic metabolism going on there.

They used gene expression data in ways that are similar to your plan. Rather than hypothesize about an approach, I suggest the methods described in that paper.

ADD COMMENT
0

Thank you for suggesting this paper (http://genome.cshlp.org/content/21/1/95.full). I'm not sure that I will implement this approach, as I don't mind if the genes in my list are unexpressed in several cell lines. Using the Gene Expression Barcode or MAS5 calls will hopefully be sufficient.

ADD REPLY
0

Sure. I just think it's good to see what other approaches are out there, to see what has worked.

ADD REPLY
1
12.6 years ago

You might take a look at the paper "The Gene Expression Barcode: leveraging public data repositories to begin cataloging the human and murine transcriptomes", which probably meets your needs pretty directly. There is an online tool (linked below) to which you can upload your own samples and have them processed into gene expression barcodes:

http://rafalab.jhsph.edu/barcode/index.php?page=sample_process

ADD COMMENT
0

Thank you for pointing me to this resource. One thing I notice is that their estimate for % of genes unexpressed is higher than the figure (50%) I've most often heard cited. From their paper (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3013751/?tool=pubmed):

"in our estimate of the human transcriptome, most genes were primarily off, and a small proportion primarily on, across cell types: 76% of genes were off in at least 80% of tissues"

ADD REPLY
0

I checked with one GEO sample (http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM161670), and, if I've done things properly, only ~12% of the probes indicate expression. It's a binary call, so I don't see an easy way to reduce the list to a smaller number of high-confidence genes.

ADD REPLY
0

I think what I'll do for now is use the MAS5 p-values for absent calls to get a high-confidence list (I only need ~100 genes, so I can be picky). Using multiple samples and, where possible, applying thresholds based on Y-chr genes, and using Gene Expression Barcode for confirmation, should give me a good list.
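As a rough sketch of picking a high-confidence list from MAS5 detection p-values: the idea is to score each probeset by its smallest p-value across samples (the sample with the strongest evidence of expression), keep only probesets that are still called Absent in that sample, and take the top n. The cutoffs 0.04 and 0.06 are the commonly cited MAS5 defaults, but treat them as assumptions and check them against your own software; the dict layout, function names, and p-values are made up.

```python
# Commonly cited MAS5 defaults; assumed here, verify against your own settings.
ALPHA1, ALPHA2 = 0.04, 0.06

def detection_call(p):
    """Map a detection p-value to a Present/Marginal/Absent call."""
    if p < ALPHA1:
        return "P"
    if p < ALPHA2:
        return "M"
    return "A"

def confident_absent(pvals, n=100):
    """pvals: {probeset: [detection p-value per sample]} (hypothetical layout).

    Returns up to n probesets that are Absent in every sample, ordered so that
    those with the weakest evidence of expression come first.
    """
    scored = []
    for probeset, per_sample in pvals.items():
        strongest = min(per_sample)  # sample with the most evidence of expression
        if detection_call(strongest) == "A":
            scored.append((probeset, strongest))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [probeset for probeset, _ in scored[:n]]

# Toy p-values, invented for illustration only.
toy_pvals = {
    "ps1": [0.90, 0.85, 0.95],
    "ps2": [0.50, 0.03, 0.70],
    "ps3": [0.30, 0.45, 0.25],
    "ps4": [0.001, 0.002, 0.0005],
    "ps5": [0.05, 0.50, 0.60],
}
shortlist = confident_absent(toy_pvals, n=2)
```

Scoring by the minimum rather than the mean means a single sample with a Present or Marginal call is enough to drop a probeset, which fits the goal of being picky.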

ADD REPLY
0
12.6 years ago
Gareth Palidwor ★ 1.6k

Affy probeset intensities aren't a great indicator of whether the associated genes are expressed or not.

Because affy probesets consist of multiple probes, there are various techniques that use this information to give an estimate of whether the given probeset is hybridizing.

I've used the Present/Marginal/Absent (based on p-value) calls for the MAS5 style expression analysis extensively with good results; these are based on the variability of the probes that constitute the probeset (check the affy stats docs for details http://www.affymetrix.com/support/technical/whitepapers/sadd_whitepaper.pdf).

The newer exon and gene chips don't have paired mismatch probes in the probesets, so there is another algorithm called "Detection Above Background" (DABG), which estimates a p-value for probeset hybridization by comparing the member probes against global background probes with equivalent GC content. I've used this data a bit, but I haven't done any detailed review of its effectiveness and haven't really read up on it much, so I can't vouch for it.

ADD COMMENT
0

According to Affymetrix, DABG should not be used at the gene level: https://stat.ethz.ch/pipermail/bioconductor/2010-September/035475.html

In one of their whitepapers, they suggest finding expressed genes by looking at the DABG p-values of a gene's exons and calling the gene expressed if a certain percentage of its exons are detected.
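The exon-level rule described above might look something like this. The p-value cutoff (0.05) and the required fraction of detected exons (0.5) are placeholders I picked for illustration, since the actual percentage from the whitepaper is not given here; the function name and data are also hypothetical.

```python
def gene_expressed(exon_pvals, p_cutoff=0.05, min_fraction=0.5):
    """Call a gene expressed if enough of its exons are detected above background.

    exon_pvals: DABG p-values, one per exon-level probeset of the gene.
    p_cutoff and min_fraction are illustrative assumptions, not published values.
    """
    if not exon_pvals:
        raise ValueError("gene has no exon-level p-values")
    detected = sum(1 for p in exon_pvals if p < p_cutoff)
    return detected / len(exon_pvals) >= min_fraction

# Toy DABG p-values, invented for illustration only.
exon_dabg = {
    "GENE_X": [0.001, 0.01, 0.20, 0.03],   # 3 of 4 exons detected
    "GENE_Y": [0.40, 0.75, 0.02, 0.60],    # 1 of 4 exons detected
}
calls = {gene: gene_expressed(p) for gene, p in exon_dabg.items()}
not_expressed = sorted(g for g, on in calls.items() if not on)
```

For a list of unexpressed genes, a stricter variant would require that no exon at all passes the cutoff, at the cost of a shorter list.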

ADD REPLY
0
12.6 years ago
Duff ▴ 670

In the past I've used MAS5 absent calls to filter out 'unexpressed' genes. If you took a bunch of HeLa cell array studies, ran MAS5, got the 'absent' called genes and then overlapped the results from doing this a few times I think you might get a reasonable list.
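The overlap step could be sketched like this, assuming the MAS5 runs have already produced one set of 'absent' gene identifiers per study. The function name, the `min_studies` parameter, and the toy identifiers are hypothetical.

```python
from collections import Counter

def consistently_absent(absent_sets, min_studies=None):
    """Genes called absent in at least min_studies of the given studies.

    absent_sets: list of sets of gene identifiers, one set per study.
    min_studies defaults to all studies (a strict intersection).
    """
    if min_studies is None:
        min_studies = len(absent_sets)
    counts = Counter(gene for study in absent_sets for gene in set(study))
    return sorted(gene for gene, c in counts.items() if c >= min_studies)

# Toy absent-call sets, invented for illustration only.
studies = [
    {"G1", "G2", "G3"},
    {"G2", "G3", "G4"},
    {"G2", "G4", "G5"},
]
strict = consistently_absent(studies)                  # absent in all three
relaxed = consistently_absent(studies, min_studies=2)  # absent in at least two
```

If the studies use different array platforms, the probesets would first need to be mapped to a common gene identifier before the sets can be compared.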

ADD COMMENT
0
12.6 years ago
Ying W ★ 4.2k

Here are another two papers that discuss the identification of housekeeping genes:

Human housekeeping genes are compact.

Exploring the use of internal and external controls for assessing microarray technical performance

ADD COMMENT
