Question: R + Bioconductor: Combining Probesets In An ExpressionSet
Mike Dewar (Columbia University, NYC, USA) wrote, 10.3 years ago:


Here's what I have:

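library(GEOquery)  # getGEO(), GDS2eSet()
library(Biobase)   # fData()
GDS <- getGEO("GDS785")
cd4T <- GDS2eSet(GDS)
# Keep only probesets with a non-empty gene symbol (same column as used further down)
cd4T <- cd4T[fData(cd4T)$Gene.symbol != "", ]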

Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing

gene_symbols = factor(fData(cd4T)$Gene.symbol)
[1] 6897

So only 6897 of my 19794 probesets have unique probeset -> gene mappings. I'd like to somehow combine the expression levels of each probeset associated with each gene. I don't care much about the actual probe id for each probe. I'd like very much to end up with an ExpressionSet containing the merged information as all of my downstream analysis is designed to work with this class.
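A minimal sketch of how the probeset-to-gene redundancy can be counted, using base R only. The symbol vector here is a made-up stand-in for fData(cd4T)$Gene.symbol, and the variable names are illustrative assumptions, not part of the original post.

```r
# Toy stand-in for fData(cd4T)$Gene.symbol (hypothetical values)
gene_symbols <- c("CD4", "CD4", "IL2", "FOXP3", "FOXP3", "FOXP3")

# Number of probesets assigned to each gene symbol
probes_per_gene <- table(gene_symbols)

n_probesets <- length(gene_symbols)        # total probesets
n_genes     <- length(probes_per_gene)     # distinct gene symbols
n_single    <- sum(probes_per_gene == 1)   # genes with exactly one probeset
n_redundant <- n_probesets - n_genes       # probesets beyond the first per gene

print(probes_per_gene)
```

Printing the table shows at a glance which genes are covered by several probesets.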

I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?

Modified 23 months ago by RamRS.

OK, first question: why do you want to combine the expression levels of multiple probesets into one gene? I have to say that with Affy data I almost exclusively work at the probeset level, and I'd imagine most other people do. There's a lot of information in those probesets, and you might not want to be chucking it away right from the outset.

Reply written 10.3 years ago by Daniel Swan

That's the way I would go about it. The problem is that probesets (especially from a chip like U133A which I think you're analysing) were designed to different builds of the underlying genome. Some probesets match multiple genes/transcripts/splice variants, some are misannotated etc. Best to work out which probesets are differentially expressed, then worry about disambiguating the gene level stuff at the end. Not to say that someone won't provide an answer to your problem however... :)
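A rough sketch of the "test probesets first, look at genes at the end" workflow described above. It is only an illustration under stated assumptions: the expression values, group labels and gene symbols are invented, and a plain Welch t-test per probeset stands in for whatever differential expression method you actually use.

```r
# Toy log2 expression matrix: 3 probesets x 6 samples (hypothetical values)
log_expr <- rbind(p1 = c(5.0, 5.2, 4.9, 7.1, 7.0, 6.8),
                  p2 = c(3.0, 3.1, 2.9, 3.0, 3.2, 2.8),
                  p3 = c(6.0, 6.1, 5.9, 6.0, 6.2, 5.8))
group   <- c("ctrl", "ctrl", "ctrl", "trt", "trt", "trt")
symbols <- c(p1 = "GENE_A", p2 = "GENE_A", p3 = "GENE_B")

# Step 1: test every probeset on its own
p_values <- apply(log_expr, 1, function(x) {
  t.test(x[group == "ctrl"], x[group == "trt"])$p.value
})
adj_p <- p.adjust(p_values, method = "BH")  # Benjamini-Hochberg correction

# Step 2: only now attach gene symbols to the probeset-level results
results <- data.frame(probeset = rownames(log_expr),
                      symbol   = unname(symbols[rownames(log_expr)]),
                      adj_p    = adj_p,
                      stringsAsFactors = FALSE)
hits <- results[results$adj_p < 0.05, ]
print(hits)
```

Here two probesets carry the same symbol but only one of them changes between groups, which is exactly the kind of information that merging probesets up front would hide.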

Reply written 10.3 years ago by Daniel Swan

I guess because this is how my limited understanding works! I'm looking for differentially expressed /genes/ one way or the other. Maybe I should be looking at differentially expressed probesets, then worry about which genes these probesets are associated with at the end of the analysis, rather than at the start? This being the standard approach would explain my failure googling...

Reply written 10.3 years ago by Mike Dewar

Similar questions related to probesets here: please take a look at Ian Simpson's suggestions on dealing with differential expression hits based on different probes of the same gene.

Reply written 10.3 years ago by Khader Shameer (modified 10 months ago by RamRS)
Nathan Harmston wrote, 10.3 years ago:

Typically this is not done, because you would lose a lot of information by doing it. You could take a geometric mean (the probesets in some gene expression data I had showed a log-normal distribution), but that is not the best option.
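For reference, a minimal base R sketch of the geometric-mean idea mentioned above. It assumes the values are already on a log2 scale, where the arithmetic mean of the logs equals the log of the geometric mean of the raw intensities. The matrix, symbols and variable names are made up for illustration.

```r
# Toy log2 expression matrix: 4 probesets x 3 samples (hypothetical values)
log_expr <- matrix(c(2, 4, 6,
                     4, 6, 8,
                     1, 1, 1,
                     3, 5, 7),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("p1", "p2", "p3", "p4"),
                                   c("s1", "s2", "s3")))
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_A")

# Sum the rows belonging to each gene, then divide by the probeset count
gene_sums  <- rowsum(log_expr, group = symbols)
n_probes   <- as.vector(table(symbols)[rownames(gene_sums)])
gene_means <- gene_sums / n_probes  # one row per gene, on the log2 scale

print(gene_means)
```

The result is a genes x samples matrix; all probeset-level differences within a gene are gone, which is the information loss the answer warns about.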

Typically you do want to reduce the number of probesets you keep in your analysis, to reduce the number of tests you make (which affects any FDR estimates). You could do this by selecting only one probeset per gene using some measure of dispersion, such as median absolute deviation (MAD) or interquartile range (IQR), and keeping the probeset with the most variability/spread as the representative for that gene (MAD is better IMO). As a sideline, this means you may actually be looking at the probeset which is subject to the most [...]. You may also want to remove probesets where the majority of the component probes map to multiple locations in the genome (probably leading to dodgy and unreliable results), maybe using SCAMPA, or probesets which contain G-spots/G-stacks.
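The one-probeset-per-gene selection by MAD described above could be sketched like this in base R. The data and names are invented for illustration; mad() is the base R function from the stats package.

```r
# Toy log2 expression matrix: 4 probesets x 4 samples (hypothetical values)
log_expr <- matrix(c(5.0, 5.1, 4.9, 5.0,
                     3.0, 6.0, 2.0, 7.0,
                     8.0, 8.5, 7.5, 8.0,
                     4.0, 4.0, 4.0, 4.1),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(c("p1", "p2", "p3", "p4"),
                                   paste0("s", 1:4)))
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_B")

# Median absolute deviation of each probeset across samples
probe_mad <- apply(log_expr, 1, mad)

# Rank probesets from most to least variable, then keep the first
# occurrence of each gene symbol in that ranking
ord  <- order(probe_mad, decreasing = TRUE)
keep <- sort(ord[!duplicated(symbols[ord])])

filtered     <- log_expr[keep, , drop = FALSE]
kept_symbols <- symbols[keep]

print(data.frame(probeset = rownames(filtered), symbol = kept_symbols))
```

Because keep is just a vector of row indices, the same index should work on the ExpressionSet itself (for example cd4T[keep, ]), which would leave you with an ExpressionSet for downstream analysis. Swapping mad for IQR gives the interquartile-range variant.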

But then which part of the gene the probeset maps to (exons or introns) is also important. Probes which map to different exons may show big differences:

and some people have suggested that it is useful to map probes and probesets to transcripts rather than genes:

Hopefully this will give you some ideas what to do with your probesets and reduce the number of them.


Thanks so much. Looks like things are (as always) more complicated than they seem! Will print off some papers and head for a coffee! Cheers!

Reply written 10.3 years ago by Mike Dewar
Geoffjentry wrote, 10.3 years ago:

So I said something completely different over on SO, but reading through the comments, one thing that came to mind is that there exist alternate annotations for Affy chips, which end up producing a single gene per probeset (in some cases), and there is some evidence that this is a valuable thing to do:

Ekta Jain wrote, 8.5 years ago:

Hello, limma in R can give you a list of differentially expressed genes. limma can also average the expression of multiple probesets that share an identifier (its avereps() function does this). I do not know how to simply use the probesets with the highest signal intensity.

Hope this helps.

