Question: R: Topgo Usage Question
13
gravatar for Mike Dewar
8.9 years ago by
Mike Dewar1.5k
Columbia University, NYC, USA
Mike Dewar1.5k wrote:

I'm struggling using topGO to do some GO enrichment. Apologies in advance for the trivial nature of this question: I must be misunderstanding something.

I have 200 probesets, and I'd like to see if they're significantly represented in any GO terms. The names of the probesets are in a character vector called interesting_genes. I have an ExpressionSet object called exprset which contains all the data from the microarray experiment I'm working on.

I'd like to use the R package topGO. From one of Khader's answers to a previous post, I know that I should be able to use this package to perform enrichment without having to pass in data of any kind (indeed, most of the web-tools I can use for this only require a list of genes, and to select a background set). It's just I'm having trouble persuading the API to let me do what I want, and I'm having trouble interpreting the errors that result. Here is my code:

genes = factor(as.integer(rownames(exprs(exprset)) %in% interesting_genes))
names(genes) <- rownames(exprs(exprset))

all_genes = factor(rep(0,nrow(exprset))
names(all_genes) <- rownames(exprs(exprset))
levels(all_genes) <- levels(genes)

GOdata <- new("topGOdata", description = "getGO", ontology = "BP", 
    allGenes = all_genes, geneSel = genes, nodeSize = 10, 
    GO2gene = list(mogene10sttranscriptclusterGO2PROBE), annot = annFUN.GO2genes
)

Note the awful construction of all_genes - this must be wrong. But topGO requires a "named object of type numeric of factor", and subsequently demands two factor levels, even though all that really makes sense to pass, in this simple approach to enrichment, is a list of names.

Currently the error I'm getting is

Error in 
if is.na(index) || index < 0 || index > length(nd)) stop(paste("selected vertex",  : 
  missing value where TRUE/FALSE needed

but I don't really know how to interpret this. I'm sure it has something to do with the way I'm forming the object. So my question is:

How do I form a topGO object without using any expression data or scoring information? Specifically, what am I misunderstanding here?

gene R bioconductor microarray • 14k views
ADD COMMENTlink modified 4.8 years ago by moonlughts0 • written 8.9 years ago by Mike Dewar1.5k

just after I asked this question I realised something about how to pass in the annotations. So have edited the question a bit to cure that particular tidbit of naivete. Still, stuck, though.

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

Can you post few ids from your list ?

ADD REPLYlink written 8.9 years ago by Khader Shameer18k

@Khader - yep! "10351026" "10463751" "10537347" "10602176" "10426648" "10462752" ...

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

Despite the two helpful answers below, I still can't get this guy to work. It's starting to feel more like a Bioconductor list question now, though.

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

Despite the two helpful answers from Brad and lGautier below, I still can't get this guy to work. It's starting to feel more like a Bioconductor list question now, though.

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

If you can map your probe ids to genes, then you can easily do the analysis with out expression data or scoring information : see this link for a working code - http://bit.ly/bk6Ylp (posted by Chuangye earlier)

ADD REPLYlink written 8.9 years ago by Khader Shameer18k

@Khader - thanks! That provided the clues I needed. It turns out that I was maybe making life a bit more complicated than necessary. Shall I put what I did in the end as answer, or just delete the question? I guess this turned out to be a bit more of a Bioconductor mailing list question...

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

Please don't delete the question. As you already posted a final solution, let this remain as a source for future reference on enrichment with out p-values.

ADD REPLYlink written 8.9 years ago by Khader Shameer18k
13
gravatar for Mike Dewar
8.9 years ago by
Mike Dewar1.5k
Columbia University, NYC, USA
Mike Dewar1.5k wrote:

By cobbling together both responses, and the link suggested by Khader, it turns out that this is all I need:

library(topGO)
library(mogene10sttranscriptcluster.db)
# exprset is an Expression Set object
# interesting_genes are the probesets I'd like to perform GO Enrichment on
# first pull out all the probesets
all_genes <- rownames(exprs(exprset))
# then make a factor that is 1 if the probeset is "interesting" and 0 otherwise
geneList <- factor(as.integer (all_genes %in% interesting_genes))
# name the factor with the probeset names
names (geneList) <- allGenes
# form the GOdata object
GOdata <-new ("topGOdata", 
    ontology = "BP", 
    allGenes = geneList, 
    nodeSize = 5, 
    # annot, tells topGO to map from GO terms to "genes"
    annot = annFUN.GO2genes,
    # so annot then calls something to perform this mapping GO2genes, 
    # which is this from the mogene... library
    GO2genes = as.list(mogene10sttranscriptclusterGO2PROBE)
)

So, things are bit simpler than I thought, and the main confusion seemed to be about how to specify allGenes and geneSel. I guess this is a very basic, and maybe not that standard, use of topGO!

ADD COMMENTlink modified 8.9 years ago • written 8.9 years ago by Mike Dewar1.5k
3

I didn't want to complain, but I think maybe the topGO documentation is a little confusing. Following the example mentioned by Khader, I've constructed allGenes as a factor, indicating which genes I'm interested in, and it seems to work. The subsequent Fisher test seems to be giving me sensible results, and hence I'm loathe to mess with it. Especially as I thought this was going to be a 10 minute job, three days ago. But you're right - the responsible thing to do would be to check with the bioconductor mailing list...

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k
1

Thanks for your question + answer Mike :-) I was having the same problems as you were having, solved now :-)

ADD REPLYlink written 6.2 years ago by Irsan6.9k

Not sure of what your are getting back. The parameter "allGenes" seems to have to be a named numerical vector (try running in R example("topGOdata-class") ). At the same time the man page mentions that "allGenes" should be a vector of strings :/. I'd bump that to the bioconductor mailing list.

ADD REPLYlink written 8.9 years ago by Laurent Gautier810
7
gravatar for Brad Chapman
8.9 years ago by
Brad Chapman9.4k
Boston, MA
Brad Chapman9.4k wrote:

allGenes should be a named vector where the names are the IDs and the values are p-values, or some other value to help determine the selected and non-selected genes:

> vec <- c(1, 0, 1)
> names(vec) <- c("id1", "id2", "id3")
> vec
id1 id2 id3 
  1   0   1

Then geneSel is a function that uses a p-value to decide if a name is selected or not. For instance, if you wanted values of 0 to be selected, and other not:

> selector <- function(theScore) {
+ return (theScore == 0)}
> selector(0)
[1] TRUE
> selector(1)
[1] FALSE

A while back, I wrote up some Python, Rpy2 and R code using topGO that goes into a more complete start to end example:

http://bcbio.wordpress.com/2009/10/18/gene-ontology-analysis-with-python-and-bioconductor/

GOstats uses universe and test lists of IDs and is worth looking at as well:

http://www.bioconductor.org/packages/release/bioc/html/GOstats.html

ADD COMMENTlink written 8.9 years ago by Brad Chapman9.4k

Thanks! I hadn't realised that geneSel should be a function.

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k
2
gravatar for Laurent Gautier
8.9 years ago by
Laurent Gautier810 wrote:

Giving standalone examples is generally better as it lets one ensure that :

  • the problem is reproducible

  • lowers the overhead for others to reproduce it

(and to make readers feel like they are on the bioconductor mailing-list, giving the sessionInfo() should be mandatory).

The parameter "geneSel" expands to "geneSelectionFunction" and should be a function.

Beside that, the construction of "all_genes" does indeed seem a little daring:

library(Biobase)
data(sample.ExpressionSet)

# having all to zero might cause problems afterwards btw
all_genes <- rep(0, nrow(sample.ExpressionSet))
names(all_genes) <- featureNames(sample.ExpressionSet)
ADD COMMENTlink written 8.9 years ago by Laurent Gautier810

Thanks for the response. Yeah, I agree with the reproducible problem thing, if this were a programming question, which I guess is how I've written it up. I should have probably asked this as a "what do these bits of the topGO API mean" question but the above seemed clearer at the time. And if the answer is something to do with my sessionInfo then I don't want to be using the package! I'm nervous of any function that will allow my sessionInfo to mess it up. I shall make geneSel into a function - I hadn't picked that up from the documentation at all!

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

Also that I used length(rownames(exprs(exprset))) instead of nrow(exprset) is slightly embarrassing. Am going to edit my OP and hope no-one reads these comments... The variable all_genes does seem to have to be a factor though, for some reason. Or, at least, that's the error I'm getting right now.

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

I take this back - about all_genes having to be a factor. Though it's unhappy when all_genes is all zeros (or all ones), if I pass it some real values it's fine.

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k

I take this back - about all_genes having to be a factor. Though it's unhappy when all_genes is all zeros (or all ones), if I pass it some real values it's fine. And when I say fine, I don't mean that it starts working, it just doesn't complain about factors.

ADD REPLYlink written 8.9 years ago by Mike Dewar1.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1023 users visited in the last hour