Question: Hypergeometric Test R/Bioconductor Result Categorization
2
gravatar for Thaman
9.0 years ago by
Thaman3.2k
Finland
Thaman3.2k wrote:

Hi,

This problem is in R/Bioconductor, topGO and GOstats packages. I want to perform HyperGeometric test for OVER representation against GO and KEGG. I have go two text files locally available: Back.txt (background with only entrez id and more then 2000 ID's- Illumina HT12 v3. microarray) and **genes.txt** (differently expressed genes- Illumina HT12 v3. microarray). The result should be in R data.frame with following fileds (GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)) and visualizing in the Gograph or KEGG pathway.

I have perform hyperGTest and got the result but I am not sure whether the generated result is correct or not as per my OVER represented need against GO and KEGG. How can I check my result with other databases for consistency? I have tried with DAVID but couldn't figure out.

I have following code

 library(topGO)

 library(GOstats)

 universe=read.table("Back.txt", sep=",")  # Background files where only entrez id's are listed without heading column

 tbl <- read.table ("genes.txt", sep=",")  # selected genes with following header Probes_id,entrez_gene_id,symbols,P.Value and F.C

 selected=<-tbl$V2  # Selecting only second column of tbl vector where entrez_gene_id is present


 param<- new ("GOHyperGParams", geneIds = selected,

 universeGeneIds=universe, annotation="org.Hs.eg.db",

 ontology="BP",pvalueCutoff=0.1, conditional=FALSE,testDirection="over")

 hyp<-hyperGTest(param)

 summary(hyp)

For the trial I wrote summary(hyp) which gives me GOBPID, Pvalue, OddsRatio,ExpCount,Count, Size and Term.

But I want to get result in data.frame with following fields consisting GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)

I have provided the genes.txt sample link for convinence and Back.txt contains only entrez_id

*Note- This is a follow up question.

gene R bioconductor kegg enrichment • 6.3k views
ADD COMMENTlink modified 8.8 years ago by Brad Chapman9.4k • written 9.0 years ago by Thaman3.2k
1

Your question is unclear. You appear to be doing your test against the Biological Process GO field in the annotation db. So why would this return the KEGG pathway in any subsequent data frame? If you want to extract the information from the annotation database to integrate with the data, then you need to learn to look stuff up in the annotation package. The data summary(hyp) returns all the relevant information for your query, just as analysing it in DAVID would.

ADD REPLYlink written 9.0 years ago by Daniel Swan13k
1

Hi Thaman, there are only few possible reason for not getting an answer: nobody knows, nobody cares, or nobody understood the question. Mainly here its a mixture, caused by point # three. People have asked you repeatedly for a reproducible example. Something that I could paste in my R session and run it. Make the sample data downloadable (all your inputs required + genes.txt) and even ppl not totally familiar with the package might help you out. I guess I could help but if you keep ignoring comments..., so please stop whining and re-opening and put an reproducible example

ADD REPLYlink written 9.0 years ago by Michael Dondrup46k

My problem with this question is that it seems to be asking about 3 different things. It's very unclear what you are trying to achieve, particularly without sample data.

ADD REPLYlink written 9.0 years ago by Neilfws48k

@ Daniel- You are right about the confusion, but I have mentioned this in my previous question. I have two options either to do it from GO or KEGG so that's why result also changes according to annotation packages I have been using. I tried to look GOstats and topGO but I din't find extracting specific information in result sections like my requirement (GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)).

@ Micheal and newilfws- Okei I will edit questions again and make it more precise with sample data.

ADD REPLYlink written 9.0 years ago by Thaman3.2k

@ Daniel- You are right about the confusion, but I have mentioned this in my previous question. I have two options either to do it from GO or KEGG so that's why result also changes according to annotation packages I have been using. I tried to look GOstats and topGO but I din't find extracting specific information in result sections like my requirement (GO_term_id, Go_term_name, p_value and number of associated genes from my genes list(Genes.txt file)).

ADD REPLYlink written 9.0 years ago by Thaman3.2k

@ Micheal and newilfws- Okei I will edit questions again and make it more precise with sample data.

ADD REPLYlink written 9.0 years ago by Thaman3.2k

@ Daniel, Michael, neilfws- I have edited my questions in the most precise way. Hope now it's clear

ADD REPLYlink written 9.0 years ago by Thaman3.2k

It's highly appreciated :)

ADD REPLYlink written 9.0 years ago by Michael Dondrup46k

@ Michael, so will you guide me how to solve my probelm now?

ADD REPLYlink written 9.0 years ago by Thaman3.2k

If Brad hadn't been faster, I would, let us know if that solves your problem.

ADD REPLYlink written 9.0 years ago by Michael Dondrup46k
11
gravatar for Brad Chapman
9.0 years ago by
Brad Chapman9.4k
Boston, MA
Brad Chapman9.4k wrote:

Here is the code, with comments, to produce the output table with gene names. The manipulations are broken down into step by step to try and make them as clear as possible. The two key bits are:

  • Use the human annotation table to extract Entrez gene IDs linked to your GO identifiers.
  • Subset your list of selected genes based on that list of Entrez IDs.

This is applied to each row of the table, then pushed back into the final data table for output:

library(GOstats)

#universe <- read.table("Back.txt")
inGenes <- read.table ("genes.txt", sep="\t", header=TRUE)
selected <- unique(inGenes$Entrez_Gene_ID)

param <- new("GOHyperGParams", geneIds=selected,
             #universe=universe,
             annotation="org.Hs.eg.db", ontology="BP",pvalueCutoff=0.1,
             conditional=FALSE, testDirection="over")
hyp <- hyperGTest(param)
sumTable <- summary(hyp)
# subset the output table to get the columns of interest 
# (GO ID, GO description, p-value)
out <- subset(sumTable, select=c(1, 7, 2))
# retrieve input genes associated with each GO identifier
# use the org.Hs.eg data mapping to get GO terms for each ID
goMaps <- lapply(out$GOBPID, function(x) unlist(mget(x, org.Hs.egGO2ALLEGS)))
# subset the selected genes based on those in the mappings
goSelected <- lapply(goMaps, function(x) selected[selected %in% x])
# join together with a semicolon to make up the last column
out$inGenes <- unlist(lapply(goSelected, function(x) paste(x, collapse=";")))
# write the final data table as a tab separated file 
write.table(out, file="go_results.tsv", sep="\t", row.names=FALSE)
ADD COMMENTlink modified 6 weeks ago by RamRS24k • written 9.0 years ago by Brad Chapman9.4k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2745 users visited in the last hour