Dear Biostars Community,
i would like to implement and perform in R a fisher's exact test for gene-set enrichment analysis, with the goal of testing for overepresentation of DE genes, in various gene-sets based on a pertubagen. For my purpose, i used the following R script (small part of the code):
part of the code which tests the independence of upregulated user input DE symbols, and upregulated gene-sets of drug pertubagen experiments:
....
score.matrix.total <- matrix(NA, nrow=1, ncol=33132) # **a matrix that will save the fisher's p values below**
dimnames(score.matrix.total)[[2]] <- as.character(names(list.drug.experiments.up)) # **the name of each of the experiments with upregulated genes**
for (i in 1:length(list.drug.experiments.up)) {
mat <- matrix(c(length(background.universe.symbols)-length(union(genes.up,list.L1000.up[[i]])),
length(setdiff(genes.up, list.L1000.up[i])),
length(setdiff(list.L1000.up[[i]],genes.up)),
length(intersect(genes.up,list.L1000.up[[i]]))
),
nrow=2)
fr <- fisher.test(mat,alternative = "two.sided")
score.matrix.total[,i] <- fr$p.value
}
dat <-as.data.frame(score.matrix.total)
dat2 <- stack(dat)....
For the clarification of the above code:
**genes.up=the upregulated user's gene symbols
list.L1000.up= an R object list that contains the upregulated genes for each pertubagen experiment
list.drug.experiments.up=a character vector of the same length as list.L1000.up, which contains the names of all experiments.**
background.universe.symbols= the total gene symbols used in the experiment
Thus, my questions are the following:
1) Concerning the construction of the needed contigency table: is my approach correct, about the creation of the contigency table, as an input for the fisher's exact test ? or there is a more simple implementation ?
2) Regarding the fisher's exact test function: "two.sided" option is the most safe, for two-sided tests ? or in this case, as i search for over-representation, and thus enrichment, the alternative option "greater" would also be more suitable ?
I also perform the same code for the relative down-regulated input genes, and the associated downregulated gene-sets experiments, for each of the same drugs.
I find your code difficult to understand, starting from the first line: Why is there only one row in the
score.matrix.total
matrix? What does the 33132 columns correspond to?Also, a small example of the expected input/output would be helpful.
Regarding your second question: yes, of course, if you want two-sided test, then the two-sided option is the best :) Note that you might want to use pvalue correction since you are doing a lot of multiple testing.
Dear Carlo,
thank you for your answer and please excuse me for any missing parts:
1) I created the matrix with only a row, as each of the experiment scores would be in each of the 33132 columns. Then finally i use the stack function after converse it to a data frame-
2) The 33132 number is just a small simplicity in order not to post a too long code chunk-it is actually the total number of drug pertubagen experiments (both for up and down-regulated gene sets)
3) My question about the argument of alternative for the one-side versus two-sided tests, is that because i essentially search fo r enrichments of DE genes, in the up-regulated/pertubated and down-regulated gene-sets respectively, whereas "greater" option for enrichment is more suitable here.
I will also provide as soon as possible a more illustrative example
Here is a more illustrative example, without the for loop and with just one "arbitary" experiment: