Question

Appropriate creation of contigency table and impementation of fisher's exact test, for functional enrichment analysis in R

0

Entering edit mode

7.4 years ago

svlachavas ▴ 790

Dear Biostars Community,

i would like to implement and perform in R a fisher's exact test for gene-set enrichment analysis, with the goal of testing for overepresentation of DE genes, in various gene-sets based on a pertubagen. For my purpose, i used the following R script (small part of the code):

part of the code which tests the independence of upregulated user input DE symbols, and upregulated gene-sets of drug pertubagen experiments:

....
 score.matrix.total <- matrix(NA, nrow=1, ncol=33132) # **a matrix that will save the fisher's p values below**
  dimnames(score.matrix.total)[[2]] <- as.character(names(list.drug.experiments.up))  # **the name of each of the experiments with upregulated genes**


for (i in 1:length(list.drug.experiments.up)) {

  mat <- matrix(c(length(background.universe.symbols)-length(union(genes.up,list.L1000.up[[i]])),
  length(setdiff(genes.up, list.L1000.up[i])),
  length(setdiff(list.L1000.up[[i]],genes.up)),
  length(intersect(genes.up,list.L1000.up[[i]]))
  ),
  nrow=2) 

  fr <- fisher.test(mat,alternative = "two.sided") 
  score.matrix.total[,i] <- fr$p.value

              }

  dat <-as.data.frame(score.matrix.total)
  dat2 <- stack(dat)....

For the clarification of the above code:

**genes.up=the upregulated user's gene symbols

list.L1000.up= an R object list that contains the upregulated genes for each pertubagen experiment

list.drug.experiments.up=a character vector of the same length as list.L1000.up, which contains the names of all experiments.**

background.universe.symbols= the total gene symbols used in the experiment

Thus, my questions are the following:

1) Concerning the construction of the needed contigency table: is my approach correct, about the creation of the contigency table, as an input for the fisher's exact test ? or there is a more simple implementation ?

2) Regarding the fisher's exact test function: "two.sided" option is the most safe, for two-sided tests ? or in this case, as i search for over-representation, and thus enrichment, the alternative option "greater" would also be more suitable ?

I also perform the same code for the relative down-regulated input genes, and the associated downregulated gene-sets experiments, for each of the same drugs.

fishers exact test overepresentation analysis R • 3.1k views

ADD COMMENT • link updated 7.4 years ago by Carlo Yague 9.0k • written 7.4 years ago by svlachavas ▴ 790

0

Entering edit mode

I find your code difficult to understand, starting from the first line: Why is there only one row in the score.matrix.total matrix? What does the 33132 columns correspond to?

Also, a small example of the expected input/output would be helpful.

Regarding your second question: yes, of course, if you want two-sided test, then the two-sided option is the best :) Note that you might want to use pvalue correction since you are doing a lot of multiple testing.

ADD REPLY • link 7.4 years ago by Carlo Yague 9.0k

0

Entering edit mode

Dear Carlo,

thank you for your answer and please excuse me for any missing parts:

1) I created the matrix with only a row, as each of the experiment scores would be in each of the 33132 columns. Then finally i use the stack function after converse it to a data frame-

2) The 33132 number is just a small simplicity in order not to post a too long code chunk-it is actually the total number of drug pertubagen experiments (both for up and down-regulated gene sets)

3) My question about the argument of alternative for the one-side versus two-sided tests, is that because i essentially search fo r enrichments of DE genes, in the up-regulated/pertubated and down-regulated gene-sets respectively, whereas "greater" option for enrichment is more suitable here.

I will also provide as soon as possible a more illustrative example

ADD REPLY • link 7.4 years ago by svlachavas ▴ 790

0

Entering edit mode

Here is a more illustrative example, without the for loop and with just one "arbitary" experiment:

head(dat) # a data table with selected DE genes
  Rank Gene.Symbol                           Definition Clusters Enriched.Clusters Interactors
1    1        CCT8 chaperonin containing TCP1 subunit 8       10                 9           6
2    2        CCT3 chaperonin containing TCP1 subunit 3       10                 9           8
3    3        CCT2 chaperonin containing TCP1 subunit 2        9                 9           7
4    4        CCT7 chaperonin containing TCP1 subunit 7        9                 9           6
5    5        CCT5 chaperonin containing TCP1 subunit 5        9                 9           7
6    6       PARP1        poly(ADP-ribose) polymerase 1        8                 8           1
  Drugs Fold.Change  X
1     0      0.6126 NA
2     1      0.7894 NA
3     0      0.8877 NA
4     0      0.7911 NA
5     0      0.6376 NA
6    15      0.5983 NA

genes.up <- dat$Gene.Symbol[which(dat$Fold.Change>0)]
        genes.down <- dat$Gene.Symbol[which(dat$Fold.Change<0)]
        gene.symbols.total <- dat$Gene.Symbol

length(list.L1000.up)
[1] 33132



> head(L1000.symbols)
[1] "A1CF"   "A2M"    "A4GALT" "A4GNT"  "AAAS"   "AACS"  

> length(L1000.symbols)
[1] 12716

> list.L1000.up[1]
$`CPC001_HA1E_24H-hemado-10.0`
 [1] "GABBR2"   "RPL23AP7" "APLP1"    "ANGPT2"   "IL15"     "TXNRD1"   "KBTBD11"  "CYP39A1" 
 [9] "RBKS"     "PTMA"     "DMWD"     "ITGB1BP1" "YME1L1"   "APLNR"    "CDK1"     "ETF1"    
[17] "KCNK7"    "AHCYL2"   "ALDH1L1"  "NRXN2"    "NPRL3"    "ITGA2B"   "IGHMBP2"  "FAM125B" 
[25] "SFRP1"    "ABAT"     "FZR1"     "TCEAL4"   "NRIP3"    "CA6"      "EXOSC4"   "RPL18"   
[33] "IMPG1"    "LGI2"     "MBNL1"    "CAST"     "LIMCH1"   "ABCA1"    "KCND3"    "EPHX2"   
[41] "KATNA1"   "CDC42"    "FKBP1A"   "ASPN"     "HSPG2"    "LMX1B"    "MAB21L1"  "TADA3"   
[49] "LCN2"     "PRR16"    "EIF3G"    "DNMT1"    "RNF167"   "SNTB2"    "LGR4"     "TSTA3"   
[57] "ITGB6"    "ITSN1"    "MLNR"     "BAZ2A"    "ATP2A3"   "TAPBPL"   "SNX11"    "CDC14B"  
[65] "HSPA13"   "GRM8"     "ENPP1"    "CCDC53"   "APOF"     "CTNNAL1"  "JAK3"     "CEP76"   
[73] "PCBD1"    "IER3"    



> mat <- matrix(c(length(L1000.symbols)-length(union(genes.up,list.L1000.up[[1]])),
+                 length(setdiff(genes.up, list.L1000.up[1])),
+                 length(setdiff(list.L1000.up[[1]],genes.up)),
+                 length(intersect(genes.up,list.L1000.up[[1]]))
+ ),
+ nrow=2)
> head(mat)
      [,1] [,2]
[1,] 12633   73
[2,]    10    1
> rownames(mat) <- c("NotInDrugExperiment","InDrugExperiment")
> colnames(mat) <- c("NotInUpregulatedGenes","InUpregulatedGenes")

 head(mat)
                    NotInUpregulatedGenes InUpregulatedGenes
NotInDrugExperiment                 12633                 73
InDrugExperiment                       10                  1


> fr <- fisher.test(mat,alternative = "greater")
> head(fr$p.value)
[1] 0.06220236

ADD REPLY • link 7.4 years ago by svlachavas ▴ 790

score 2 · Answer 1 · 2018-02-21

2

Entering edit mode

7.4 years ago

Carlo Yague 9.0k

Thank you for your detailed comment.

1) The strategy used to build your matrix is perfectly fine in theory. But in practice, your numbers do not add up to the total:

sum(mat) #12717
length(L1000.symbols) #12716
sum(mat)==length(L1000.symbols)
[1] FALSE

This probably means that there is at least one symbol in either genes.up or list.L1000.up[[1]] that is not in the L1000.symbols vector. You can check that comparing.

length(genes.up)
length(intersect(genes.up,L1000.symbols))
genes.up[which(! genes.up %in% L1000.symbols)]

length(list.L1000.up[[1]])
length(intersect(list.L1000.up[[1]],L1000.symbols))
list.L1000.up[[1]][which(! list.L1000.up[[1]] %in% L1000.symbols)]

2) In my opinion, both alternatives are ok, as long as you are clear about it. Don't forget to use some kind of p-value correction for multiple testing (sorry for re-iterating, but this is really important).

ADD COMMENT • link 7.4 years ago by Carlo Yague 9.0k

0

Entering edit mode

Dear Carlo,

thank you for your updated answer and considerations. Because with the number discrepancy you mentioned something very crusial:

concerning the background universe that i used-that is the L1000.symbols-these, are the total genes used by the L1000 technology (Cmap improvement) to be considered as the total universe of genes. However, it is very possible, that some of the input DE gene symbols of the user DE list for example, that could not be included/mapped in this specific background universe. In your opinion, it is better and more appropriate, for each DE list input, to merge/or perform a union with the already defined background genes vector, to define each time a "more" appropriate background universe ?

ADD REPLY • link 7.4 years ago by svlachavas ▴ 790

1

Entering edit mode

This is what I would do (do the same for genes.down):

genes.up <- dat$Gene.Symbol[which(dat$Fold.Change>0 & dat$Gene.Symbol %in% L1000.symbols)]

if(length(genes.up) != sum(dat$Fold.Change>0)){
    print(paste("WARNING: only", length(genes.up), "out of", sum(dat$Fold.Change>0), "symbols are matched."))

ADD REPLY • link 7.4 years ago by Carlo Yague 9.0k

0

Entering edit mode

Thank you Carlo for your confirmation and code suggested !!

at the start, i thought to define a total universe vector, using the total DE genes with the L1000 symbols-however, the separate up and down approach now seems more practical and appropriate, due to different fisher's exact tests.

One last comment-question that i would like your opinion:

based the issue of multiple testing correction, do you think that there is a reliable approach concerning the merging of the two fisher exact test p-values, into a unified p-value, for the total experiment, and then mutliple testing correction on the aggregated p-value ?

and if so, which methodology i should use ? for instance, fisher's method of aggregating p-values, or i could more simply compute their product ?

ie. combo.pval = up.pval X down.pval- ((up.pval Xdown.pval)X (log( up.valXdown.pval)) ?

ADD REPLY • link 7.4 years ago by svlachavas ▴ 790

1

Entering edit mode

As you said, the fisher's exact tests are distinct for the up and down-regulated genes. So likewise, the p-value correction should be applied separately. In my opinion, aggregating the pvalues doesn't make much sense because the gene lists (up and down) are different.

ADD REPLY • link 7.4 years ago by Carlo Yague 9.0k