1, obtain input genes

Question

plotting a dotplot with topGO output

2

Entering edit mode

4.7 years ago

rachid.boumlik ▴ 50

Hi,

I have perfomed GO analysis using topGO package and was wondering if there is a way to use ClusterProfiler or any other package to plot a dotplot of the enriched GO terms same as in ClusterProfiler output. Many thanks!

RNA-Seq GO topGO dotplot clusterProfiler • 19k views

ADD COMMENT • link updated 3.5 years ago by svlachavas ▴ 790 • written 4.7 years ago by rachid.boumlik ▴ 50

0

Entering edit mode

Hi, Do you have enrichment data in a tabular or something format? Not dotplots, but with Revigo you can make quick scatter, network and treemap plots. WARNING: it uses flash to plot in web browser (flash will die soon in chrome)! Another (nicer) tool is GOplot. But in this case you need basic R programming knowledge, like ClusterProfiler. Now, deppending on the format of results you have, you can simply make dotplots in excel and R. Let us see an example of your data format and also an example of the actual ClusterProfile plot you refered to

ADD REPLY • link 4.7 years ago by Arsenal ▴ 160

1

Entering edit mode

Hi Arsenal Thank you for your suggetions. I am talking about ploting a dotplot like this: enter image description here

ADD REPLY • link 4.7 years ago by rachid.boumlik ▴ 50

1

Entering edit mode

the data I got from topGO is basically a table showing top10 enriched GO tems with number of annotated genes belonging to each term in the initial gene list, the number of significant genes in the list, the number of expeced genes in the list and the p-value of fisher exact test. GO.ID Term Annotated Significant Expected classicFisher GO:0009834 plant-type secondary cell wall biogenesi... 12 12 4.53 8.10E-06 GO:0009768 photosynthesis, light harvesting in phot... 13 12 4.91 6.90E-05 GO:0018298 protein-chromophore linkage 13 12 4.91 6.90E-05 GO:0015979 photosynthesis 69 50 26.06 0.00015 GO:0052546 cell wall pectin metabolic process 14 12 5.29 0.00032 GO:0009251 glucan catabolic process 11 10 4.16 0.00042 GO:0007018 microtubule-based movement 11 10 4.16 0.00042 GO:0017144 drug metabolic process 149 81 56.28 0.0007 GO:0055085 transmembrane transport 258 126 97.46 0.0012 GO:0006073 cellular glucan metabolic process 38 24 14.35 0.00124

ADD REPLY • link 4.7 years ago by rachid.boumlik ▴ 50

0

Entering edit mode

Well, If you need a simple plot like this, it is easily doable in R. If you send me the table I can show you a basic R script so you can do it yourself whenever you want.

ADD REPLY • link 4.7 years ago by Arsenal ▴ 160

1

Entering edit mode

the data I got from topGO which i wanna plot in a dotplot looks somrthing like this: GO.ID Term Annotated Significant Expected classicFisher GO:0009834 plant-type secondary cell wall biogenesi... 12 12 4.53 8.10E-06
GO:0009768 photosynthesis, light harvesting in phot... 13 12 4.91 6.90E-05 GO:0018298 protein-chromophore linkage 13 12 4.91 6.90E-05 GO:0015979 photosynthesis 69 50 26.06 0.00015 GO:0052546 cell wall pectin metabolic process 14 12 5.29 0.00032

Normally it would nice if I can plot the gene ratio which i think is the ratio of significant/Annotated p-value would be the classicFisher, but I m not sure what could be used for the gene count (maybe the significant ?).

Thank you for your help!

ADD REPLY • link 4.7 years ago by rachid.boumlik ▴ 50

0

Entering edit mode

I guess the gene count is the sum of significant genes assigned to a specific GO term. I'll write a generic R script where you only need to input your tabular results. Have you tried ClusterProfile yet?

ADD REPLY • link 4.7 years ago by Arsenal ▴ 160

0

Entering edit mode

No I haven't tried it yet.

ADD REPLY • link 4.7 years ago by rachid.boumlik ▴ 50

0

Entering edit mode

Thank you for the beautiful piece of code. I was wondering if there was a way to include BP, CC, and MF all in one dotplot graph as opposed to 3 different graphs. How would I do that?

ADD REPLY • link 4.2 years ago by Osman ▴ 10

0

Entering edit mode

Hey Osman, you are welcome. For that, I would generate a separate goEnrichment table for each of CC, BP, and MF, and then just row-bind these (rbind()) before using in ggplot2.

To use CC and MF, one just needs to change these:

allGO2genes <- annFUN.org(
  whichOnto = "BP",
  ...)
GOdata <- new(
  "topGOdata",
  ontology = "BP", # also available are 'CC' and 'MF'
  ...)

ADD REPLY • link 4.2 years ago by Kevin Blighe 89k

1

Entering edit mode

Thank you so much for your quick response! I tried to include your suggestions onto the code by I get the following error "Error in rep(argl[[na]], length.out = nr) : attempt to replicate an object of type 'S4' " . Also, I want to apply this to single cell data and can't think of a way to visualize all the GOterms for each cell type in one figure. My code blew is what I currently have it is for only 1 cell type. Again, thank you so much for your help.

clusterL2_3_IT <- subset(experiment.aggregate, idents = 'L2_3_IT')
expr <- as.matrix(GetAssayData(clusterL2_3_IT))

#cluster0 <- subset(experiment.aggregate, idents = '0')
#expr <- as.matrix(GetAssayData(cluster0))
# Select genes that are expressed > 0 in at least 75% of cells (somewhat arbitrary definition)
n.gt.0 <- apply(expr, 1, function(x)length(which(x > 0)))
expressed.genes <- rownames(expr)[which(n.gt.0/ncol(expr) >= 0.75)]
all.genes <- rownames(expr)

# define geneList as 1 if gene is in expressed.genes, 0 otherwise
geneList <- ifelse(all.genes %in% expressed.genes, 1, 0)
names(geneList) <- all.genes

# Create topGOdata object
GOdataBP <- new("topGOdata",
    ontology = "BP", # use biological process ontology
    allGenes = geneList,
    geneSelectionFun = function(x)(x == 1),
  annot = annFUN.org, mapping = "org.Mm.eg.db", ID = "symbol")
GOdataCC <- new("topGOdata",
    ontology = "CC", # use biological process ontology
    allGenes = geneList,
    geneSelectionFun = function(x)(x == 1),
  annot = annFUN.org, mapping = "org.Mm.eg.db", ID = "symbol")
GOdataMF <- new("topGOdata",
        ontology = "MF", # use biological process ontology
        allGenes = geneList,
        geneSelectionFun = function(x)(x == 1),
    annot = annFUN.org, mapping = "org.Mm.eg.db", ID = "symbol")
GOdata <- rbind(GOdataBP,GOdataCC,GOdataMF)

# Test for enrichment using Fisher's Exact Test
    resultFisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
    GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 60)
#Visualize GoTerms

results.fisher <- runTest(GOdata, algorithm = "elim", statistic = "fisher")
GenTable(GOdata, Fisher = resultFisher, topNodes = 20, numChar = 50)
goEnrichment <- GenTable(
  GOdata,
  Fisher = resultFisher,
  orderBy = "Fisher",
  topNodes = 20,
  numChar = 50)
goEnrichment$Fisher <- as.numeric(goEnrichment$Fisher)
goEnrichment <- goEnrichment[goEnrichment$Fisher < 0.05,] # filter terms for Fisher p<0.05
goEnrichment <- goEnrichment[,c("GO.ID","Term","Fisher")]
goEnrichment

require(ggplot2)
library(scales)

ntop <- 20
ggdata <- goEnrichment[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) # fixes order
ggplot(ggdata,
  aes(x = Term, y = -log10(Fisher), size = -log10(Fisher), fill = -log10(Fisher))) +

  expand_limits(y = 1) +
  geom_point(shape = 21) +
  scale_size(range = c(2.5,12.5)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +

  xlab('') + ylab('Enrichment score') +
  labs(
    title = 'GO Analysis',
    #subtitle = 'Top 50 terms ordered by Kolmogorov-Smirnov p-value',
    subtitle = 'Top 20 terms ordered by Fisher Exact p-value',
    caption = 'Cut-off lines drawn at equivalents of p=0.05, p=0.01, p=0.001') +

  geom_hline(yintercept = c(-log10(0.05), -log10(0.01), -log10(0.001)),
    linetype = c("dotted", "longdash", "solid"),
    colour = c("black", "black", "black"),
    size = c(0.5, 1.5, 3)) +

  theme_bw(base_size = 24) +
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 16, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 14, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),

    axis.text.x = element_text(angle = 0, size = 12, face = 'bold', hjust = 1.10),
    axis.text.y = element_text(angle = 0, size = 12, face = 'bold', vjust = 0.5),
    axis.title = element_text(size = 12, face = 'bold'),
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    #Legend
    legend.key = element_blank(), # removes the border
    legend.key.size = unit(1, "cm"), # Sets overall area/size of the legend
    legend.text = element_text(size = 14, face = "bold"), # Text size
    title = element_text(size = 14, face = "bold")) +

  coord_flip()
ggplot2::ggsave("L2_3_IT_GOTerms_P30_Cortex_Fisher.pdf",
                device = NULL,
                height = 8.5,
                width = 12)

ADD REPLY • link updated 4.2 years ago by Kevin Blighe 89k • written 4.2 years ago by Osman ▴ 10

0

Entering edit mode

Hi Osman, from here, I cannot see which line of code is producing the error. Given how I have limited time each day, I therefore cannot go through this line by line trying to debug it. I suggest that you identify the line [of code] that is producing the error, learn how the particular function in question (that is, the one that's producing the error) inputs and uses data, and then work back a few steps before you identify what is causing the error.

For single cell data, I am not sure... I would be interested in seeing enrichment per cluster.

ADD REPLY • link 4.2 years ago by Kevin Blighe 89k

0

Entering edit mode

Fair enough Kevin, I understand. For the single cell data, I would like to see enrichment per cluster (cell type) also. Thats what I am trying to make a figure for.

ADD REPLY • link 4.2 years ago by Osman ▴ 10

score 11 · Answer 1 · 2021-01-30

11

Entering edit mode

4.5 years ago

Kevin Blighe 89k

Here's one 'on the house', in part copied from 2 previous answers that I have given on this topic:

1, obtain input genes

rankedGenes <- c(1:20)
names(rankedGenes) <- c("BRCA1","BRCA2","TP53", "ATM", "ERCC1",
  "BRCC3", "TP63", "ERCC2", "ERCC3", "ERCC4", "TP63",
  "S100A2", "CXCL2", "CXCL3", "MS4A1", "MS4A2", "VCAM1",
  "VSX1", "CD8", "S100A2")

rankedGenes
 BRCA1  BRCA2   TP53    ATM  ERCC1  BRCC3   TP63  ERCC2  ERCC3  ERCC4   TP63 
     1      2      3      4      5      6      7      8      9     10     11 
S100A2  CXCL2  CXCL3  MS4A1  MS4A2  VCAM1   VSX1    CD8 S100A2 
    12     13     14     15     16     17     18     19     20

So, in my genes, BRCA1 is ranked highest... S100A2 is lowest.

2, set-up the enrichment object based on your genes

require(topGO)
require(org.Hs.eg.db)

selection <- function(x) TRUE
allGO2genes <- annFUN.org(
  whichOnto = "BP",
  feasibleGenes = NULL,
  mapping = "org.Hs.eg.db",
  ID = "symbol")
GOdata <- new(
  "topGOdata",
  ontology = "BP", # also available are 'CC' and 'MF'
  allGenes = rankedGenes,
  annot = annFUN.GO2genes,
  GO2genes = allGO2genes,
  geneSel = selection,
  nodeSize = 5)

3, perform gene enrichment.

Here, we make use of the rank information by using the Kolmogorov-Smirnov (K-S) test on the enriched terms

results.ks <- runTest(GOdata, algorithm = "classic", statistic = "ks")
goEnrichment <- GenTable(
  GOdata,
  KS = results.ks,
  orderBy = "KS",
  topNodes = 100,
  numChar = 99)
goEnrichment$KS <- as.numeric(goEnrichment$KS)
goEnrichment <- goEnrichment[goEnrichment$KS < 0.05,] # filter terms for KS p<0.05
goEnrichment <- goEnrichment[,c("GO.ID","Term","KS")]
goEnrichment

4, generate a plot of the enrichment results

require(ggplot2)
library(scales)

ntop <- 30
ggdata <- goEnrichment[1:ntop,]
ggdata$Term <- factor(ggdata$Term, levels = rev(ggdata$Term)) # fixes order
gg1 <- ggplot(ggdata,
  aes(x = Term, y = -log10(KS), size = -log10(KS), fill = -log10(KS))) +

  expand_limits(y = 1) +
  geom_point(shape = 21) +
  scale_size(range = c(2.5,12.5)) +
  scale_fill_continuous(low = 'royalblue', high = 'red4') +

  xlab('') + ylab('Enrichment score') +
  labs(
    title = 'GO Biological processes',
    subtitle = 'Top 30 terms ordered by Kolmogorov-Smirnov p-value',
    caption = 'Cut-off lines drawn at equivalents of p=0.05, p=0.01, p=0.001') +

  geom_hline(yintercept = c(-log10(0.05), -log10(0.01), -log10(0.001)),
    linetype = c("dotted", "longdash", "solid"),
    colour = c("black", "black", "black"),
    size = c(0.5, 1.5, 3)) +

  theme_bw(base_size = 24) +
  theme(
    legend.position = 'right',
    legend.background = element_rect(),
    plot.title = element_text(angle = 0, size = 16, face = 'bold', vjust = 1),
    plot.subtitle = element_text(angle = 0, size = 14, face = 'bold', vjust = 1),
    plot.caption = element_text(angle = 0, size = 12, face = 'bold', vjust = 1),

    axis.text.x = element_text(angle = 0, size = 12, face = 'bold', hjust = 1.10),
    axis.text.y = element_text(angle = 0, size = 12, face = 'bold', vjust = 0.5),
    axis.title = element_text(size = 12, face = 'bold'),
    axis.title.x = element_text(size = 12, face = 'bold'),
    axis.title.y = element_text(size = 12, face = 'bold'),
    axis.line = element_line(colour = 'black'),

    #Legend
    legend.key = element_blank(), # removes the border
    legend.key.size = unit(1, "cm"), # Sets overall area/size of the legend
    legend.text = element_text(size = 14, face = "bold"), # Text size
    title = element_text(size = 14, face = "bold")) +

  coord_flip()

ADD COMMENT • link 4.5 years ago by Kevin Blighe 89k

0

Entering edit mode

Result interpretation by Kevin Blighe, Ph.D.:

Sure, in the plot, the values on the x-axis are just the negative log [base 10] (log10) of the p-values from the enrichment test (the test used was Kolmogorov-Smirnov test). We also use negative log10 p-values for, e.g., Manhattan plots in GWAS, and Volcano Plots in differential expression analysis.

The most statistically significant enriched term in this example was 'cellular response to DNA damage stimulus', meaning that the input genes are related to this biological process.

For this example, I purposefully chose genes that I knew are involved in DNA double-strand break repair, plus a few other unrelated immune system genes. - By Kevin Blighe, Ph.D.

ADD REPLY • link 4.5 years ago by gauravj49 • 0

0

Entering edit mode

Thank you!

ADD REPLY • link 4.4 years ago by Kevin Blighe 89k

0

Entering edit mode

thank you a lot for your explanation. I used on my data and it worked fine. is it possible to specify a GO and use it for gene enrichment analysis? and also is it possible to apply it to KEGG pathway analysis? also instead of Kolmogorov-Smirnov (K-S) test , can we use the Fisher's exact test and use for making the dot plot?

ADD REPLY • link 4.0 years ago by najibveto ▴ 130

0

Entering edit mode

Hi, yes, you can modify any part of this analysis - please use this code as a guide. However, some of the variable names may change, for example, if you choose to not use the K-S test, then goEnrichment$KS will be something else

ADD REPLY • link 4.0 years ago by Kevin Blighe 89k

0

Entering edit mode

Thank you a lot, I used and I got figure as fellow : enter image description here

I am curious if it is possible to use multiple test FDR because I didn't find it in TopGo package? second question, i have three strains and and i would like to combine them in same graph using them as X-axis? thanks a lot for your help.

ADD REPLY • link 4.0 years ago by najibveto ▴ 130

0

Entering edit mode

Great!

Regarding p-value adjustment, it is not necessary in all situations. Which statistical test did you use?

Regarding combining in the same graph, you would have to create separate ggdata objects, representing different enrichments for different groups, and then combine these into a single object. There would have to be a separate column for group / enrichment, like this:

Term     p     enrichment
GO:1234  0.04  GroupA
GO:5678  0.1   GroupA
GO:1234  0.01  GroupB
GO:5678  0.05  GroupB

Then, in the ggplot() function, you would need something like:

ggplot(..., aes(..., shape = enrichment)) +

ADD REPLY • link 4.0 years ago by Kevin Blighe 89k

0

Entering edit mode

thank you a lot. I used the Kolmogorov-Smirnov (K-S) test and used KS instead of p value. for creating a separated ggdata objects, there isn't any column for group, how is it possible to add that column and how is it possible to combine the different ggdata? I am sorry for my questions.

ADD REPLY • link 4.0 years ago by najibveto ▴ 130

0

Entering edit mode

Dear Kevin Blighe,

thank you one more time for your very nice and comprehensive vignette regarding functional enrichment analysis; I have one particular question regarding the selection of appropriate ranking statistics in GSEA analysis, as you illustrate from above:

briefly, based on a current project I have developed a computational score to rank mutations from WES data for each sequenced patient, and collectively profiled clinical subgroups, which ranges from 0 to 1; thus, my ultimate idea, was to run enrichment analysis for each patient gene mutational list, ranked but only removing genes with 0 score, so not conferring any impact to the perturbed phenotype; My crucial questions are:

As I would like to use either GO or REACTOME pathways with gene-set enrichment analysis, to take into account the score ranking, you think it is feasible? Or my notion is incorrect, and the ranking would make sense, if my two extremes-high ranked or low ranked-would contribute significantly? Like ordering a list based on fold-changes? And in my case that I have a lowly ranked gene, GSEA is not feasible and would be erroneous for my scenario?

Best,

Efstathios

ADD REPLY • link 3.5 years ago by svlachavas ▴ 790