Question

No genes mapped in clusterprofiler gseGO

0

Entering edit mode

4 months ago

jon50250 • 0

Hello! I'm having issues generating an adequate geneList for running gseGO in clusterProfiler, using keytype = "GO"

Similar issues have been described here: No gene mapped

gseGO code is:

gse <- gseGO(geneList = gene_List, 
         ont = "ALL", #ont one of “BP”, “MF”, “CC” or “ALL”
         OrgDb = OrgDb,
         minGSSize = 3, # minimum number of genes in set (gene sets with lower than this many genes in your dataset will be ignored).
         maxGSSize = 800, #maximum number of genes in set (gene sets with greater than this many genes in your dataset will be ignored).      
         keyType = "GO",
         exponent = 1,
         pvalueCutoff = 0.05, 
         pAdjustMethod = "BH", #one of “holm”, “hochberg”, “hommel”, “bonferroni”, “BH”, “BY”, “fdr”, “none”
         verbose = TRUE, 
         )

and error is:

    preparing geneSet collections...
 --> Expected input gene ID: GO:2000274,GO:0031534,GO:0045503,GO:0008081,GO:0009415,GO:0004386
    Error in check_gene_id(geneList, geneSets) : 
      --> No gene can be mapped....

Gene list code and output is:

Read in differential expression result table - need foldchange values here for gsea
data = read.table("res.df.txt", 
                  header = TRUE,
                  row.names = 1,
                  #res.df.txt
                  #M.DEGs.res.df.txt
                  #F.DEGs.res.df.txt
                  )
data <- dplyr::select(data, c('log2FoldChange')) %>%
  arrange(desc(log2FoldChange)) ## sort the list in decreasing order (required for clusterProfiler)
write.csv(data, file = "res.L2FC.csv") 
#making geneList
#need to get KEGG ID and LFC in same df for geneList?
#Setting rownames of res file to first column as ID
data <- rownames_to_column(data, "ID")
#leftjoin go and res dfs by geneID
go.l2fc <- left_join(EggNOG.GO.gIDs, data)
#remove rows with NO L2FC value - the results shows all genes, some of which have GO terms 
#but no significance for the deseq2 dataset <- https://stackoverflow.com/questions/4862178/remove-rows-with-all-or-some-nas-missing-values-in-data-frame
go.l2fc <- na.omit(go.l2fc)
#have multiple geneIDs/LFC now so need to figure out how to get TOP $l2fc/lfc/etc. result for each GO ID
go.l2fc <- go.l2fc[!duplicated(go.l2fc[1]),]
colnames(go.l2fc)[2] <- "GOs"
#adding random rank to same l2fc values (for geneIDs) for unique K#s that come from different transcriptIDs
go.l2fc <- go.l2fc %>% mutate(rank = rank(log2FoldChange, ties.method = "random")) %>% dplyr::arrange(desc(rank))

## assume 1st column is ID
## 2nd column is FC
## feature 1: numeric vector (L2FC value)
geneList = go.l2fc[,3]
## feature 2: named vector (GO terms or geneIDs??)
names(geneList) = as.character(go.l2fc[,2])
#feature 2.5: Adding geneNames to middle column?
## feature 3: decreasing order
geneList = sort(geneList, decreasing = TRUE)

and the geneList/go.l2fc files look like this (see screenshots)

geneList

go.l2fc

Thank you for all the help!

R clusterProfiler gseGO GSEA • 734 views

ADD COMMENT • link 4 months ago by jon50250 • 0

0

Entering edit mode

It's very unclear why you are using GO terms as gene ids. This should not be the case at all.

ADD REPLY • link 4 months ago by jv ★ 1.8k

0

Entering edit mode

The error says the expected input geneID is GO terms though?

Expected input gene ID: GO:2000274,GO:0031534,GO:0045503,GO:0008081,GO:0009415,GO:0004386

I remade the geneList with my geneIDs and reran. Still got the same error. From the manual, I NEED to have entrezIDs for my GO terms then?

Maybe I need to run enricher fucntion then? from this post: Clusterprofiler error: No gene can be mapped

ADD REPLY • link 4 months ago by jon50250 • 0

1

Entering edit mode

As far as I can tell the default GO term mapping in Bioconductor Org.eg.db is to Entrez ids. If you use gseGO you are locked into using an Org.eg.db, so yes it looks like you need Entrez ids, but note the ranked gene list should be a named numerical vector with the value used to rank the genes where names are the gene names (this is not well documented in the clusterProfiler manual).

For more flexibility regarding gene ids you can use the generic clusterProfiler::GSEA function where you provide the TERM2GENE and TERM2NAME mappings for your gene sets. I can provide some example code for this but don't have time at this exact moment.

ADD REPLY • link 4 months ago by jv ★ 1.8k

0

Entering edit mode

Thank you! I can convert my gene Ids (g####) to entrez IDs (I think). I ran EggNOG on my genes for annotation - there may be a way to extract entrezIDs from this output. Would the output look like what I have here though, just with entrezIDs? For my geneList, I would need Names to be the entrezIDs associated with each set of GO terms and a L2FC value to rank the genes?

Will the multiple GO:terms-per-gene be suitable for gseGO too? I ran gseKEGG with a single Knumber and L2FC value (see screenshot), even though my genes were annotated with many K numbers.

I would really appreciate some code when you have the time, tysm

enter image description here

ADD REPLY • link 4 months ago by jon50250 • 0

2

Entering edit mode

Please keep your follow up questions to a minimum, it is overwhelming to try to answer so many questions.

DO NOT use log2 fold change alone for the ranking, you should take statistical significance into account. I use DESeq2 for DE testing and rank genes by the stat metric returned from DESeq test. Other option are described at https://www.gsea-msigdb.org/gsea/doc/GSEAUserGuideFrame.html?Run_GSEA_Page

Will the multiple GO:terms-per-gene be suitable for gseGO too

most genes will map to more than one go term, the suitability of this is debated in the field but it is not something you need to account for in your code (i.e. you shouldn't need to filter)

Here is some example code I can provide:

# download MSigDB hallmark gene sets
hallmark_gs <- msigdbr::msigdbr(species = "Mus musculus", category = "H")
hallmark_gs$gs_name <- stringr::str_replace(hallmark_gs$gs_name, "HALLMARK_", "")
# format gene set for clusterProfiler use
hallmark_paths <- list("TERM2GENE" = dplyr::select(hallmark_gs, gs_id, ensembl_gene), 
                      "TERM2NAME" = dplyr::select(hallmark_gs, gs_id, gs_name))
colnames(hallmark_paths$TERM2GENE) <- c("from", "to")
colnames(hallmark_paths$TERM2NAME) <- c("from", "to")

df <- arrange(df, desc(stat)) # where df is deseq2 results
ranked_genes <- df$stat
names(ranked_genes) <- df$ensembl_gene_id
set.seed(1234) # set for reproducible results
out <- GSEA(geneList = ranked_genes, 
pvalueCutoff = 1, 
pAdjustMethod = "BH", 
verbose = FALSE,
seed = TRUE,
eps = 0,
TERM2GENE = hallmark_paths$TERM2GENE, 
TERM2NAME = hallmark_paths$TERM2NAME)

ADD REPLY • link 4 months ago by jv ★ 1.8k

0

Entering edit mode

Awesome, thanks! I will use stat to rank my genes. I will not spam in the future, apologies!

I have the GO:terms, but I need to generate TERM2NAME (I see this is optional for GSEA()). I will checkout this thread: https://support.bioconductor.org/p/9144621/

ADD REPLY • link 4 months ago by jon50250 • 0