I am trying to use the gseKEGG function in the R package clusterProfiler with the KEGG database entries for Streptomyces coelicolor.
The following call confirms that KEGG supports this organism, with a kegg_code of 'sco':
search_kegg_organism('sco', by='kegg_code')
returns
     kegg_code                                  scientific_name common_name
2810      scon Streptococcus constellatus subsp. pharyngis C232        <NA>
2811      scos Streptococcus constellatus subsp. pharyngis C818        <NA>
3495       sco                          Streptomyces coelicolor        <NA>
The gene listing for genome T00085 at https://www.genome.jp/dbget-bin/get_linkdb?-t+genes+gn:T00085 seems to confirm that gene IDs in the KEGG database use the widely used SCOnnnn format, which happily is the format I use in my datasets. The first few lines of the listing are reproduced below:
sco:SCO0001 no KO assigned | (RefSeq) SCEND.02c; hypothetical protein
sco:SCO0002 no KO assigned | (RefSeq) SC8E7.42c, SCEND.01c, SCJ24.01c; hypothetical protein
sco:SCO0003 no KO assigned | (RefSeq) SC8E7.41c; DNA-binding protein
sco:SCO0004 no KO assigned | (RefSeq) SC1C9.01, SC8E7.40c; hypothetical protein
sco:SCO0005 no KO assigned | (RefSeq) SC1C9.02; transposase
sco:SCO0006 no KO assigned | (RefSeq) SC1C9.03, SCJ30.01; ATP/GTP-binding protein
sco:SCO0007 no KO assigned | (RefSeq) SCJ30.02c; hypothetical protein
sco:SCO0008 no KO assigned | (RefSeq) SCJ30.03c; hypothetical protein
sco:SCO0009 no KO assigned | (RefSeq) SCJ30.04c; hypothetical protein
sco:SCO0010 no KO assigned | (RefSeq) SCJ30.05; hypothetical protein
sco:SCO0011 no KO assigned | (RefSeq) SCJ30.06c; hypothetical protein
sco:SCO0012 no KO assigned | (RefSeq) SCJ30.07c; hypothetical protein
sco:SCO0013 no KO assigned | (RefSeq) SCJ30.09c; hypothetical protein
sco:SCO0014 no KO assigned | (RefSeq) SCJ30.10c; hypothetical protein
sco:SCO0015 K03313 Na+:H+ antiporter, NhaA family | (RefSeq) SCJ30.11c; Na+/H+ antiporter
However, the following code does not work (it is a minimal example to demonstrate the problem; my real geneList has thousands of genes):
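library(clusterProfiler)
# Toy ranked gene list: names are SCO locus tags, values are the ranking statistic
geneList <- c(0.5, 0.1, 1)
names(geneList) <- c('SCO0015', 'SCO0033', 'SCO0039')
# gseKEGG expects the list sorted in decreasing order
geneList <- sort(geneList, decreasing = TRUE)
kk2 <- gseKEGG(geneList = geneList, organism = 'sco',
               minGSSize = 1, pvalueCutoff = 1, verbose = FALSE)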
It generates the error
--> Expected input gene ID:
Error in check_gene_id(geneList, geneSets) :
  --> No gene can be mapped....
which suggests to me that gseKEGG cannot find any of the three genes from geneList in the database. I get the same error with a geneList of 7000+ genes. I chose the three genes in the example because I know they have a KO identifier (Knnnnn) assigned in the KEGG database, e.g. SCO0015 maps to K03313 in the extract above.
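To narrow this down, here is a minimal sketch of the ID check I have in mind. check_id_overlap is a hypothetical helper of my own, not a clusterProfiler function, and demo_ids is a hand-copied stand-in for the real KEGG gene IDs. The commented-out lines at the end assume that download_KEGG() returns a data frame KEGGPATHID2EXTID whose second column holds gene IDs, which I have not verified for this organism.

```r
# Hypothetical helper (not part of clusterProfiler): report how many of the
# names in a ranked gene list appear in a vector of KEGG gene IDs.
check_id_overlap <- function(gene_list, kegg_ids) {
  # Strip an organism prefix such as "sco:" if present
  kegg_ids <- unique(sub("^[a-z]+:", "", kegg_ids))
  found <- names(gene_list) %in% kegg_ids
  list(n_mapped = sum(found),
       mapped   = names(gene_list)[found],
       unmapped = names(gene_list)[!found])
}

geneList <- sort(c(SCO0015 = 0.5, SCO0033 = 0.1, SCO0039 = 1),
                 decreasing = TRUE)

# Offline demonstration with IDs copied from the KEGG listing above
demo_ids <- c("sco:SCO0014", "sco:SCO0015")
res <- check_id_overlap(geneList, demo_ids)
print(res$n_mapped)

# With network access, the same check against what clusterProfiler downloads
# (assumption: second column of KEGGPATHID2EXTID contains the gene IDs):
# kegg <- clusterProfiler::download_KEGG("sco")
# check_id_overlap(geneList, kegg$KEGGPATHID2EXTID[, 2])
```

If the real download reports zero mapped genes, that would at least tell me whether the mismatch is in the IDs themselves or in what gseKEGG retrieves for 'sco'.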
Any ideas what I am doing wrong / how I can resolve this?
Thanks