Question

gseKEGG with Streptomyces coelicolor - No gene can be mapped

13 months ago

r.evans2 ▴ 10

I am trying to use the gseKEGG function in the R package clusterProfiler with the KEGG database entries for Streptomyces coelicolor.

The following call confirms that KEGG supports this organism under the kegg_code 'sco'.

search_kegg_organism('sco', by='kegg_code')

returns

     kegg_code scientific_name                                   common_name
2810 scon      Streptococcus constellatus subsp. pharyngis C232  <NA>
2811 scos      Streptococcus constellatus subsp. pharyngis C818  <NA>
3495 sco       Streptomyces coelicolor                           <NA>

Examining the gene listing for genome T00085 at https://www.genome.jp/dbget-bin/get_linkdb?-t+genes+gn:T00085 seems to confirm that gene IDs in the KEGG database use the widely used SCOnnnn format, which happily is the format I use in my datasets. The first few lines of the listing are reproduced below:

sco:SCO0001 no KO assigned | (RefSeq) SCEND.02c; hypothetical protein

sco:SCO0002 no KO assigned | (RefSeq) SC8E7.42c, SCEND.01c, SCJ24.01c; hypothetical protein

sco:SCO0003 no KO assigned | (RefSeq) SC8E7.41c; DNA-binding protein

sco:SCO0004 no KO assigned | (RefSeq) SC1C9.01, SC8E7.40c; hypothetical protein

sco:SCO0005 no KO assigned | (RefSeq) SC1C9.02; transposase

sco:SCO0006 no KO assigned | (RefSeq) SC1C9.03, SCJ30.01; ATP/GTP-binding protein

sco:SCO0007 no KO assigned | (RefSeq) SCJ30.02c; hypothetical protein

sco:SCO0008 no KO assigned | (RefSeq) SCJ30.03c; hypothetical protein

sco:SCO0009 no KO assigned | (RefSeq) SCJ30.04c; hypothetical protein

sco:SCO0010 no KO assigned | (RefSeq) SCJ30.05; hypothetical protein

sco:SCO0011 no KO assigned | (RefSeq) SCJ30.06c; hypothetical protein

sco:SCO0012 no KO assigned | (RefSeq) SCJ30.07c; hypothetical protein

sco:SCO0013 no KO assigned | (RefSeq) SCJ30.09c; hypothetical protein

sco:SCO0014 no KO assigned | (RefSeq) SCJ30.10c; hypothetical protein

sco:SCO0015 K03313 Na+:H+ antiporter, NhaA family | (RefSeq) SCJ30.11c; Na+/H+ antiporter
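To make the ID format explicit, here is a minimal base R sketch that splits listing lines like the ones above into a gene ID and a KO number. The three input lines are copied from the listing; the variable names (kegg_lines, parsed) and the regular expressions are my own illustration and assume every line starts with the "sco:" prefix.

```r
# Three lines copied from the KEGG listing above (the full listing is much longer).
kegg_lines <- c(
  "sco:SCO0013 no KO assigned | (RefSeq) SCJ30.09c; hypothetical protein",
  "sco:SCO0014 no KO assigned | (RefSeq) SCJ30.10c; hypothetical protein",
  "sco:SCO0015 K03313 Na+:H+ antiporter, NhaA family | (RefSeq) SCJ30.11c; Na+/H+ antiporter"
)

# Gene ID: the text between the "sco:" prefix and the first whitespace.
gene_id <- sub("^sco:(\\S+)\\s.*$", "\\1", kegg_lines, perl = TRUE)

# KO number: "K" plus five digits directly after the gene ID, otherwise NA.
has_ko <- grepl("^sco:\\S+\\s+K\\d{5}\\b", kegg_lines, perl = TRUE)
ko <- ifelse(has_ko,
             sub("^sco:\\S+\\s+(K\\d{5})\\b.*$", "\\1", kegg_lines, perl = TRUE),
             NA_character_)

parsed <- data.frame(gene_id = gene_id, ko = ko, stringsAsFactors = FALSE)
print(parsed)
```

Note that the listing carries the organism prefix ("sco:SCO0015"), while my geneList names use the bare locus tag ("SCO0015").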

However, the following code does not work (a minimal example to demonstrate the problem; my real geneList has thousands of genes).

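library(clusterProfiler)

# Ranked statistic named by SCO locus tag, sorted in decreasing order for gseKEGG
geneList <- c(0.5, 0.1, 1)
names(geneList) <- c('SCO0015', 'SCO0033', 'SCO0039')
geneList <- sort(geneList, decreasing = TRUE)

kk2 <- gseKEGG(geneList = geneList, organism = 'sco',
               minGSSize = 1, pvalueCutoff = 1, verbose = FALSE)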

It generates the error

--> Expected input gene ID:
Error in check_gene_id(geneList, geneSets) :
  --> No gene can be mapped....

To me this suggests that it cannot find any of the three genes in the geneList in the database. I get the same error with a geneList of 7000+ genes. The three genes in the example were chosen because I know they have a KO number (Knnnnn) mapped to them in the KEGG database, e.g. SCO0015 maps to K03313 in the database extract above.
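One way to test this assumption directly is to compare the geneList names against a set of reference IDs before calling gseKEGG. The sketch below is only illustrative: check_id_overlap is a hypothetical helper, not part of clusterProfiler, and reference_ids is a tiny placeholder built from two IDs in the listing above rather than the real database contents.

```r
# Hypothetical helper: report which names of a ranked gene list occur in a reference ID set.
check_id_overlap <- function(gene_list, reference_ids) {
  ids <- names(gene_list)
  in_ref <- ids %in% reference_ids
  list(
    mapped = ids[in_ref],
    unmapped = ids[!in_ref],
    fraction_mapped = mean(in_ref)
  )
}

geneList <- c(SCO0015 = 0.5, SCO0033 = 0.1, SCO0039 = 1)
geneList <- sort(geneList, decreasing = TRUE)

# Placeholder reference set for illustration only (two IDs taken from the listing above).
# It says nothing about which genes the real KEGG gene sets contain.
reference_ids <- c("SCO0001", "SCO0015")

overlap <- check_id_overlap(geneList, reference_ids)
print(overlap)
```

With real data, reference_ids would have to come from whatever gene sets gseKEGG actually downloads. I assume clusterProfiler's download_KEGG('sco') exposes these IDs, but I have not verified what it returns for this organism.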

Any ideas what I am doing wrong / how I can resolve this?

Thanks

streptomyces KEGG clusterProfiler
