I have a list of genes and I want to know which metabolites are assigned to those genes. is there anyone who can tell me what are the options? I prefer to have access to a database rather than web app so feel free to comment on any package or script etc
Based on this, with some modifications:
library(KEGGREST) library(org.Hs.eg.db) library(annotate) ## Get enzyme-gene annotations res1 = keggLink("enzyme", "hsa") tmpDF1 = data.frame(ec = res1, gene = names(res1)) ## Get compound-enzyme annotations res2 = keggLink("compound", "enzyme") tmpDF2 = data.frame(cpd = res2, ec = names(res2)) ## Merge df = merge(tmpDF1, tmpDF2, by="ec") ## aggDF = aggregate(df['ec'], by = df['gene'], FUN=paste) ## Optional, see original link ## Convert KEGG gene IDs to Entrez Gene IDs convs = keggConv("hsa", "ncbi-geneid") names(convs) = as.character(gsub("ncbi-geneid:", "", names(convs))) df$ncbi_id = names(convs)[match(df$gene, as.character(convs))] df$ncbi_name = getSYMBOL(df$ncbi_id, "org.Hs.eg.db") ## Convert compound IDs to compound (metabolite) names mets = keggList("compound") mets = setNames(str_split(mets, ";", simplify = T)[,1], names(mets)) df$cpd_name = as.character(mets)[match(df$cpd, names(mets))] ## Make a list of metabolite gene sets. Each metabolite gene set is composed of enzymes/genes involved in their metabolism metabolites.gs = lapply(1:length(unique(df$cpd_name)), function(x) df$ncbi_name[which(df$cpd_name == unique(df$cpd_name)[x])]) namesmetabolites.gs) = unique(df$cpd_name) metabolites.gs$'D-Glyceraldehyde 3-phosphate' ## test ## Taking your df from your original post, create ranks. ranks = setNames(df$logFC, df$X) ## Now you have both the gene sets and the ranks. You can run GSEA res = fgseametabolites.gs, ranks, nperm = 1000)
Note: depending on the output from your differential expression analysis, I would use (i) the t-statistic, which is simply logFC / sd(logFC); (ii) the logFC, as I show above; or (iii) separate your df into 'up' or 'down' regulated, and use the p-values. I'm not an expert on this so I'm not sure which option is best, but you can play around with different analyses.
EDIT: For some reason my parenthesis are not showing: the proper code is "names(metabolites.gs)" and "fgsea(metabolites.gs)", without the asterisks.