I have a list of genes and I want to know which metabolites are assigned to those genes. is there anyone who can tell me what are the options? I prefer to have access to a database rather than web app so feel free to comment on any package or script etc
Based on this, with some modifications:
library(KEGGREST)
library(org.Hs.eg.db)
library(annotate)
## Get enzyme-gene annotations
res1 = keggLink("enzyme", "hsa")
tmpDF1 = data.frame(ec = res1, gene = names(res1))
## Get compound-enzyme annotations
res2 = keggLink("compound", "enzyme")
tmpDF2 = data.frame(cpd = res2, ec = names(res2))
## Merge
df = merge(tmpDF1, tmpDF2, by="ec")
## aggDF = aggregate(df['ec'], by = df['gene'], FUN=paste) ## Optional, see original link
## Convert KEGG gene IDs to Entrez Gene IDs
convs = keggConv("hsa", "ncbi-geneid")
names(convs) = as.character(gsub("ncbi-geneid:", "", names(convs)))
df$ncbi_id = names(convs)[match(df$gene, as.character(convs))]
df$ncbi_name = getSYMBOL(df$ncbi_id, "org.Hs.eg.db")
## Convert compound IDs to compound (metabolite) names
mets = keggList("compound")
mets = setNames(str_split(mets, ";", simplify = T)[,1], names(mets))
df$cpd_name = as.character(mets)[match(df$cpd, names(mets))]
## Make a list of metabolite gene sets. Each metabolite gene set is composed of enzymes/genes involved in their metabolism
metabolites.gs = lapply(1:length(unique(df$cpd_name)), function(x) df$ncbi_name[which(df$cpd_name == unique(df$cpd_name)[x])])
namesmetabolites.gs) = unique(df$cpd_name)
metabolites.gs$'D-Glyceraldehyde 3-phosphate' ## test
## Taking your df from your original post, create ranks.
ranks = setNames(df$logFC, df$X)
## Now you have both the gene sets and the ranks. You can run GSEA
res = fgseametabolites.gs, ranks, nperm = 1000)
Note: depending on the output from your differential expression analysis, I would use (i) the t-statistic, which is simply logFC / sd(logFC); (ii) the logFC, as I show above; or (iii) separate your df into 'up' or 'down' regulated, and use the p-values. I'm not an expert on this so I'm not sure which option is best, but you can play around with different analyses.
EDIT: For some reason my parenthesis are not showing: the proper code is "names(metabolites.gs)" and "fgsea(metabolites.gs)", without the asterisks.
@ivanarg2 thanks for your help. how then I can use my gene list to find the metabolites?
You can then run regular gene set enrichment analysis, as below:
fgseametabolites.gs, ranks, nperm = 1000) ## ranks is from your differential expression analysis, i.e., a t-statistic (logFC/logFC.sd)
The resulting list of 'pathways' (really, metabolites), will tell you the enrichment of metabolites in different conditions. Is this what you were asking?
@ivanarg2 here is what I undestood. You extract the gene and metabolits assigned to them from KEGG. Then I have a list of genes which are upregulated and I already did the differential experission , so I have their names. I want to just see what metabolites are upregulated with the list of the genes that I have . Is that clear?
Right. So you already have the two necessary parameters to run GSEA: (i) a list of gene sets (which usually refers to a biological pathway and the genes associated with those pathways, but in this case the gene sets refer to metabolites, and the genes involved in their metabolism); and (ii) a ranking of genes based on their differential expression. Just run fgsea on your list of differentially expressed genes and metabolites.gs.
@ivanarg2 Can you help me to apply it on this list ? "GPX3", "GLRX", "LBP", "CRYAB", "DEFB1", "HCLS1", "SOD2", "HSPA2", "ORM1", "IGFBP1", "PTHLH", "GPC3", "IGFBP3","TOB1", "MITF", "NDRG1", "NR1H4", "FGFR3", "PVR", "IL6", "PTPRM", "ERBB2", "NID2", "LAMB1", "COMP", "PLS3", "MCAM", "SPP1", "LAMC1", "COL4A2", "COL4A1", "MYOC", "ANXA4", "TFPI2", "CST6", "SLPI", "TIMP2", "CPM", "GGT1", "NNMT", "MAL", "EEF1A2", "HGD", "TCN2", "CDA", "PCCA", "CRYM", "PDXK","STC1", "WARS", "HMOX1", "FXYD2", "RBP4", "SLC6A12", "KDELR3", "ITM2B"
To run GSEA you need a vector with (i) gene names, as you provide, and (ii) some kind of ranking variable, like a fold change, a p value, or something like that. Do you have this info from your differential expression analysis?
Check section 4.5.2 here: https://bioconductor.org/packages/release/bioc/vignettes/piano/inst/doc/piano-vignette.pdf and reference 6 therein. They describe making 'gene sets' where each gene set is a metabolite, and the genes within each set are the enzymes related to metabolism of that metabolite. I'm interested in this as well, let me know how it goes for you.
Apart from what appears to be a good answer by ivanarg2, you may also find the functions of MetaboAnalyst useful. Certain functions can work with genes.
Kevin