Question: PFAM, KEGG enrichment for a non-model organism
3.4 years ago, mjp wrote:

Hi there,

There are a couple of posts on this circulating around, but I couldn't find a definitive answer for the non-model organism scenario.

How would one go about finding out which terms (PFAM, KEGG or others) are enriched in a group of genes of interest, given a universe of genes as the background against which to calculate the enrichment?

I am familiar with the topGO approach, which can accept the genes of interest in a simple tab-delimited format of IDs of some kind (they may be made-up names), and the universe as the same IDs with the GO IDs listed on the same row, separated by commas.


gene1 GO:0003677, GO:0004803, GO:0006313 ...

gene2 GO:0000160, GO:0003677, GO:0000160 ...


genes of interest:



I've found myself wondering whether there is a package that can take any kind of terms (PFAM, KEGG, GO, XX) and test whether a subset of IDs of interest is significantly enriched for them relative to a broader set. Annotations could happen at a later stage.
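To make the term-agnostic idea concrete, here is a minimal Python sketch, assuming the "ID, then comma-separated terms" layout shown above. The function names are hypothetical, and the gene3 row with its Pfam-style ID is made up for illustration. Nothing here depends on what kind of term it is; a term is just a string.

```python
from collections import defaultdict

def parse_annotations(lines):
    """Map each gene ID to the set of term IDs listed on its row."""
    gene_to_terms = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # gene ID, then the rest of the row
        if not parts:
            continue  # skip blank lines
        rest = parts[1] if len(parts) > 1 else ""
        # Terms are comma-separated; a set drops repeated terms on a row.
        gene_to_terms[parts[0]] = {t.strip() for t in rest.split(",") if t.strip()}
    return gene_to_terms

def invert_annotations(gene_to_terms):
    """Map each term ID to the set of genes that carry it."""
    term_to_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            term_to_genes[term].add(gene)
    return dict(term_to_genes)

# Rows for gene1 and gene2 follow the example above; gene3 is illustrative only.
rows = [
    "gene1\tGO:0003677, GO:0004803, GO:0006313",
    "",
    "gene2\tGO:0000160, GO:0003677, GO:0000160",
    "gene3\tPF00072, GO:0000160",
]
universe = parse_annotations(rows)
by_term = invert_annotations(universe)
```

The inverted mapping is the more convenient one for enrichment, since each term is tested separately against the gene universe.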

Any assistance, suggestions, pointers would be appreciated.

3.4 years ago, Lars Juhl Jensen (Copenhagen, Denmark) wrote:

I do not know of a tool that does precisely what you describe, i.e. one that lets you specify the annotations for all the genes and runs the enrichment analysis on those.

However, if what you want is just to look for enriched KEGG maps and protein domains, you could use the enrichment functionality in STRING. Just go to the website, select "Multiple proteins", paste in the names of your genes of interest, select your organism, and click through till you get a network. On the network page, click the "Analysis" tab below the network to show the enrichment results. STRING does not cover every sequenced organism, but with more than 2000 genomes in the current version, it covers a lot more than just model organisms. So if your organism of interest is among them, it would seem the easy solution.


Thank you for your suggestions. STRING looks quite impressive.

I'm currently looking at some novel microbes and fungi. I do the gene calling myself, so most of the genes are not initially publicly identifiable. I could perhaps use sequences as input to STRING, but I would like to do things in a high-throughput manner, so a predefined set of organisms would not really suit me. I also don't see why an enrichment method should rely on an organism for anything other than defining the gene universe.

Thank you!

Reply written 3.4 years ago by mjp
3.3 years ago, mjp wrote:

I have decided to use the most general approach, one that does not depend on any third-party software: Fisher's exact test. It is fairly straightforward to use and applicable to any sort of database.
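For readers who want to see what this amounts to, below is a minimal sketch in Python (standard library only) rather than the exact procedure used here. It assumes a one-sided test for over-representation, which is the upper tail of the hypergeometric distribution. The function names and the toy genes g1 to g4 with terms A and B are hypothetical.

```python
from math import comb

def enrichment_pvalue(k, n, K, N):
    """One-sided Fisher's exact test for over-representation.

    N: genes in the universe
    K: universe genes annotated with the term
    n: genes of interest
    k: genes of interest annotated with the term
    Returns P(X >= k) under the hypergeometric distribution.
    """
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)

def enrichment_table(gene_to_terms, genes_of_interest):
    """Return {term: p-value} for every term seen in the universe."""
    universe = set(gene_to_terms)
    interest = set(genes_of_interest) & universe  # ignore IDs outside the universe
    all_terms = set().union(*gene_to_terms.values()) if gene_to_terms else set()
    pvalues = {}
    for term in all_terms:
        annotated = {g for g in universe if term in gene_to_terms[g]}
        k = len(annotated & interest)
        pvalues[term] = enrichment_pvalue(k, len(interest), len(annotated), len(universe))
    return pvalues

# Toy universe: any string can serve as a term (GO, PFAM, KEGG, ...).
toy_universe = {"g1": {"A"}, "g2": {"A"}, "g3": {"B"}, "g4": {"B"}}
toy_pvalues = enrichment_table(toy_universe, ["g1", "g2"])
```

With only four genes the numbers are not meaningful; the toy data just shows how the four counts are derived from the annotation table.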

Thanks to all that contributed.


You're very welcome, but Fisher's exact test will only get you the first step. Don't forget correction for multiple testing.

Reply written 3.3 years ago by Lars Juhl Jensen

I do adjust my p-values for multiple testing :) I use standard R packages to achieve this. Thanks!

Reply written 3.3 years ago by mjp
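The thread does not say which correction was applied. As one common choice, here is a minimal Python sketch of the Benjamini-Hochberg false discovery rate adjustment, which is what R's `p.adjust` computes with `method = "BH"`. The function name and the example p-values are made up for illustration.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so that adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Illustrative raw p-values, one per tested term.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
```

Each term tested counts towards the number of tests, so the adjustment should be run over the p-values of all terms, not only the ones that look interesting.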