Question: How To Find The Pathways In Which A Given Gene Or Protein Is Involved?
8.9 years ago by
K-Li80 wrote:

It's like the title says: I have some gene names and I want to know which pathways they are involved in. Which method may work? Thanks.

gene pathway • 16k views
modified 4.5 years ago by Endre Bakken Stovner • written 8.9 years ago by K-Li
8.9 years ago by
Chris Evelo10.0k
Maastricht, The Netherlands
Chris Evelo10.0k wrote:

The first thing that comes to my mind is that you could search WikiPathways. Just paste your gene or protein name into the search box and there you go. If you have many genes to search, you could use the WikiPathways web services.

Since WikiPathways now contains Reactome, you will find the Reactome pathways this way as well, but of course you could search Reactome separately. You could also use PathVisio, which is not only the editor applet of WikiPathways but also a standalone pathway tool, to search the converted KEGG pathways, which you can download from the PathVisio site. But that will not give better results than what Michael suggested for searching KEGG directly.

Also check Pathway Commons. It covers a lot of pathway resources and has a nice search feature. Its content includes, among others: BioGRID, HumanCyc, MetaCyc, MINT, IntAct, the NCI/Nature Pathway Interaction Database and Reactome.

Finally, you might want to search for your gene in GO. Many gene classes in GO actually are pathways, or at least the genes in that class are clearly related to a specific biological pathway. That way you might find a few pathways that your gene belongs to, or is related to, even though the gene is not yet covered in the pathway resource itself.

There are also a number of species-specific pathway resources. How useful these are of course depends on what species your genes are from.

Update: WikiPathways content is now also available as downloadable RDF and can be accessed through a SPARQL endpoint. Examples of useful SPARQL queries can be found here. These include queries to find all pathways containing a specific gene.

modified 4.5 years ago • written 8.9 years ago by Chris Evelo
8.9 years ago by
Bergen, Norway
Michael Dondrup47k wrote:


8.9 years ago by
Stew1.4k wrote:

I would recommend DAVID for an easy way to go from gene lists to functional information, such as pathways. It contains lots of the databases mentioned by other people here and is very well documented and highly cited.


It is a good idea to check the date of DAVID's latest update before you use it. Currently the latest update is from January 2010.

Reply written 4.5 years ago by Chris Evelo
8.5 years ago by
European Union wrote:

DAVID and GSEA are my preferred online tools.

But I had to do the same job inside a C/C++ program.

I downloaded signature files from the Broad Institute:

Then I parsed and analyzed with the following code:

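    #include <fstream>
    #include <set>
    #include <sstream>
    #include <string>
    #include <vector>
    using namespace std;

    // Placeholder for your enrichment test (declared here, not defined), e.g. a
    // hypergeometric test: hypg(NGenes, gseaSize, geneIds.size(), gseaCommonGenes.size())
    double enrichmentTest(int gseaSize, int nQueryGenes, int nCommonGenes);

    // input:
    // geneIds:  set of genes to look for
    // filename: GSEA gene set file (tab-delimited rows: name, description, genes...)
    // cutoff:   min. number of genes to match
    // pLimit:   max. p-value to report
    // output:
    // genesetResult: one row per reported gene set; the layout used here
    //                (name, description, matching genes) is an assumption
    // genesetP:      p-value of each reported gene set
    // returns the number of reported gene sets

    static inline int overlapGeneSet(const set<string> &geneIds, const string &filename, int cutoff, double pLimit, vector< vector<string> > &genesetResult, vector<double> &genesetP){
        int result(0);
        ifstream gsea(filename.c_str());
        string line;
        string strVal;
        char delimiter = '\t';

        // foreach row/geneset
        while(getline(gsea, line)){
            // drop a trailing carriage return from Windows line endings
            if(!line.empty() && line[line.size() - 1] == '\r')
                line.erase(line.size() - 1);
            istringstream strstream(line);
            int gseaSize = -2;              // signature size; -2 and -1 mark the name and description fields
            string gseaName;                // signature name
            string gseaSource;              // signature desc.
            vector<string> gseaCommonGenes; // matching genes

            // foreach field in geneset
            while(getline(strstream, strVal, delimiter)){
                if(gseaSize == -2)
                    gseaName = strVal;
                else if(gseaSize == -1)
                    gseaSource = strVal;
                else if(geneIds.find(strVal) != geneIds.end())
                    gseaCommonGenes.push_back(strVal);
                ++gseaSize;
            }

            if(gseaSize > 0 && (int)gseaCommonGenes.size() >= cutoff){
                double P = enrichmentTest(gseaSize, (int)geneIds.size(), (int)gseaCommonGenes.size());
                if(P < pLimit){
                    vector<string> row;
                    row.push_back(gseaName);
                    row.push_back(gseaSource);
                    row.insert(row.end(), gseaCommonGenes.begin(), gseaCommonGenes.end());
                    genesetResult.push_back(row);
                    genesetP.push_back(P);
                    ++result;
                }
            }
        }
        return result;
    }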

Call your own routine for computing the enrichment score at the line where I assign a value to P.
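As one possible enrichment test for the line that assigns P, here is a minimal sketch of an upper-tail hypergeometric probability. The name hypg and its argument order follow the commented-out call in the code above, but the body is an assumption and not the original routine. The first argument (NGenes in that comment) is taken to be the size of the background gene universe, which you have to choose yourself.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Log of the binomial coefficient C(n, k), computed with lgamma to avoid overflow.
static double logChoose(int n, int k) {
    return std::lgamma(n + 1.0) - std::lgamma(k + 1.0) - std::lgamma(n - k + 1.0);
}

// Upper-tail hypergeometric probability P(X >= overlap), where X is the number
// of gene-set members among `queryGenes` genes drawn without replacement from a
// universe of `nGenes` genes that contains `setSize` gene-set members.
// Assumes setSize <= nGenes and queryGenes <= nGenes.
double hypg(int nGenes, int setSize, int queryGenes, int overlap) {
    assert(nGenes >= 0 && setSize >= 0 && queryGenes >= 0);
    assert(setSize <= nGenes && queryGenes <= nGenes);

    // Smallest and largest overlap that can occur at all.
    int minOverlap = std::max(0, queryGenes - (nGenes - setSize));
    int maxOverlap = std::min(setSize, queryGenes);
    if (overlap <= minOverlap) return 1.0;
    if (overlap > maxOverlap) return 0.0;

    // Sum the point probabilities from `overlap` up to the largest possible overlap.
    double logDenom = logChoose(nGenes, queryGenes);
    double p = 0.0;
    for (int k = overlap; k <= maxOverlap; ++k) {
        p += std::exp(logChoose(setSize, k)
                      + logChoose(nGenes - setSize, queryGenes - k)
                      - logDenom);
    }
    return std::min(p, 1.0);
}
```

With something like this available, the placeholder could become P = hypg(NGenes, gseaSize, geneIds.size(), gseaCommonGenes.size()), where NGenes is whatever background gene count is appropriate for your data. Keep in mind that testing many gene sets this way calls for a multiple-testing correction on top of the raw P values.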

Look also at:

modified 5 months ago by RamRS • written 8.5 years ago

Hey, what signature files are you talking about? I want to use your identification method; could you tell me which files you used? Thanks.

Reply written 3.3 years ago by soza

This question is 5 years old and the poster hasn't been on Biostars for 2 years and 3 months, so I wouldn't count on an answer here! ;)

Reply written 3.3 years ago by WouterDeCoster
8.6 years ago by
Dataminer2.7k wrote:

GSEA from the Broad Institute.

4.5 years ago by
Endre Bakken Stovner880 wrote:

I need to do this from the command line rather often, so I wrote a script to query KEGG called kg.

$ echo "Gna14" | kg -m 0 -q -s rno -d --noheader -
Gna14    04020    Calcium signaling pathway
Gna14    05142    Chagas disease (American trypanosomiasis)
Gna14    05146    Amoebiasis

You can also go the other way and get genes from pathway ids. 

To explain the command line:

-m 0 # join on column 0 (there is only one gene name, hence only one column to join on.)

-s rno # the gene identifiers are for Rattus norvegicus (use hsa for human and mmu for mouse)

-d # add definitions (the human readable part in the third column)

-q # quiet, do not show progress info on stderr

 One advantage of using kg is that kg stores the data locally so subsequent queries are instantaneous.

Install with 

pip install kg

See for more.

The command line interface:


Get KEGG data from the command line.
(Visit for examples and help.)

    kg --help
    kg --mergecol=COL --species=SPEC [--genes] [--definitions] [--noheader] [--quiet] FILE
    kg --species=SPEC [--definitions] [--quiet]
    kg --removecache

    FILE                    infile to add KEGG data to (read STDIN with -)
    -s SPEC --species=SPEC  name of species (examples: hsa, mmu, rno...)
    -m COL --mergecol=COL   column (0-indexed int or name) containing gene names

    -h --help               show this message
    -q --quiet              do not show progress messages on stderr
    -n --noheader           the input data does not contain a header
    -d --definitions        add KEGG pathway definitions to the output
    -g --genes              get the genes related to KEGG pathways
                            (when used, mergecol COL should contain KEGG pathway
    --removecache           removes the local cache so that the KEGG REST DB is
                            accessed anew


    Write all KEGG info to STDOUT for "Rattus Norvegicus":

        kg --species rno

    Get all human pathways associated with the genes in column called "Gene" in
    test.txt, merge them to the file, add pathway definitions and write to STDOUT

        kg -s hsa -m Gene -d test.txt