Question

Get All The Genes In A Pathway (E.G. Kegg)

9

Entering edit mode

12.8 years ago

Stephen 2.8k

I'm trying out a few methods for enrichment analysis for GWAS data. For this kind of analysis I need a list of pathways (or GO categories, gene families, etc), where I have a gene grouping followed by all the genes in that grouping, for example:

path1   gene1 gene2
path2   gene3 gene4 gene5 gene6
path3   gene7 gene8 gene9

I downloaded the entire hsa-pathway folder from the KEGG FTP site before it closed down, and I thought I had all this information in a file called hsa_gene_map.tab that looks like this:

2793    04062 04724 04725
2796    04912
2797    04912
2798    04080 04912
2799    00531 01100 04142

But it turns out this is the opposite of what I need - the first number is the Entrez gene ID, and the following fields on each line are KEGG pathway IDs, i.e.

gene1   path1 path2
gene2   path3 path4 path5 path6
gene3   path7 path8 path9

For KEGG, how can I get a list of pathways and all the genes in each of those pathways? I don't have a subscription to KEGG here so I can't use the FTP site any more. A list of files in a simple format as in the first example will be the most useful, since I have to read each of these lines into a list in R to do the analysis.

kegg pathway gwas gene • 30k views

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 12.8 years ago by Stephen 2.8k

0

Entering edit mode

Does it have to be KEGG? I think you should be able to get a list of genes by GO term from EnsEMBL! That is actually something I need to do in future analyses :-)

ADD REPLY • link 12.8 years ago by Steve Moss 2.3k

0

Entering edit mode

I would also like to do this for other gene groupings - e.g. gene ontology, reactome pathways, etc. The procedures described in the first link above use GO by default, but I'd like to start with KEGG first if possible.

ADD REPLY • link 12.8 years ago by Stephen 2.8k

Ram · Answer 1 · 2011-10-28

In R you can accomplish this as below:

library(org.Hs.eg.db)
kegg <- org.Hs.egPATH2EG
mapped <- mappedkeys(kegg)
kegg2 <- as.list(kegg[mapped])

This will give you a list (in the R sense) where the list names are KEGG Pathway IDs and the list elements are EntrezGene IDs. It does of course depend on the presence of an org.ORG.eg.db package for your species of interest. HTH duff

score 5 · Answer 2 · 2011-07-07

5

Entering edit mode

12.8 years ago

Tg ▴ 310

Since you mention that you need to do enrichment with reactome pathway and several data source, I suggest that you should get data from http://www.broadinstitute.org/gsea/downloads.jsp, it contains pathways-gene data from kegg, reactome, biocarta and some GO as well.

ADD COMMENT • link 12.8 years ago by Tg ▴ 310

0

Entering edit mode

This is exactly what I wanted. Thanks. Parsed using sed -r 's/\s+/,/' | sed -r 's/\s+/,/' | sed -r 's/\s+/ /g' for input for SNPath http://linchen.fhcrc.org/grass.html

ADD REPLY • link 12.8 years ago by Stephen 2.8k

Ram · Answer 3 · 2011-07-07

Why can't you just figure out the answer from what you have? If you have the list of all genes and which pathways they are in then by definition by have a list of all pathways and which genes are in them.

Simply use a programming language to 'transpose' the data you have. I don't have any R knowledge but this python script can do it very easily:

from collections import defaultdict

with open('path/to/file') as handle:
    gene2pathway = defaultdict(set)
    for line in handle:
        parts = line.strip().split()
        gene2pathway[parts[0]] = set(parts[1:])

pathway2gene = defaultdict(set)
for gene, pathways in gene2pathway.items():
    for path in pathways:
        pathway2gene[path].add(gene)

for gene, pathways in pathway2gene.items():
    print gene, ' '.join(sorted(pathways))

I'm sure someone with more R knowledge then myself could do this just as easily.

Ram · Answer 4 · 2011-07-07

3

Entering edit mode

12.8 years ago

Khader Shameer 18k

You can use KEGG API to get all genes in a KEGG pathway. You can access KEGG API via Perl, Ruby, Python or Java. See the tutorial section on Pathways here.

You have to use the function get_genes_by_pathway(string:pathway_id)

Details about various functions / methods and examples scripts are available here.

Also check this discussion on gene-pathway association file from KEGG as well.

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Khader Shameer 18k

0

Entering edit mode

why use the API .. if you already have the hsa_gene_map file then you already have all of the mappings?

ADD REPLY • link 12.8 years ago by Will 4.5k

0

Entering edit mode

@Will: Simple, if there is an API use it, instead of raw data munging :).

ADD REPLY • link 12.8 years ago by Khader Shameer 18k

0

Entering edit mode

@Will: Simple, if there is an API, I prefer to use it instead of raw data munging :).

ADD REPLY • link 12.8 years ago by Khader Shameer 18k

score 3 · Answer 5 · 2011-07-07

We've just added some features to Biocyc that make it easier to produce these types of lists, for instance here's a group of ecoli pathways and associated genes that I just put together:

This is exportable; it may not be in the format you want but suggestions for extensions are welcome.

Note you can also do enrichment analysis on this group right on the site.

Ram · Answer 6 · 2011-10-28

For GO, you can use the un-exported function getGO2ExtID of my package clusterProfiler

For instance:

clusterProfiler:::getGO2ExtID("GO:0044419", organism="human")
$GO:0044419
  [1] "213"    "259"    "274"    "290"    "291"    "292"    "293"    "336" 
  [9] "358"    "373"    "387"    "527"    "664"    "665"    "699"    "708" 
 [17] "735"    "820"    "857"    "904"    "905"    "912"    "919"    "920" 
 [25] "941"    "942"    "948"    "949"    "975"    "995"    "1050"   "1058"
...

For KEGG, you can use bioconductor package KEGG.db

require(KEGG)
Path2ExtID <- mget(PathID, KEGGPATHID2EXTID, ifnotfound=NA)
Path2ExtID <- lapply(Path2ExtID, function(i) unique(i))

Ram · Answer 7 · 2011-07-07

2

Entering edit mode

12.8 years ago

Chris Evelo 10k

If you are interested in other resources as well you could use the WikiPathways webservices to get a list of pathways and next a list of genes (names and IDs) per pathway. See: http://www.wikipathways.org/index.php/Help:WikiPathways_Webservice

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 12.8 years ago by Chris Evelo 10k

score 0 · Answer 8 · 2011-08-24

0

Entering edit mode

12.7 years ago

Robin Haw ▴ 170

There is a Reactome gene set available for downloading from our 'Download' page. As mentioned already it is also part of the GSEA gene sets available from the BROAD.

ADD COMMENT • link 12.7 years ago by Robin Haw ▴ 170

Ram · Answer 9 · 2011-08-25

I had to do the same job inside a C/C++ program. I downloaded signature files from Broad (as Tg): http://www.broadinstitute.org/gsea/downloads.jsp Then I parsed and analyzed with the following code:

// input:
// geneIds: set of genes to look for
// filename:gsea filename
// cutoff:  min. number of genes to match
// pLimit:  min. desired significance
// output:
// genesetResult
// genesetP

    static inline int overlapGeneSet(const set<string> &geneIds, const string &filename, int cutoff, double pLimit, vector< vector<string> > &genesetResult, vector<double> &genesetP){
        const int BufSize(100000); // oversized input row buffer
        char *buffer = (char *)malloc( BufSize );
        int result(0);
        ifstream gsea(filename.c_str());
        string strVal;    
        char delimiter = '\t';

        int gseaSize;                   // signature size
        string gseaName;                // signature name
        string gseaSource;              // signature desc.
        vector<string> gseaCommonGenes; // number of matching genes

        // foreach row/geneset
        while(!gsea.eof()){
            istringstream strstream;
            gsea.getline(buffer, BufSize);
            strstream.str(buffer);
            gseaCommonGenes.clear();
            gseaSize = -2;
            gseaName = "";
            gseaSource = "";
            // foreach field in geneset
            while(!strstream.eof()){
                strVal = "";
                getline(strstream, strVal, delimiter);
                if(gseaSize==-2)
                    gseaName = strVal;
                if(gseaSize==-1)
                    gseaSource = strVal;
                if(gseaSize >= 0){
                    if(geneIds.find(strVal) != geneIds.end()){
                        gseaCommonGenes.push_back(strVal);
                    }
                }
                gseaSize++;
            }
            if(gseaCommonGenes.size() >= cutoff){
                result++;
                double P = <your enrichment test here>; 
                        // e.g. hypg(NGenes, gseaSize, geneIds.size(), seaCommonGenes.size());
                if(P < pLimit){
                    gseaCommonGenes.push_back(gseaName);
                    genesetResult.push_back(gseaCommonGenes);
                    genesetP.push_back(P);
                }
            }
        }
        free(buffer);
        return result;
    }

You only have to set your preferred enrichment condition at the line where I assign a value to P for each parsed geneset.

Ram · Answer 10 · 2011-10-28

0

Entering edit mode

12.5 years ago

Andrew W ▴ 290

Perhaps ugly, but here's a Perl one-liner you can run on the command line that will do this type of data transpose:

cat hsa_gene_map.tab | perl -e 'while(<>) { chomp; @d = split(/\s+/, $_); $g = shift @d; map { $map{$_}{$g}++ } @d; } for $p (sort keys %map) { print join("\t", $p, sort keys %{ $map{$p} }), "\n" }' > hsa_gene_map2.tab

ADD COMMENT • link updated 5.6 years ago by Ram 43k • written 12.5 years ago by Andrew W ▴ 290