Question: Get All The Genes In A Pathway (E.G. Kegg)
8
gravatar for Stephen
7.0 years ago by
Stephen2.6k
Charlottesville Virginia
Stephen2.6k wrote:

I'm trying out a few methods for enrichment analysis for GWAS data. For this kind of analysis I need a list of pathways (or GO categories, gene families, etc), where I have a gene grouping followed by all the genes in that grouping, for example:

path1   gene1 gene2
path2   gene3 gene4 gene5 gene6
path3   gene7 gene8 gene9

I downloaded the entire hsa-pathway folder from the KEGG FTP site before it closed down, and I thought I had all this information in a file called hsa_gene_map.tab that looks like this:

2793    04062 04724 04725
2796    04912
2797    04912
2798    04080 04912
2799    00531 01100 04142

But it turns out this is the opposite of what I need - the first number is the Entrez gene ID, and the following fields on each line are KEGG pathway IDs, i.e.

gene1   path1 path2
gene2   path3 path4 path5 path6
gene3   path7 path8 path9

For KEGG, how can I get a list of pathways and all the genes in each of those pathways? I don't have a subscription to KEGG here so I can't use the FTP site any more. A list of files in a simple format as in the first example will be the most useful, since I have to read each of these lines into a list in R to do the analysis.

pathway gwas kegg gene • 20k views
ADD COMMENTlink written 7.0 years ago by Stephen2.6k

Does it have to be KEGG? I think you should be able to get a list of genes by GO term from EnsEMBL! That is actually something I need to do in future analyses :-)

ADD REPLYlink written 7.0 years ago by Steve Moss2.2k

I would also like to do this for other gene groupings - e.g. gene ontology, reactome pathways, etc. The procedures described in the first link above use GO by default, but I'd like to start with KEGG first if possible.

ADD REPLYlink written 7.0 years ago by Stephen2.6k
7
gravatar for Duff
6.7 years ago by
Duff660
United Kingdom
Duff660 wrote:

In R you can accomplish this as below:

library(org.Hs.eg.db)
kegg <- org.Hs.egPATH2EG
mapped <- mappedkeys(kegg)
kegg2 <- as.list(kegg[mapped])

This will give you a list (in the R sense) where the list names are KEGG Pathway IDs and the list elements are EntrezGene IDs. It does of course depend on the presence of an org.ORG.eg.db package for your species of interest.

HTH

duff

ADD COMMENTlink written 6.7 years ago by Duff660
5
gravatar for Tg
7.0 years ago by
Tg310
Thailand
Tg310 wrote:

Since you mention that you need to do enrichment with reactome pathway and several data source, I suggest that you should get data from http://www.broadinstitute.org/gsea/downloads.jsp, it contains pathways-gene data from kegg, reactome, biocarta and some GO as well.

ADD COMMENTlink written 7.0 years ago by Tg310

This is exactly what I wanted. Thanks. Parsed using sed -r 's/\s+/,/' | sed -r 's/\s+/,/' | sed -r 's/\s+/ /g' for input for SNPath http://linchen.fhcrc.org/grass.html

ADD REPLYlink written 7.0 years ago by Stephen2.6k
4
gravatar for Will
7.0 years ago by
Will4.4k
United States
Will4.4k wrote:

Why can't you just figure out the answer from what you have? If you have the list of all genes and which pathways they are in then by definition by have a list of all pathways and which genes are in them.

Simply use a programming language to 'transpose' the data you have. I don't have any R knowledge but this python script can do it very easily:

from collections import defaultdict

with open('path/to/file') as handle:
    gene2pathway = defaultdict(set)
    for line in handle:
        parts = line.strip().split()
        gene2pathway[parts[0]] = set(parts[1:])

pathway2gene = defaultdict(set)
for gene, pathways in gene2pathway.items():
    for path in pathways:
        pathway2gene[path].add(gene)

for gene, pathways in pathway2gene.items():
    print gene, ' '.join(sorted(pathways))

I'm sure someone with more R knowledge then myself could do this just as easily.

ADD COMMENTlink written 7.0 years ago by Will4.4k
3
gravatar for Khader Shameer
7.0 years ago by
Manhattan, NY
Khader Shameer17k wrote:

You can use KEGG API to get all genes in a KEGG pathway. You can access KEGG API via Perl, Ruby, Python or Java. See the tutorial section on Pathways here.

You have to use the function get_genes_by_pathway(string:pathway_id)

Details about various functions / methods and examples scripts are available here.

Also check this discussion on gene-pathway association file from KEGG as well.

ADD COMMENTlink modified 7.0 years ago • written 7.0 years ago by Khader Shameer17k

why use the API .. if you already have the hsa_gene_map file then you already have all of the mappings?

ADD REPLYlink written 7.0 years ago by Will4.4k

@Will: Simple, if there is an API use it, instead of raw data munging :).

ADD REPLYlink written 7.0 years ago by Khader Shameer17k

@Will: Simple, if there is an API, I prefer to use it instead of raw data munging :).

ADD REPLYlink written 7.0 years ago by Khader Shameer17k
3
gravatar for Mt
7.0 years ago by
Mt60
Bay Area
Mt60 wrote:

We've just added some features to Biocyc that make it easier to produce these types of lists, for instance here's a group of ecoli pathways and associated genes that I just put together:

This is exportable; it may not be in the format you want but suggestions for extensions are welcome.

Note you can also do enrichment analysis on this group right on the site.

ADD COMMENTlink written 7.0 years ago by Mt60
3
gravatar for Guangchuang Yu
6.7 years ago by
Guangchuang Yu1.9k
China/Hong Kong/The University of Hong Kong
Guangchuang Yu1.9k wrote:

For GO, you can use the un-exported function getGO2ExtID of my package clusterProfiler http://www.bioconductor.org/packages/devel/bioc/html/clusterProfiler.html

For instance:

clusterProfiler:::getGO2ExtID("GO:0044419", organism="human")

$GO:0044419 [1] "213" "259" "274" "290" "291" "292" "293" "336"
[9] "358" "373" "387" "527" "664" "665" "699" "708"
[17] "735" "820" "857" "904" "905" "912" "919" "920"
[25] "941" "942" "948" "949" "975" "995" "1050" "1058"
...

For KEGG, you can use bioconductor package KEGG.db

require(KEGG)

Path2ExtID <- mget(PathID, KEGGPATHID2EXTID, ifnotfound=NA)

Path2ExtID <- lapply(Path2ExtID, function(i) unique(i))

ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Guangchuang Yu1.9k
1

out-dated, see A: Collecting Genes with similar GO term

ADD REPLYlink written 19 months ago by Guangchuang Yu1.9k
1
gravatar for Chris Evelo
7.0 years ago by
Chris Evelo9.9k
Maastricht, The Netherlands
Chris Evelo9.9k wrote:

If you are interested in other resources as well you could use the WikiPathways webservices to get a list of pathways and next a list of genes (names and IDs) per pathway. See: http://www.wikipathways.org/index.php/Help:WikiPathways_Webservice

ADD COMMENTlink written 7.0 years ago by Chris Evelo9.9k
0
gravatar for Robin Haw
6.9 years ago by
Robin Haw160
Toronto
Robin Haw160 wrote:

There is a Reactome gene set available for downloading from our 'Download' page. As mentioned already it is also part of the GSEA gene sets available from the BROAD.

ADD COMMENTlink written 6.9 years ago by Robin Haw160
0
gravatar for ff.cc.cc
6.9 years ago by
ff.cc.cc1.2k
European Union
ff.cc.cc1.2k wrote:

I had to do the same job inside a C/C++ program.

I downloaded signature files from Broad (as Tg): http://www.broadinstitute.org/gsea/downloads.jsp

Then I parsed and analyzed with the following code:

// input:
// geneIds: set of genes to look for
// filename:gsea filename
// cutoff:  min. number of genes to match
// pLimit:  min. desired significance
// output:
// genesetResult
// genesetP

    static inline int overlapGeneSet(const set<string> &geneIds, const string &filename, int cutoff, double pLimit, vector< vector<string> > &genesetResult, vector<double> &genesetP){
        const int BufSize(100000); // oversized input row buffer
        char *buffer = (char *)malloc( BufSize );
        int result(0);
        ifstream gsea(filename.c_str());
        string strVal;    
        char delimiter = '\t';

        int gseaSize;                   // signature size
        string gseaName;                // signature name
        string gseaSource;              // signature desc.
        vector<string> gseaCommonGenes; // number of matching genes

        // foreach row/geneset
        while(!gsea.eof()){
            istringstream strstream;
            gsea.getline(buffer, BufSize);
            strstream.str(buffer);
            gseaCommonGenes.clear();
            gseaSize = -2;
            gseaName = "";
            gseaSource = "";
            // foreach field in geneset
            while(!strstream.eof()){
                strVal = "";
                getline(strstream, strVal, delimiter);
                if(gseaSize==-2)
                    gseaName = strVal;
                if(gseaSize==-1)
                    gseaSource = strVal;
                if(gseaSize >= 0){
                    if(geneIds.find(strVal) != geneIds.end()){
                        gseaCommonGenes.push_back(strVal);
                    }
                }
                gseaSize++;
            }
            if(gseaCommonGenes.size() >= cutoff){
                result++;
                double P = <your enrichment test here>; 
                        // e.g. hypg(NGenes, gseaSize, geneIds.size(), seaCommonGenes.size());
                if(P < pLimit){
                    gseaCommonGenes.push_back(gseaName);
                    genesetResult.push_back(gseaCommonGenes);
                    genesetP.push_back(P);
                }
            }
        }
        free(buffer);
        return result;
    }

You only have to set your preferred enrichment condition at the line where I assign a value to P for each parsed geneset.

ADD COMMENTlink written 6.9 years ago by ff.cc.cc1.2k
0
gravatar for Andrew W
6.7 years ago by
Andrew W290
Andrew W290 wrote:

Perhaps ugly, but here's a Perl one-liner you can run on the command line that will do this type of data transpose:

cat hsa_gene_map.tab | perl -e 'while(<>) { chomp; @d = split(/\s+/, $_); $g = shift @d; map { $map{$_}{$g}++ } @d; } for $p (sort keys %map) { print join("\t", $p, sort keys %{ $map{$p} }), "\n" }' > hsa_gene_map2.tab
ADD COMMENTlink modified 6.7 years ago • written 6.7 years ago by Andrew W290
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1411 users visited in the last hour