Question: Extract Human Gene Sequences Based On KEGG Gene Name
Dejian (1.3k, United States) wrote 8.2 years ago:

Hi, I have a list of human gene names from the KEGG database, for example ALDOA, BHLHB3, PKM2, P4HA1 and EPO. For each name, I can search the KEGG database to get a list of genes with the same name in several species, click the entry that links to the human gene (e.g. hsa:226), and finally get the amino acid sequence and nucleotide sequence. Since there are hundreds of genes, this is clearly not efficient. Is there a more convenient way to do this? Many thanks!

Steve Moss (2.3k, United Kingdom) wrote 8.2 years ago:

Have you thought about using the KEGG API? See the following links for more information:

Also, BioRuby seems to have a pretty good API implemented:

As does the R Bioconductor KEGGSOAP package:

The following (simple) Python script should work a treat for now though ;)

#!/usr/bin/env python
# Python script to retrieve KEGG gene entries for a number of different genes
# Coded by Steve Moss (gawbul [at] gmail [dot] com)

# import required modules
from SOAPpy import WSDL

# set up the KEGG WSDL proxy (the WSDL URL is blank here; set it before running)
kegg_wsdl = ''
kegg_service = WSDL.Proxy(kegg_wsdl)

# set up tuple of gene names
gene_names = ("ALDOA", "BHLHB3", "PKM2", "P4HA1", "EPO")

# iterate over gene_names and retrieve sequences
for gene_name in gene_names:
    # use bfind first to find the list of genes for each query
    # limit to hsa (homo sapiens)
    # bfind returns a single str, so split it into lines and drop empty ones
    found = kegg_service.bfind("genes " + gene_name + " hsa")
    gene_entries = [line for line in found.split("\n") if line.strip()]
    print("Found %d entries for %s" % (len(gene_entries), gene_name))

    # iterate over gene_entries
    for gene_entry in gene_entries:
        # just use the first part of the string (e.g. hsa:226) to retrieve
        # the sequences in fasta format (-f)
        results = kegg_service.bget("-f " + gene_entry.split()[0])
        # print results to screen
        print(results)
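
A keyword search such as bfind can also return entries that only mention the gene name somewhere in their description, so it may be worth filtering the hits down to exact symbol matches before calling bget. The following is a minimal sketch in Python 3. It assumes each result line looks like "ID SYMBOL1, SYMBOL2; description", which is an assumption about the output format and not something the script above guarantees. The helper names and the sample lines are hypothetical.

```python
def parse_entries(text):
    """Split bfind-style output into (entry_id, description) pairs.

    Blank lines are skipped. The "ID rest-of-line" layout is an assumption.
    """
    pairs = []
    for line in text.splitlines():
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        description = parts[1] if len(parts) > 1 else ""
        pairs.append((parts[0], description))
    return pairs


def exact_symbol_matches(pairs, gene_name):
    """Return the IDs whose symbol list (the text before ';') contains gene_name."""
    wanted = gene_name.upper()
    hits = []
    for entry_id, description in pairs:
        symbols = description.split(";", 1)[0]
        names = {symbol.strip().upper() for symbol in symbols.split(",")}
        if wanted in names:
            hits.append(entry_id)
    return hits


if __name__ == "__main__":
    # Hypothetical sample lines, for illustration only.
    sample = (
        "hsa:226 ALDOA, ALDA; aldolase A\n"
        "hsa:0000 FAKE1; made-up protein that binds ALDOA\n"
    )
    print(exact_symbol_matches(parse_entries(sample), "ALDOA"))
```

In the script above, the filtered IDs would take the place of gene_entry.split()[0] in the inner loop.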

You could modify this to read the gene names from a file, and perhaps also to write the output to a file instead of printing it to STDOUT.
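
One possible way to do that is sketched below in Python 3. The function names are hypothetical, and `fetch` is a stand-in for whatever retrieves the FASTA text for one gene name (for example, a wrapper around the bfind and bget calls above). The input file is assumed to hold one gene name per line.

```python
def read_gene_names(path):
    """Read one gene name per line, skipping blank lines and '#' comments."""
    names = []
    with open(path) as handle:
        for line in handle:
            name = line.strip()
            if name and not name.startswith("#"):
                names.append(name)
    return names


def write_sequences(path, gene_names, fetch):
    """Write the FASTA text returned by fetch(name) for each gene to path.

    fetch is a hypothetical callable: it takes a gene name and returns a
    string (empty if nothing was found). Returns the number of genes written.
    """
    written = 0
    with open(path, "w") as out:
        for name in gene_names:
            fasta = fetch(name)
            if fasta:
                # make sure each record ends with exactly one newline
                out.write(fasta.rstrip("\n") + "\n")
                written += 1
    return written
```

A call might then look like write_sequences("sequences.fasta", read_gene_names("genes.txt"), fetch), where both file names are placeholders.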

Essentially, this uses the SOAP/WSDL framework to expose the equivalent of the HTTP URLs as a web service, i.e. in a form that a program can read. You can build queries with the KEGG API just as you would build a URL: a call such as kegg_service.bget("-f hsa:226") in the script above is equivalent to requesting the corresponding URL in a browser, except that the data is returned to the script as XML rather than as HTML.
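
To make the "query as URL" idea concrete, here is a rough Python 3 sketch that builds plain HTTP URLs for the same two steps. It assumes the REST-style KEGG service at rest.kegg.jp with its `find` and `get` operations and the `aaseq`/`ntseq` options. This is a different interface from the SOAP one used above, so treat the base URL and path layout as assumptions to check against the KEGG API documentation. The function names are hypothetical.

```python
from urllib.parse import quote
from urllib.request import urlopen

KEGG_REST = "https://rest.kegg.jp"  # assumed base URL of the REST service


def find_url(gene_name, organism="hsa"):
    """URL for a keyword search restricted to one organism (default: human)."""
    return "%s/find/%s/%s" % (KEGG_REST, organism, quote(gene_name))


def sequence_url(entry_id, kind="aaseq"):
    """URL for the amino acid (aaseq) or nucleotide (ntseq) FASTA of an entry."""
    if kind not in ("aaseq", "ntseq"):
        raise ValueError("kind must be 'aaseq' or 'ntseq'")
    return "%s/get/%s/%s" % (KEGG_REST, quote(entry_id, safe=":"), kind)


def fetch_text(url):
    """Download a URL and return the response body as text (needs network)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8")


if __name__ == "__main__":
    print(find_url("ALDOA"))
    print(sequence_url("hsa:226", "ntseq"))
```

Only fetch_text touches the network. The two URL builders are pure string functions, so they can be checked offline before any request is sent.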


Hi, Steve. Thank you for providing so many resources. Problem solved.

Reply written 8.2 years ago by Dejian (1.3k)

No problem :) Glad to be of assistance!

Reply written 8.2 years ago by Steve Moss (2.3k)