Extract Human Gene Sequences Based On KEGG Gene Name
12.6 years ago
Dejian ★ 1.3k

Hi, I have a list of human gene names from the KEGG database, for example ALDOA, BHLHB3, PKM2, P4HA1 and EPO. By searching the KEGG database I can get a list of genes with the same name in several species, then click the entry that links to the human gene (e.g. hsa:226), and finally get the amino acid and nucleotide sequences. Since there are hundreds of genes, doing this by hand is clearly not efficient. Is there a more convenient way to do this? Many thanks!

12.6 years ago

Have you thought about using the KEGG API? See the following links for more information:

Also, BioRuby seems to have a pretty good API implemented:

As does the R Bioconductor KEGGSOAP package:

The following (simple) Python script should work a treat for now though ;)

#!/usr/bin/env python
"""
Python script to retrieve KEGG gene entries for a list of genes
Coded by Steve Moss (gawbul [at] gmail [dot] com)
http://about.me/gawbul
"""

# import required modules
from SOAPpy import WSDL

# set up kegg wsdl
kegg_wsdl = 'http://soap.genome.jp/KEGG.wsdl'
kegg_service = WSDL.Proxy(kegg_wsdl)

# set up tuple of gene names
gene_names = ("ALDOA", "BHLHB3", "PKM2", "P4HA1", "EPO")

# iterate over gene_names and retrieve sequences
for gene_name in gene_names:
    # use bfind first to find the list of genes for each query
    # limit to hsa (Homo sapiens)
    # bfind returns a str, so strip the trailing \n and split on \n;
    # drop empty strings so a query with no hits gives an empty list
    found = kegg_service.bfind("genes " + gene_name + " hsa")
    gene_entries = [entry for entry in found.rstrip("\n").split("\n") if entry]
    print "Found %d entries for %s" % (len(gene_entries), gene_name)

    # iterate over gene_entries
    for gene_entry in gene_entries:
        # just use the first part of the string (e.g. hsa:226) to retrieve
        # the sequences in fasta format (-f)
        results = kegg_service.bget("-f " + gene_entry.split(" ")[0])
        # print results to screen
        print results
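
As a side note on the parsing step, here is a small standalone sketch (Python 3) of how the entry IDs can be pulled out of bfind-style text. The helper name parse_entry_ids and the sample string are made up for illustration; the sample only mimics the "ID, then description" layout the script relies on and is not real KEGG output.

```python
def parse_entry_ids(bfind_text):
    """Return the first whitespace-separated token of each non-empty line."""
    ids = []
    for line in bfind_text.splitlines():
        line = line.strip()
        if line:  # skip blank lines, so empty input gives an empty list
            ids.append(line.split()[0])
    return ids


# Made-up sample in an "ID description" layout (an assumption, not real output)
sample = (
    "hsa:226 ALDOA; placeholder description\n"
    "hsa:0000 EXAMPLE; placeholder description\n"
)
print(parse_entry_ids(sample))
print(parse_entry_ids(""))
```

Skipping blank lines means a query with no hits yields an empty list rather than a list holding one empty string.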

You could modify this to read the gene names from a file, and perhaps also write the output to a file instead of printing it to STDOUT.
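
A minimal sketch of that modification, written for Python 3. The helper names (read_gene_names, write_sequences) and the fetch argument are hypothetical: fetch stands in for whatever lookup you use, for example a wrapper around the bfind/bget calls above, so nothing in this snippet contacts KEGG itself.

```python
def read_gene_names(path):
    """Read one gene name per line, ignoring blank lines and stray whitespace."""
    with open(path) as handle:
        return [line.strip() for line in handle if line.strip()]


def write_sequences(gene_names, fetch, out_path):
    """Write the text returned by fetch(gene_name) for each gene to out_path.

    fetch is a caller-supplied (hypothetical) function that returns FASTA
    text as a str, or an empty string when nothing was found.
    Returns the number of genes for which something was written.
    """
    written = 0
    with open(out_path, "w") as out:
        for gene_name in gene_names:
            fasta = fetch(gene_name)
            if not fasta:
                continue  # nothing found for this gene, skip it
            # normalise to exactly one trailing newline per record
            out.write(fasta.rstrip("\n") + "\n")
            written += 1
    return written
```

With a file such as genes.txt holding one name per line, something like write_sequences(read_gene_names("genes.txt"), my_fetch, "sequences.fasta") would collect everything in one FASTA file, where my_fetch is your own lookup function.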

Essentially, this uses the SOAP/WSDL framework to expose the same operations as the HTTP URLs, in a form a program can call (a web service). You can build queries using the KEGG API just as you would a URL: for example, a call such as kegg_service.bget("-f hsa:aldoa") is equivalent to requesting http://www.genome.jp/dbget-bin/www_bget?-f+hsa:aldoa, except that the data is returned to the script as XML rather than as HTML, as it would be to a browser.
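
To make the URL side of that comparison concrete, here is a small Python 3 sketch that assembles the browser-style URL from an entry ID. The function name bget_url and the options parameter are made up for illustration; only the base URL and the "-f+" form come from the example above, and the snippet builds the string without requesting it.

```python
DBGET_BASE = "http://www.genome.jp/dbget-bin/www_bget"


def bget_url(entry_id, options=("-f",)):
    """Join the options and the entry ID with '+', as in the example URL.

    options defaults to ("-f",), the FASTA flag used in the script above;
    passing other options is an assumption about how the CGI accepts them.
    """
    parts = list(options) + [entry_id]
    return DBGET_BASE + "?" + "+".join(parts)


print(bget_url("hsa:aldoa"))
print(bget_url("hsa:226"))
```

The first call reproduces the URL quoted above, and the second shows the same pattern for an ID in the hsa:226 style returned by bfind.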

Hi, Steve. Thank you for providing so many resources. Problem solved.

No problem :) Glad to be of assistance!
