Hi, I have a list of human gene names from the KEGG database, for example ALDOA, BHLHB3, PKM2, P4HA1 and EPO. For each name, I can search the KEGG database to get a list of genes with the same name in several species, click the entry that links to the human gene (e.g. hsa:226), and finally get the amino acid and nucleotide sequences. Since there are hundreds of genes, doing this by hand is not efficient. Is there a convenient way to automate this job? Many thanks!
Have you thought about using the KEGG API? See the following links for more information:
Also, BioRuby seems to have a pretty good API implemented:
As does the R Bioconductor KEGGSOAP package:
The following (simple) Python script should work a treat for now though ;)
    #!/usr/bin/env python
    """
    Python script to retrieve KEGG gene entries for a number of different genes
    Coded by Steve Moss (gawbul [at] gmail [dot] com)
    http://about.me/gawbul
    """

    # import required modules
    from SOAPpy import WSDL

    # set up the KEGG WSDL service proxy
    kegg_wsdl = 'http://soap.genome.jp/KEGG.wsdl'
    kegg_service = WSDL.Proxy(kegg_wsdl)

    # set up tuple of gene names
    gene_names = ("ALDOA", "BHLHB3", "PKM2", "P4HA1", "EPO")

    # iterate over gene_names and retrieve sequences
    for gene_name in gene_names:
        # use bfind first to find the list of genes for each query
        # limit to hsa (Homo sapiens)
        # bfind returns a str, so strip the trailing \n, split on \n
        # and drop any empty lines
        hits = kegg_service.bfind("genes " + gene_name + " hsa")
        gene_entries = [e for e in hits.rstrip("\n").split("\n") if e]
        print("Found %d entries for %s" % (len(gene_entries), gene_name))

        # iterate over gene_entries
        for gene_entry in gene_entries:
            # just use the first part of the string (e.g. hsa:226) to retrieve
            # the sequences in fasta format (-f)
            results = kegg_service.bget("-f " + gene_entry.split(" ")[0])
            # print results to screen
            print(results)
You could modify this to read the gene names from a file instead of hard-coding them, and perhaps also write the output to a file rather than printing it to STDOUT.
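A rough sketch of that modification, assuming the input file holds one gene name per line. The names `read_gene_names`, `write_results` and the `fetch` callable are hypothetical; `fetch` stands in for whatever lookup you use (for example, the bfind/bget calls in the script), so the file handling can be tried without touching the web service.

```python
def read_gene_names(path):
    """Return gene names from a text file, one per line.

    Blank lines and lines starting with '#' are skipped
    (the comment convention is an assumption, not a KEGG format).
    """
    names = []
    with open(path) as handle:
        for line in handle:
            name = line.strip()
            if name and not name.startswith("#"):
                names.append(name)
    return names


def write_results(path, gene_names, fetch):
    """Call fetch(gene_name) for each name and write the returned text to path.

    fetch is a placeholder: any function that takes a gene name and
    returns a string (e.g. FASTA text). Names for which fetch returns
    nothing are skipped. Returns the number of genes written.
    """
    written = 0
    with open(path, "w") as out:
        for gene_name in gene_names:
            text = fetch(gene_name)
            if not text:
                continue
            # make sure each record ends with exactly one newline
            out.write(text.rstrip("\n") + "\n")
            written += 1
    return written
```

You would then call something like `write_results("sequences.fasta", read_gene_names("genes.txt"), fetch_fasta)`, where `fetch_fasta` is your own wrapper around the service calls and both file names are just examples.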
Essentially this uses the SOAP/WSDL framework to expose the same queries you would make through the HTTP URLs as a web service, i.e. in a form a script can call directly. You can build queries using the KEGG API just as you would a URL: for example, a call such as kegg_service.bget("-f hsa:" + gene_name) is the same as requesting http://www.genome.jp/dbget-bin/www_bget?-f+hsa:aldoa, except that the data is returned to the script wrapped in XML (the SOAP response), rather than as HTML as it would be for a browser.
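To make the URL correspondence concrete, here is a minimal sketch that turns a bget-style query into the matching browser URL. The helper name `bget_url` is made up for illustration, and the only rule it assumes is the one visible in the example URL: the option and the entry are joined with a "+".

```python
# base address taken from the example URL in this answer
DBGET_URL = "http://www.genome.jp/dbget-bin/www_bget"


def bget_url(entry_id, option="-f"):
    """Build the browser URL equivalent to a bget query such as "-f hsa:aldoa"."""
    query = (option + " " + entry_id).strip()
    # the space between option and entry becomes "+" in the URL
    return DBGET_URL + "?" + query.replace(" ", "+")


print(bget_url("hsa:aldoa"))
# prints http://www.genome.jp/dbget-bin/www_bget?-f+hsa:aldoa
```

This only builds the address string; it does not fetch anything, so it is safe to run offline.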