problem with KEGG pathway gene extraction
1
3
Entering edit mode
9.0 years ago
Abdullah ▴ 100

Hi,

I have been struggling to extract the set of genes related to each pathway from KEGG. I read multiple topics on biostar on how to do this but apparently there is something wrong. I downloaded the bret hierarchy file in order to get genes on the pathway level that I want. However, there seem to be multiple methods which some of them are working and the others are not.

In this topic, (Extracting List Of Genes Associated With A Pathway In Kegg) a nice method has been mentioned by Pierre on how to do this using the following url: http://togows.dbcls.jp/entry/pathway/hsa05200/genes.json. However, when I try this for other pathways, the json list is empty (e.g., http://togows.dbcls.jp/entry/pathway/hsa01200/genes.json).

Another method which I found in this answer (Kegg Data Download), was using the REST API. However, also for some of the pathways, there are no genes (e.g., http://rest.kegg.jp/get/hsa01200).

Surprisingly, if I take the same pathway, and I go to: http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+path:hsa01200 I'm able to find the list of genes that I'm looking for.

Can someone tell me what is the difference between these 3 methods and what is the recommended method to do this.

kegg • 5.1k views
ADD COMMENT
1
Entering edit mode

The first method only has 401 lines for hsa05200, the latter method has close to 2000. My library uses the latter method. See https://github.com/endrebak/kg

My guess is that the first method is from the outdated kegg (it went unfunded for a while and the public data was not updated AFAIK), and the second is from the second run of KEGG, 2014-2016.

ADD REPLY
3
Entering edit mode
9.0 years ago
Kamil ★ 2.3k

I'm able to get a list of genes by using the kg tool as shown below. Perhaps this is enough to meet your needs. I would also like to be able to specify the type of gene identifier, such as Entrez ID or Ensembl ID (link to issue), but this has not been implemented yet as far as I can tell.

Command:

kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 01200)

Output: (trimmed)

Cache path is: /Users/slowikow/.kegg/ (Time:  Thu, 29 Oct 2015 10:10:24 )
Get KEGG gene to external gene map. (Time:  Thu, 29 Oct 2015 10:10:24 )
Get KEGG path to gene map. (Time:  Thu, 29 Oct 2015 10:10:25 )
Get KEGG pathway to definition map. (Time:  Thu, 29 Oct 2015 10:10:25 )
Connect KEGG gene map and KEGG pathway map. (Time:  Thu, 29 Oct 2015 10:10:26 )
/Users/slowikow/anaconda/lib/python2.7/site-packages/ebs/merge_cols.py:74: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  in_df = in_df.sort(pre_merge_sort_order_col)
Attaching pathway definitions. (Time:  Thu, 29 Oct 2015 10:10:26 )
01200   ME3     Carbon metabolism
01200   PGAMA   Carbon metabolism
01200   PGAM-B  Carbon metabolism
01200   HEL-S-35        Carbon metabolism
01200   PGAM1   Carbon metabolism
01200   PFKF    Carbon metabolism
ADD COMMENT
1
Entering edit mode
kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 01200) | biomartian -d hsap -i external_gene_name -o entrezgene --noheader -c 1 -

0    1    2    entrezgene
01200    ME3    Carbon metabolism    10873
01200    PGAMA    Carbon metabolism
01200    PGAM-B    Carbon metabolism
01200    HEL-S-35    Carbon metabolism
01200    PGAM1    Carbon metabolism    5223
01200    PFKF    Carbon metabolism
01200    PFK-P    Carbon metabolism

As you can see, not all external_gene_names have associated entrezgene ids.

ADD REPLY
0
Entering edit mode

Unfortunately, the truth is more complicated than one might expect. Indeed, biomartian does not retrieve the Entrez IDs for all genes, but the genes do in fact have associated Entrez IDs.

PGAMA is 5223, but it is also known as PGAM1; PGAM-B; HEL-S-35
PGAM-B is 441531, but is also known as PGAM4; PGAM1; PGAM3; dJ1000K24.1
HEL-S-35 is probably 5223
PFKF is 5214, but is also known as PFKP; PFK-C; PFK-P; ATP-PFK
PFK-P is probably 5214

I prefer to use identifiers like Entrez or Ensembl for data analysis to avoid the ambiguity of gene names. On the other hand, I use HGNC names for presentation. Endre, you might like to visit MyGene.info if you have not already. I think the developers have done an admirable job of dealing with name ambiguity.

ADD REPLY
0
Entering edit mode

Thanks for the info, I should switch to a different gene name format by default then. I think KEGGREST offers them, but the service is devoid of documentation.

ADD REPLY
0
Entering edit mode

Thank you. However, I'm not able to run this command from within python, probably it is a silly error but I'm not able to find what is it.

I'm using:

pathwayID = 01200

cmd="/usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo {})".format(pathwayID)
tmp = os.popen(cmd).read()

and I'm getting this error:

sh: -c: line 0: syntax error near unexpected token `('
sh: -c: line 0: `/usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 05200)'
ADD REPLY
0
Entering edit mode

Why do you want to run it from within Python? Anyways, check out the kg/lib.py methods - I'll write some docs for them eventually.

ADD REPLY
0
Entering edit mode

thanks for your suggestion, I was looking for a fast way to run it. I managed to do this using:

cmd="/usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo '{}')".format(pathwayID)

process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,stderr=subprocess.PIPE, executable="/bin/bash")
out, err = process.communicate()
ADD REPLY
0
Entering edit mode

I have updated kg and added an example of how to use it from within python: https://github.com/endrebak/kg

pip install kg==0.0.7
ADD REPLY
0
Entering edit mode

Thanks. When I run the API example, I get this error:

>>> df = get_kegg_data("rno", pathway_definitions)
>>> Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 483, in __call__
    return self._cached_call(args, kwargs)[0]
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 430, in _cached_call
    out, metadata = self.call(*args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 675, in call
    output = self.func(*args, **kwargs)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kegg/lib.py", line 32, in get_kegg_data
    "kegg_pathway")
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 35, in attach_data
    merged_df = _post_merge_cleanup(merged_df, new_col_name)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 74, in _post_merge_cleanup
    in_df = in_df.sort_values_by(pre_merge_sort_order_col)
  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 2246, in __getattr__
    (type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute 'sort_values_by'
ADD REPLY
0
Entering edit mode

I'll fix it later today, thanks.

ADD REPLY
0
Entering edit mode

Now it should work. pip install kg==0.0.8

ADD REPLY
0
Entering edit mode

Thanks. The API example is working now. However, the command line tool still throw this error. It seems that the git is not updated?

ADD REPLY
0
Entering edit mode

That is something on your end. The CLI and API uses the same code. It probably stems from you having multiple versions of Python installed and different versions of the library installed for each. If you encounter more problems please tell me at https://github.com/endrebak/kg/issues

ADD REPLY
0
Entering edit mode

The current version in git seems to be working now. Thanks ;)

ADD REPLY

Login before adding your answer.

Traffic: 986 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6