Question: problem with KEGG pathway gene extraction
3
gravatar for Abdullah
3.4 years ago by
Abdullah100
Germany
Abdullah100 wrote:

Hi,

I have been struggling to extract the set of genes related to each pathway from KEGG. I read multiple topics on biostar on how to do this but apparently there is something wrong. I downloaded the bret hierarchy file in order to get genes on the pathway level that I want. However, there seem to be multiple methods which some of them are working and the others are not.

In this topic, (Extracting List Of Genes Associated With A Pathway In Kegg) a nice method has been mentioned by Pierre on how to do this using the following url: http://togows.dbcls.jp/entry/pathway/hsa05200/genes.json. However, when I try this for other pathways, the json list is emtpy (e.g., http://togows.dbcls.jp/entry/pathway/hsa01200/genes.json).

Another method which I found in this answer (Kegg Data Download), was using the REST API. However, also for some of the pathways, there are no genes (e.g., http://rest.kegg.jp/get/hsa01200).

Surprisingly, if I take the same pathway, and I go to: http://www.genome.jp/dbget-bin/get_linkdb?-t+genes+path:hsa01200 I'm able to find the list of genes that I'm looking for.

Can someone tell me what is the difference between these 3 methods and what is the recommended method to do this.

kegg • 1.7k views
ADD COMMENTlink modified 3.4 years ago by Kamil1.9k • written 3.4 years ago by Abdullah100
1

The first method only has 401 lines for hsa05200, the latter method has close to 2000. My library uses the latter method. See https://github.com/endrebak/kg

My guess is that the first method is from the outdated kegg (it went unfunded for a while and the public data was not updated AFAIK), and the second is from the second run of KEGG, 2014-2016. 

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Endre Bakken Stovner880
3
gravatar for Kamil
3.4 years ago by
Kamil1.9k
Boston
Kamil1.9k wrote:

I'm able to get a list of genes by using the kg tool as shown below. Perhaps this is enough to meet your needs. I would also like to be able to specify the type of gene identifier, such as Entrez ID or Ensembl ID (link to issue), but this has not been implemented yet as far as I can tell.

Command:

kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 01200)

Output: (trimmed)

Cache path is: /Users/slowikow/.kegg/ (Time:  Thu, 29 Oct 2015 10:10:24 )
Get KEGG gene to external gene map. (Time:  Thu, 29 Oct 2015 10:10:24 )
Get KEGG path to gene map. (Time:  Thu, 29 Oct 2015 10:10:25 )
Get KEGG pathway to definition map. (Time:  Thu, 29 Oct 2015 10:10:25 )
Connect KEGG gene map and KEGG pathway map. (Time:  Thu, 29 Oct 2015 10:10:26 )
/Users/slowikow/anaconda/lib/python2.7/site-packages/ebs/merge_cols.py:74: FutureWarning: sort(columns=....) is deprecated, use sort_values(by=.....)
  in_df = in_df.sort(pre_merge_sort_order_col)
Attaching pathway definitions. (Time:  Thu, 29 Oct 2015 10:10:26 )
01200   ME3     Carbon metabolism
01200   PGAMA   Carbon metabolism
01200   PGAM-B  Carbon metabolism
01200   HEL-S-35        Carbon metabolism
01200   PGAM1   Carbon metabolism
01200   PFKF    Carbon metabolism
ADD COMMENTlink modified 3.4 years ago • written 3.4 years ago by Kamil1.9k
1
kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 01200) | biomartian -d hsap -i external_gene_name -o entrezgene --noheader -c 1 -

0    1    2    entrezgene
01200    ME3    Carbon metabolism    10873
01200    PGAMA    Carbon metabolism
01200    PGAM-B    Carbon metabolism
01200    HEL-S-35    Carbon metabolism
01200    PGAM1    Carbon metabolism    5223
01200    PFKF    Carbon metabolism
01200    PFK-P    Carbon metabolism

As you can see, not all external_gene_names have associated entrezgene ids.

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Endre Bakken Stovner880

Unfortunately, the truth is more complicated than one might expect. Indeed, biomartian does not retrieve the Entrez IDs for all genes, but the genes do in fact have associated Entrez IDs.

PGAMA is 5223, but it is also known as PGAM1; PGAM-B; HEL-S-35
PGAM-B is 441531, but is also known as PGAM4; PGAM1; PGAM3; dJ1000K24.1
HEL-S-35 is probably 5223
PFKF is 5214, but is also known as PFKP; PFK-C; PFK-P; ATP-PFK
PFK-P is probably 5214

I prefer to use identifiers like Entrez or Ensembl for data analysis to avoid the ambiguity of gene names. On the other hand, I use HGNC names for presentation. Endre, you might like to visit MyGene.info if you have not already. I think the developers have done an admirable job of dealing with name ambiguity.

ADD REPLYlink written 3.4 years ago by Kamil1.9k

Thanks for the info, I should switch to a different gene name format by default then. I think KEGGREST offers them, but the service is devoid of documentation.

 

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Endre Bakken Stovner880

Thank you. However, I'm not able to run this command from within python, probably it is a silly error but i'm not able to find what is it.

I'm using:
 

pathwayID = 01200

cmd="/usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo {})".format(pathwayID)

tmp = os.popen(cmd).read()

 

and I'm getting this error:

sh: -c: line 0: syntax error near unexpected token `('

sh: -c: line 0: `/usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo 05200)'

 

 

ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Abdullah100

Why do you want to run it from within Python? Anyways, check out the `kg/lib.py` methods - I'll write some docs for them eventually.

ADD REPLYlink written 3.4 years ago by Endre Bakken Stovner880

thanks for your suggestion, I was looking for a fast way to run it. I managed to do this using:

 

cmd="/usr/packages/kg-master/bin/kg --mergecol=0 --noheader --genes --definition --species=hsa <(echo '{}')".format(pathwayID)

process = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE,stderr=subprocess.PIPE, executable="/bin/bash")
out, err = process.communicate()
ADD REPLYlink modified 3.4 years ago • written 3.4 years ago by Abdullah100

I have updated kg and added an example of how to use it from within python: https://github.com/endrebak/kg

pip install kg==0.0.7 

ADD REPLYlink written 3.4 years ago by Endre Bakken Stovner880

Thanks. When I run the API example, I get this error:

 

>>> df = get_kegg_data("rno", pathway_definitions)

>>> Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 483, in __call__

    return self._cached_call(args, kwargs)[0]

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 430, in _cached_call

    out, metadata = self.call(*args, **kwargs)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/joblib/memory.py", line 675, in call

    output = self.func(*args, **kwargs)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/kegg/lib.py", line 32, in get_kegg_data

    "kegg_pathway")

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 35, in attach_data

    merged_df = _post_merge_cleanup(merged_df, new_col_name)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/ebs/merge_cols.py", line 74, in _post_merge_cleanup

    in_df = in_df.sort_values_by(pre_merge_sort_order_col)

  File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/pandas/core/generic.py", line 2246, in __getattr__

    (type(self).__name__, name))

AttributeError: 'DataFrame' object has no attribute 'sort_values_by'
ADD REPLYlink written 3.4 years ago by Abdullah100

I'll fix it later today, thanks.

ADD REPLYlink written 3.4 years ago by Endre Bakken Stovner880

Now it should work. pip install kg==0.0.8

ADD REPLYlink written 3.4 years ago by Endre Bakken Stovner880

 Thanks. The API example is working now. However, the command line tool still throw this error. It seems that the git is not updated?

ADD REPLYlink written 3.4 years ago by Abdullah100

That is something on your end. The CLI and API uses the same code. It probably stems from you having multiple versions of Python installed and different versions of the library installed for each. If you encounter more problems please tell me at https://github.com/endrebak/kg/issues  

ADD REPLYlink written 3.4 years ago by Endre Bakken Stovner880

The current version in git seems to be working now. Thanks ;)

ADD REPLYlink written 3.4 years ago by Abdullah100
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 880 users visited in the last hour