Tool: kg: query KEGG from the command line
8.8 years ago

Since I often have columnar files I need to annotate with KEGG data, I wrote a dinky script that does it for me. Perhaps it will be of use to someone else too?

In the example below, you see a columnar file.

$ head examples/no_index_header.tsv
logFC   AveExpr
Ipcef1  -2.70987558746701   4.80047582653889
Sema3b  2.00143465979322    3.82969788437155
Rab26   -2.40250648553797   5.57320249609294
Arhgap25    -1.84668909768998   3.66617832656769
Ociad2  -1.99052684394044   5.26213130909702
Mmp17   -2.01026790614161   4.88012776225311
C4a 2.22003976804983    3.52842041243544
Gna14   -2.42391191670209   1.56313048066253
Kcna6   -1.74168813159872   6.54586068659631

Now, using the command

$ kg -s rno -m 0 -d examples/no_index_header.tsv

KEGG data for the genes in column 0 (-m 0) is added to the table; -d also adds the KEGG definitions.

index   logFC   AveExpr kegg_pathway    kegg_definition
Ipcef1  -2.70987558746701   4.80047582653889    361474   interaction protein for cytohesin exchange factors 1
Sema3b  2.00143465979322    3.82969788437155    363142   sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3B; K06840 semaphorin 3
Rab26   -2.40250648553797   5.57320249609294    171111   RAB26, member RAS oncogene family; K07913 Ras-related protein Rab-26
Arhgap25    -1.84668909768998   3.66617832656769    500246   Rho GTPase activating protein 25
Ociad2  -1.99052684394044   5.26213130909702    100361733    OCIA domain containing 2
Mmp17   -2.01026790614161   4.88012776225311    288626   matrix metallopeptidase 17; K07997 matrix metalloproteinase-17 (membrane-inserted) [EC:3.4.24.-]
C4a 2.22003976804983    3.52842041243544    24233    complement component 4A (Rodgers blood group); K03989 complement component 4
Gna14   -2.42391191670209   1.56313048066253    309242   guanine nucleotide binding protein, alpha 14; K04636 guanine nucleotide-binding protein subunit alpha-14
Gna14   -2.42391191670209   1.56313048066253    314046   ankyrin repeat and MYND domain containing 2
Kcna6   -1.74168813159872   6.54586068659631    64358    potassium channel, voltage gated shaker related subfamily A, member 6; K04879 potassium voltage-gated channel Shaker-related subfamily A member 6
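The merge above can be pictured as a join between the input rows and a gene-to-KEGG table. Below is a minimal sketch of that idea in plain Python, not kg's actual code: the function name annotate and the choice to keep unmatched rows with empty fields are assumptions for illustration. It also shows why a gene such as Gna14 can appear twice in the output: it matches more than one KEGG record.

```python
from collections import defaultdict

# Two rows copied from the example input above (values kept as strings).
rows = [
    ("C4a", "2.22003976804983", "3.52842041243544"),
    ("Gna14", "-2.42391191670209", "1.56313048066253"),
]

# (gene, KEGG id, definition) records copied from the example output above.
mapping = [
    ("C4a", "24233",
     "complement component 4A (Rodgers blood group); K03989 complement component 4"),
    ("Gna14", "309242",
     "guanine nucleotide binding protein, alpha 14; "
     "K04636 guanine nucleotide-binding protein subunit alpha-14"),
    ("Gna14", "314046", "ankyrin repeat and MYND domain containing 2"),
]


def annotate(rows, mapping, merge_col=0):
    """Hypothetical helper: append (kegg_id, definition) to each row.

    A row is emitted once per matching record, so a gene with two KEGG
    records yields two output rows. Unmatched rows are kept with empty
    annotation fields (an assumption; kg may behave differently).
    """
    lookup = defaultdict(list)
    for gene, kegg_id, definition in mapping:
        lookup[gene].append((kegg_id, definition))

    out = []
    for row in rows:
        matches = lookup.get(row[merge_col], [("", "")])
        for kegg_id, definition in matches:
            out.append(tuple(row) + (kegg_id, definition))
    return out


annotated = annotate(rows, mapping)
for row in annotated:
    print("\t".join(row))
```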

Note that you can also do the reverse and get genes from KEGG ids (-g/--genes). Finally, if you give only a species, all data for that species is dumped to stdout.

kg also exposes a (Python) function called get_kegg(species) in the module kg.lib.

get_kegg downloads all genes, KEGG ids and KEGG id definitions for that species, parses the data and returns it as a pandas DataFrame.
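To give a feel for the parsing step, here is a rough sketch, not the code inside kg.lib. It assumes the KEGG REST listing (https://rest.kegg.jp/list/<species>) returns tab-separated lines whose first field is `species:id` and whose last field is `names; definition`; the exact layout may differ between KEGG releases. The function names fetch_kegg_list and parse_kegg_list are hypothetical, and the sample line is put together from the C4a row in the example output.

```python
from urllib.request import urlopen


def fetch_kegg_list(species):
    """Download the raw gene listing for a species (needs network access)."""
    with urlopen(f"https://rest.kegg.jp/list/{species}") as response:
        return response.read().decode("utf-8")


def parse_kegg_list(text):
    """Turn the raw listing into one record per gene name.

    Assumed line layout: '<species>:<id>\\t...\\t<names>; <definition>'.
    Entries without a '; ' separator end up with an empty definition.
    """
    records = []
    for line in text.splitlines():
        if not line.strip():
            continue
        fields = line.split("\t")
        kegg_id = fields[0].split(":", 1)[-1]        # drop the 'rno:' prefix
        names, _, definition = fields[-1].partition("; ")
        for gene in names.split(", "):               # one record per alias
            records.append(
                {"gene": gene, "kegg_id": kegg_id, "definition": definition}
            )
    return records


# Sample line assembled from the C4a row shown in the example output.
sample = (
    "rno:24233\tC4a; complement component 4A (Rodgers blood group); "
    "K03989 complement component 4\n"
)
records = parse_kegg_list(sample)
print(records[0]["gene"], records[0]["kegg_id"])
```

With network access, parse_kegg_list(fetch_kegg_list("rno")) would give a list of records that can be passed to pandas.DataFrame.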

pip install kg

Note that the install is rather expensive: recent versions of pandas, biopython, joblib and docopt are installed along with it.

Full command line interface:


Get KEGG data from the command line.
(Visit for examples and help.)

    kg --help
    kg --mergecol=COL --species=SPEC [--genes] [--definitions] [--noheader] FILE
    kg --species=SPEC
    kg --removecache

    FILE                    infile to add KEGG data to (read STDIN with -)
    -s SPEC --species=SPEC  name of species (examples: hsa, mmu, rno...)
    -m COL --mergecol=COL   column (0-indexed int or name) containing gene names

    -h --help               show this message
    -n --noheader           the input data does not contain a header
    -d --definitions        add KEGG pathway definitions to the output
    -g --genes              get the genes related to KEGG pathways
                            (when used, mergecol COL should contain KEGG pathway
    --removecache           removes the local cache so that the KEGG REST DB is
                            accessed anew

Cool! But this doesn't cover plants? I got zero results

