Question: Scripting solution to generate a list of KEGG ORTHOLOGY (KO) terms from a tab-delimited annotation file
Asked 13 days ago by jvire1:

Does anyone know a basic scripting approach (perhaps awk or Python) for extracting KEGG Orthology (KO) terms from a tab-delimited annotation file?

The file in question has rows that look like this:

TRINITY_DN18877_c0_g1_i1    KEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995
TRINITY_DN6301_c0_g1_i1     KEGG:zma:103647201`KO:K10798
TRINITY_DN12892_c3_g5_i1    KEGG:zma:103643875
TRINITY_DN13158_c1_g2_i35   KEGG:vvi:100249085`KO:K02435

What I ultimately need is the transcript ID from column one together with the KO term from column two, with the "KO:" prefix removed. Like this:

TRINITY_DN6301_c0_g1_i1     K10798

The end goal is to use the list with KEGG Mapper to see which KEGG pathways are present and most abundant in my transcriptome assembly.

rna-seq • 123 views
Answered 13 days ago by Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087):
awk '{n=split($2,a,/`/);for(i=1;i<=n;++i) if(substr(a[i],1,3)=="KO:") printf("%s %s\n",$1,substr(a[i],4));}' input.txt
TRINITY_DN18877_c0_g1_i1 K02995
TRINITY_DN6301_c0_g1_i1 K10798
TRINITY_DN13158_c1_g2_i35 K02435
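
For readers who prefer Python, the logic of the awk one-liner above (split column two on backticks, keep the entries that start with "KO:", drop the prefix) can be sketched roughly as follows. The function name `ko_pairs` and the in-memory `sample` list are illustrative assumptions, not part of the original answer; the sample rows are the ones from the question, and the output is tab-separated rather than space-separated.

```python
def ko_pairs(lines):
    """Yield (transcript_id, ko_term) for every KO entry found in column two."""
    for line in lines:
        # Like awk's default behaviour: split on runs of whitespace.
        fields = line.split()
        if len(fields) < 2:
            continue
        # Annotations in column two are separated by backticks.
        for entry in fields[1].split("`"):
            if entry.startswith("KO:"):
                yield fields[0], entry[3:]  # strip the "KO:" prefix


# Rows copied from the question (hypothetical in-memory stand-in for the file).
sample = [
    "TRINITY_DN18877_c0_g1_i1\tKEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995",
    "TRINITY_DN6301_c0_g1_i1\tKEGG:zma:103647201`KO:K10798",
    "TRINITY_DN12892_c3_g5_i1\tKEGG:zma:103643875",
    "TRINITY_DN13158_c1_g2_i35\tKEGG:vvi:100249085`KO:K02435",
]

for transcript_id, ko in ko_pairs(sample):
    print(f"{transcript_id}\t{ko}")
```

To run it on a real file, pass an open file handle instead of `sample`, since iterating over a file yields its lines.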

Thank you! Worked like a charm.


Reply written 13 days ago by jvire1
Answered 13 days ago by Sparrow_kop:

In Python, assuming the delimiter is a tab:

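with open('your_file', 'r') as f:
    for line in f:
        fields = line.rstrip('\n').split('\t')
        if len(fields) < 2:
            continue
        # Column two holds backtick-separated annotations; keep only the KO terms,
        # wherever they appear in the list, and strip the "KO:" prefix.
        for term in fields[1].split('`'):
            if term.startswith('KO:'):
                print(fields[0] + '\t' + term[3:])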
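
Since the question also asks which KO terms are most abundant, a possible follow-up is to tally how many transcripts carry each term. The sketch below is only one way to do that: it uses a regular expression rather than splitting on the delimiter, and it assumes KO identifiers have the form "K" followed by five digits. The name `count_ko_terms` and the last sample row are hypothetical, added purely so that one term occurs twice.

```python
import re
from collections import Counter

# Assumption: a KO identifier is "K" followed by exactly five digits.
KO_PATTERN = re.compile(r"KO:(K\d{5})")


def count_ko_terms(lines):
    """Return a Counter mapping each KO term to the number of times it appears."""
    counts = Counter()
    for line in lines:
        counts.update(KO_PATTERN.findall(line))
    return counts


sample = [
    "TRINITY_DN18877_c0_g1_i1\tKEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995",
    "TRINITY_DN6301_c0_g1_i1\tKEGG:zma:103647201`KO:K10798",
    "TRINITY_DN12892_c3_g5_i1\tKEGG:zma:103643875",
    "TRINITY_DN13158_c1_g2_i35\tKEGG:vvi:100249085`KO:K02435",
    "hypothetical_transcript\tKO:K10798",  # made-up row, not from the question
]

# Print KO terms from most to least frequent.
for ko, n in count_ko_terms(sample).most_common():
    print(f"{ko}\t{n}")
```

Note that this counts annotated transcripts per KO term, which is not the same as expression-level abundance.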