Does anyone happen to know a basic scripting approach (perhaps awk or Python) for extracting KEGG Orthology (KO) terms from a tab-delimited annotation file?
The file in question has rows that look like this:
TRINITY_DN18877_c0_g1_i1	KEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995
TRINITY_DN6301_c0_g1_i1	KEGG:zma:103647201`KO:K10798
TRINITY_DN12892_c3_g5_i1	KEGG:zma:103643875
TRINITY_DN13158_c1_g2_i35	KEGG:vvi:100249085`KO:K02435
What I ultimately need is to extract the transcript ID in column one and the KO terms in column two. Like this:
The end goal is to use the list with KEGG Mapper (http://www.kegg.jp/kegg/tool/map_pathway.html) to see what KEGG pathways are present and most abundant in my transcriptome assembly.
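
For reference, here is a minimal Python sketch of one way this could be done. It rests on a few assumptions that may not hold for the real file: the two columns are separated by a single tab, the annotations in column two are separated by backticks, and KO terms always carry the `KO:` prefix. The function name `extract_ko` is made up for illustration, and whether the `KO:` prefix should be kept or stripped in the output is a guess (it is stripped here). Rows without a KO term are skipped.

```python
def extract_ko(lines):
    """Yield (transcript_id, ko_term) pairs from tab-delimited annotation rows.

    Assumes column one is the transcript ID and column two is a
    backtick-separated list of annotations such as KEGG:... and KO:...
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed rows
        transcript_id, annotations = fields[0], fields[1]
        for entry in annotations.split("`"):
            if entry.startswith("KO:"):
                # Drop the "KO:" prefix and keep only the K number.
                yield transcript_id, entry[len("KO:"):]


# The sample rows from the post, with an explicit tab between the columns.
sample_rows = [
    "TRINITY_DN18877_c0_g1_i1\tKEGG:zma:103654828`KEGG:zma:103654829`KEGG:zma:542341`KO:K02995",
    "TRINITY_DN6301_c0_g1_i1\tKEGG:zma:103647201`KO:K10798",
    "TRINITY_DN12892_c3_g5_i1\tKEGG:zma:103643875",
    "TRINITY_DN13158_c1_g2_i35\tKEGG:vvi:100249085`KO:K02435",
]

for transcript_id, ko in extract_ko(sample_rows):
    print(f"{transcript_id}\t{ko}")
```

To run it on a real file, an open file handle can be passed in place of `sample_rows`, for example `with open("annotations.tsv") as handle: ...`, where the filename is a placeholder.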