Question

KEGG data parse

6.5 years ago

sharmatina189059 ▴ 110

Hello, I have a file like this:

ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     alcohol:NAD+ oxidoreductase
REACTION    (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC    R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE   primary alcohol [CPD:C00226];
PRODUCT     aldehyde [CPD:C00071];
ENTRY       EC 1.1.1.157                Enzyme
NAME        3-hydroxybutyryl-CoA dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION    (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]
ALL_REAC    R01976;
SUBSTRATE   (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT     3-acetoacetyl-CoA [CPD:C00332];

and I need to convert it to a table with the header

ENTRY NAME CLASS SYSNAME REACTION ALL_REAC SUBSTRATE PRODUCT

and the corresponding values in rows, one row per entry. Can anybody help me write a script for this purpose?

R data parsing • 2.0k views

ADD COMMENT • link updated 6.5 years ago by Paul ★ 1.5k • written 6.5 years ago by sharmatina189059 ▴ 110

$ awk -F "     "  'FNR<9 {sub(" ","\t");gsub(";","");print $1,$2}' test | datamash transpose --no-strict | tr -d " " > out.txt

output (tab separated):

 $ cat out.txt 
ENTRY   NAME    CLASS   SYSNAME REACTION    ALL_REAC    SUBSTRATE   PRODUCT
EC1.1.1.1   alcoholdehydrogenase    Oxidoreductases alcohol:NAD+oxidoreductase  (1)aprimaryalcohol+NAD+=analdehyde+NADH+H+[RN:R00623]   R00623>R00754R02124R02878R04805R04880R05233R05234R06917R06927R08281R08306R08557R08558R10783 primaryalcohol[CPD:C00226]  aldehyde[CPD:C00071]

ADD REPLY • link 6.4 years ago by cpad0112 21k

This command gives the correct output for the first entry only. Could you please adapt it to handle the entire file? I am not proficient in awk.

ADD REPLY • link 6.5 years ago by sharmatina189059 ▴ 110

input:

$ cat test
ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     alcohol:NAD+ oxidoreductase
REACTION    (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC    R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE   primary alcohol [CPD:C00226];
PRODUCT     aldehyde [CPD:C00071];
ENTRY       EC 1.1.1.157                Enzyme
NAME        3-hydroxybutyryl-CoA dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION    (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]
ALL_REAC    R01976;
SUBSTRATE   (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT     3-acetoacetyl-CoA [CPD:C00332];

command:

 $ sed 's/\s\+/\t /;s/.*ENT/\n&/g;s/          /\t/g' test | cut -f1,2 | mlr --ixtab --omd cat | sed '2d;s/| //;s/\s*|\s*/\t/g'

output:

ENTRY   NAME    CLASS   SYSNAME REACTION    ALL_REAC    SUBSTRATE   PRODUCT 
EC 1.1.1.1  alcohol dehydrogenase;  Oxidoreductases;    alcohol:NAD+ oxidoreductase (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623]; R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783; primary alcohol [CPD:C00226];   aldehyde [CPD:C00071];  
EC 1.1.1.157    3-hydroxybutyryl-CoA dehydrogenase; Oxidoreductases;    (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase  (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]  R01976; (S)-3-hydroxybutanoyl-CoA [CPD:C01144]; 3-acetoacetyl-CoA [CPD:C00332];

Miller can be installed from the Ubuntu (up to Xenial 16.04) and Mint (Sonya 18.2) repositories. However, you would need the latest version of Miller, so compile it from the Miller GitHub repository.

ADD REPLY • link 6.4 years ago by cpad0112 21k

score 5 · Answer 1 · 2017-10-13

6.5 years ago

Paul ★ 1.5k

Hi, this solution works on your example data. I drop the first column (the field label) and replace spaces with commas, then use the tr and paste commands to join every eight lines into one row. Finally, I add the header you asked for. This only works as long as every entry has the same number of rows.

Please test it.

awk -v OFS="," '$1=$1' INPUT | awk -F"," '{for( i=2; i<=NF; i++ ){printf( "%s ", $i )}; printf( "\n"); }'  | tr " " "," | paste - - - - - - - -  | awk -v OFS="\t" 'BEGIN{print "ENTRY","NAME","CLASS","SYSNAME","REACTION","ALL_REAC","SUBSTRATE","PRODUCT"}1'
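
The pipeline above relies on every entry having exactly eight lines. If some entries lack a field, one possible variant is to start a new row at each ENTRY line and place values by field name instead of by position. The following is only a sketch under a few assumptions: each field sits on a single line, the trailing entry type "Enzyme" and trailing semicolons are not wanted in the table, and the function name kegg_to_tsv is made up for illustration.

```shell
#!/bin/sh
# kegg_to_tsv is a hypothetical helper name, not part of any tool in this thread.
# It prints a header plus one tab-separated row per ENTRY, matching columns by label.
kegg_to_tsv() {
    awk '
    function flush_row(   i) {
        if (!have) return
        for (i = 1; i <= ncol; i++)
            printf "%s%s", row[cols[i]], (i < ncol ? "\t" : "\n")
        split("", row)                        # portable way to empty the array
        have = 0
    }
    BEGIN {
        ncol = split("ENTRY NAME CLASS SYSNAME REACTION ALL_REAC SUBSTRATE PRODUCT", cols, " ")
        for (i = 1; i <= ncol; i++)
            printf "%s%s", cols[i], (i < ncol ? "\t" : "\n")
    }
    NF == 0 { next }                          # skip blank lines
    $1 == "ENTRY" { flush_row() }             # a new ENTRY closes the previous row
    {
        val = $0
        sub(/^[^ \t]+[ \t]*/, "", val)        # drop the field label and its padding
        sub(/[ \t]*;?[ \t]*$/, "", val)       # drop a trailing semicolon and spaces
        if ($1 == "ENTRY")
            sub(/[ \t]+Enzyme$/, "", val)     # assumption: the entry type is not wanted
        row[$1] = val
        have = 1
    }
    END { flush_row() }
    ' "$@"
}

# Demo on a shortened record that has no SYSNAME, REACTION, SUBSTRATE or PRODUCT line.
kegg_to_tsv <<'EOF'
ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
ALL_REAC    R01976;
EOF
```

Called as kegg_to_tsv INPUT > out.tsv it reads a file instead of standard input. Missing fields come out as empty cells, so the columns stay aligned; labels that are not in the header list are silently ignored.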

ADD COMMENT • link 6.5 years ago by Paul ★ 1.5k

What should I do if I want to replace each run of multiple spaces with a comma, rather than every single space?

ADD REPLY • link 6.5 years ago by sharmatina189059 ▴ 110

Hi, try sed: sed 's/ \{1,\}/,/g' file or, if you prefer tr: tr -s ' ' < file | tr ' ' ',' . And what about my script? Does it work for you?
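
To make the difference concrete, here is a small sketch that runs both commands on one line from the question. Note that \{1,\} also matches a single space, so spaces inside a value are turned into commas too. The last command, with \{2,\}, is an assumed variant for the case where only the padding (two or more spaces) should become a comma.

```shell
#!/bin/sh
line='NAME        alcohol dehydrogenase;'

# Every run of one or more spaces becomes one comma; both commands agree.
printf '%s\n' "$line" | sed 's/ \{1,\}/,/g'       # NAME,alcohol,dehydrogenase;
printf '%s\n' "$line" | tr -s ' ' | tr ' ' ','    # NAME,alcohol,dehydrogenase;

# Only runs of two or more spaces become a comma, so single spaces survive.
printf '%s\n' "$line" | sed 's/ \{2,\}/,/g'       # NAME,alcohol dehydrogenase;
```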

ADD REPLY • link 6.5 years ago by Paul ★ 1.5k

Yes, I tried it. It works well. Thank you so much.

ADD REPLY • link 6.4 years ago by sharmatina189059 ▴ 110

But what should I do if my text spans multiple lines?

ADD REPLY • link 6.4 years ago by sharmatina189059 ▴ 110

Could you please copy/paste a larger example of your text? I'll look at it :-)
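
Until a fuller example is posted, one guess is that a long field wraps onto following lines that begin with whitespace and carry no label. Under that assumption only, a preprocessing step could join each such continuation line onto the line before it, so that every field is back on one line before the table is built. This is a sketch, the name unwrap_fields is made up for illustration, and the wrapped record in the demo is invented.

```shell
#!/bin/sh
# unwrap_fields is a hypothetical helper name.
# Assumption: a continuation line starts with a space or tab and has no label.
unwrap_fields() {
    awk '
    /^[ \t]/ {                     # continuation line
        sub(/^[ \t]+/, "")         # strip its indentation
        buf = buf " " $0           # append it to the pending logical line
        next
    }
    {
        if (NR > 1) print buf      # a labelled line starts: emit the previous one
        buf = $0
    }
    END { if (NR > 0) print buf }  # emit the last pending line
    ' "$@"
}

# Demo on an invented record whose NAME field wraps onto a second line.
unwrap_fields <<'EOF'
NAME        first name;
            second name;
CLASS       Oxidoreductases;
EOF
```

The demo prints the NAME line with "second name;" appended after a single space, followed by the CLASS line unchanged. The output can then be piped into whichever of the commands in this thread is used to build the table.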

ADD REPLY • link 6.4 years ago by Paul ★ 1.5k