Question: KEGG data parse
9 months ago by
United States
sharmatina18905930 wrote:

Hello, I have a file like this:

ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     alcohol:NAD+ oxidoreductase
REACTION    (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC    R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE   primary alcohol [CPD:C00226];
PRODUCT     aldehyde [CPD:C00071];
ENTRY       EC 1.1.1.157                Enzyme
NAME        3-hydroxybutyryl-CoA dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION    (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]
ALL_REAC    R01976;
SUBSTRATE   (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT     3-acetoacetyl-CoA [CPD:C00332];

and I need to convert it to a table with this header

ENTRY NAME CLASS SYSNAME REACTION ALL_REAC SUBSTRATE PRODUCT

and the corresponding values in rows. Can anybody help me write a script for this purpose?

R data parsing • 375 views
modified 9 months ago by Paul1.1k • written 9 months ago by sharmatina18905930
$ awk -F "     "  'FNR<9 {sub(" ","\t");gsub(";","");print $1,$2}' test | datamash transpose --no-strict | tr -d " " > out.txt

output (tab separated):

 $ cat out.txt 
ENTRY   NAME    CLASS   SYSNAME REACTION    ALL_REAC    SUBSTRATE   PRODUCT
EC1.1.1.1   alcoholdehydrogenase    Oxidoreductases alcohol:NAD+oxidoreductase  (1)aprimaryalcohol+NAD+=analdehyde+NADH+H+[RN:R00623]   R00623>R00754R02124R02878R04805R04880R05233R05234R06917R06927R08281R08306R08557R08558R10783 primaryalcohol[CPD:C00226]  aldehyde[CPD:C00071]
modified 8 months ago • written 9 months ago by cpad01127.7k

This command gives the correct output for the first entry only. Could you please adapt it to work on the entire file? I am not proficient in awk.

written 8 months ago by sharmatina18905930

input:

$ cat test
ENTRY       EC 1.1.1.1                  Enzyme
NAME        alcohol dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     alcohol:NAD+ oxidoreductase
REACTION    (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623];
ALL_REAC    R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783;
SUBSTRATE   primary alcohol [CPD:C00226];
PRODUCT     aldehyde [CPD:C00071];
ENTRY       EC 1.1.1.157                Enzyme
NAME        3-hydroxybutyryl-CoA dehydrogenase;
CLASS       Oxidoreductases;
SYSNAME     (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase
REACTION    (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]
ALL_REAC    R01976;
SUBSTRATE   (S)-3-hydroxybutanoyl-CoA [CPD:C01144];
PRODUCT     3-acetoacetyl-CoA [CPD:C00332];

command:

 $ sed 's/\s\+/\t /;s/.*ENT/\n&/g;s/          /\t/g' test | cut -f1,2 | mlr --ixtab --omd cat | sed '2d;s/| //;s/\s*|\s*/\t/g'

output:

ENTRY   NAME    CLASS   SYSNAME REACTION    ALL_REAC    SUBSTRATE   PRODUCT 
EC 1.1.1.1  alcohol dehydrogenase;  Oxidoreductases;    alcohol:NAD+ oxidoreductase (1) a primary alcohol + NAD+ = an aldehyde + NADH + H+ [RN:R00623]; R00623 > R00754 R02124 R02878 R04805 R04880 R05233 R05234 R06917 R06927 R08281 R08306 R08557 R08558 R10783; primary alcohol [CPD:C00226];   aldehyde [CPD:C00071];  
EC 1.1.1.157    3-hydroxybutyryl-CoA dehydrogenase; Oxidoreductases;    (S)-3-hydroxybutanoyl-CoA:NADP+ oxidoreductase  (S)-3-hydroxybutanoyl-CoA + NADP+ = 3-acetoacetyl-CoA + NADPH + H+ [RN:R01976]  R01976; (S)-3-hydroxybutanoyl-CoA [CPD:C01144]; 3-acetoacetyl-CoA [CPD:C00332];

Miller (mlr) can be installed from the Ubuntu (up to xenial 16.04) and Mint (sonya 18.2) repositories. However, this command needs the latest version of Miller, so compile it from the Miller GitHub repository.

modified 8 months ago • written 8 months ago by cpad01127.7k
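
For readers who cannot install a recent Miller, the same reshaping can be sketched in plain awk. This is only a minimal sketch, not the command from the reply above: the function name kegg_rows is made up for illustration, and it assumes every field sits on a single line and that the fields of each entry appear in the same order as the header.

```shell
# Hypothetical helper (not from the thread). Usage: kegg_rows [FILE...]
kegg_rows() {
  awk '
    BEGIN { print "ENTRY\tNAME\tCLASS\tSYSNAME\tREACTION\tALL_REAC\tSUBSTRATE\tPRODUCT" }
    NF == 0 { next }                      # skip blank lines
    {
      val = $0
      sub(/^[^ \t]+[ \t]+/, "", val)      # drop the field label and its padding
      sub(/[; \t]+$/, "", val)            # drop trailing semicolons and spaces
    }
    $1 == "ENTRY" {
      sub(/[ \t][ \t]+.*$/, "", val)      # keep "EC 1.1.1.1", drop the "Enzyme" column
      if (n++) printf "\n"                # finish the previous row
      printf "%s", val
      next
    }
    { printf "\t%s", val }                # append every other field to the current row
    END { if (n) printf "\n" }
  ' "$@"
}
```

It could be run as kegg_rows test > out.txt. Because the values are appended in file order, an entry with a missing or reordered field would end up with shifted columns.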
9 months ago by
Paul1.1k
European Union
Paul1.1k wrote:

Hi, this solution works on your example data. I drop the first column (the field labels) and replace spaces with commas, then use tr and paste to join every eight lines into one row, and finally add the header you asked for. This only works as long as every entry has the same number of rows (eight here).

Please test it.

awk -v OFS="," '$1=$1' INPUT | awk -F"," '{for( i=2; i<=NF; i++ ){printf( "%s ", $i )}; printf( "\n"); }'  | tr " " "," | paste - - - - - - - -  | awk -v OFS="\t" 'BEGIN{print "ENTRY","NAME","CLASS","SYSNAME","REACTION","ALL_REAC","SUBSTRATE","PRODUCT"}1'
modified 9 months ago • written 9 months ago by Paul1.1k
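
The "same number of rows" caveat can be relaxed by storing each value under its field label instead of relying on line position. The following is a rough sketch under assumptions, not the answer's own method: kegg_table is a hypothetical name, each field is assumed to fit on one line, and a field missing from an entry is written as an empty column.

```shell
# Hypothetical helper (not from the thread). Usage: kegg_table [FILE...]
kegg_table() {
  awk '
    # Print the buffered entry as one tab-separated row, then reset the buffer.
    function emit(   i, row) {
      if (have) {
        row = rec[cols[1]]
        for (i = 2; i <= ncol; i++) row = row "\t" rec[cols[i]]
        print row
      }
      for (i = 1; i <= ncol; i++) delete rec[cols[i]]
      have = 0
    }
    BEGIN {
      # Column order requested in the question.
      ncol = split("ENTRY NAME CLASS SYSNAME REACTION ALL_REAC SUBSTRATE PRODUCT", cols, " ")
      hdr = cols[1]
      for (i = 2; i <= ncol; i++) hdr = hdr "\t" cols[i]
      print hdr
    }
    NF == 0 { next }                       # skip blank lines
    $1 == "ENTRY" { emit(); have = 1 }     # a new entry starts: flush the previous one
    {
      key = $1
      val = $0
      sub(/^[^ \t]+[ \t]+/, "", val)       # drop the label and its padding
      sub(/[; \t]+$/, "", val)             # drop trailing semicolons and spaces
      if (key == "ENTRY") sub(/[ \t][ \t]+.*$/, "", val)   # drop the "Enzyme" column
      rec[key] = val
    }
    END { emit() }
  ' "$@"
}
```

Called as kegg_table INPUT, it prints the header followed by one row per ENTRY. Labels that are not in the header list are silently ignored.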

What should I do if I want to replace multiple spaces with a comma, not each single space with a comma?

written 8 months ago by sharmatina18905930

Hi, try sed: sed 's/ \{1,\}/,/g' file or, if you prefer tr: tr -s ' ' < file | tr ' ' ',' . Both turn every run of one or more spaces into a single comma. And what about my script? Does it work for you?

written 8 months ago by Paul1.1k
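
If the intent is instead to convert only runs of two or more spaces and leave the single spaces inside values alone, the interval in the sed expression can start at 2. A small sketch follows; the wrapper name spaces_to_commas is invented for illustration, and it assumes the label padding is always at least two spaces wide, as it is in the sample.

```shell
# Hypothetical helper (not from the thread). Usage: spaces_to_commas [FILE...]
# Replaces each run of two or more spaces with one comma; single spaces are kept.
spaces_to_commas() {
  sed 's/ \{2,\}/,/g' "$@"
}

printf '%s\n' 'ENTRY       EC 1.1.1.1                  Enzyme' | spaces_to_commas
# prints: ENTRY,EC 1.1.1.1,Enzyme
```

Values that already contain commas would still need a different delimiter, such as a tab.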

Yes, I tried it. It works well. Thank you so much.

written 8 months ago by sharmatina18905930

But what should I do if my text spans multiple lines?

written 8 months ago by sharmatina18905930

Could you please copy/paste a larger example of your text? I'll take a look at it :-)

written 8 months ago by Paul1.1k
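
Until a multi-line example is posted, one possible approach is to join wrapped lines back onto their field before reshaping. The sketch below rests on an assumption the thread does not confirm: that a continuation line starts with whitespace instead of a field label. The name unfold_kegg is hypothetical.

```shell
# Hypothetical helper (not from the thread). Usage: unfold_kegg [FILE...]
# Assumption: a line that starts with whitespace continues the previous field.
unfold_kegg() {
  awk '
    NF == 0 { next }                      # ignore blank lines
    /^[ \t]/ && started {                 # continuation line
      sub(/^[ \t]+/, "")                  # strip the indentation
      buf = buf " " $0                    # glue it onto the current field
      next
    }
    { if (started) print buf; buf = $0; started = 1 }
    END { if (started) print buf }
  ' "$@"
}
```

The output has one line per field again, so it can be piped into the one-line-per-field commands posted earlier in this thread, provided each entry still has the fields they expect.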