How to extract header from within the data (line of a file)
3.4 years ago
6schulte ▴ 30

Hi everyone :)

I have a file that contains information in each line like:

ID=1234_1      Name=First       size_aa=7890      start_type=none      Value=0.123
ID=1234_2      Name=Second      size_aa=7969      start_type=none      Value=0.122
ID=1233        Name=Third       size_aa=753       start_type=ft        Value=0.223
ID=445         Class=ED         size_aa=4653      start_type=fp        Value=0.223

(The spaces shown above stand in for tabs; the real file is tab-separated.)

I would like to split it and get a file like:

ID        Name        size_aa    start_type  Value
1234_1    First       7890       none        0.123
1234_2    Second      7969       none        0.122
1233      Third       753        ft          0.223
445       Class=ED    4653       fp          0.223

I have tried different things but I never quite get there, and as I got really nice tips the last two times I asked for help on Biostars, I decided to ask again... I hope you can help me out! Any help will be appreciated :)

P.S.: My approaches so far were built on the idea of splitting the file into two files. One part would be used to work on the header, the other to work on the data. In the first part, once everything between a '=' and the following tab was deleted, only the headers would remain. Then I would look for these tab-separated words in the second part and delete every occurrence of them (including the '='), leaving only the values behind.

This seems overly complicated to me though... There is probably an easier solution!?

Thank you!

data extraction bash data processing • 1.7k views

The approach I tossed in my answer will only solve part of your question, I just noticed.

So you want the "column" names only once in the output file? Are you aware that some columns apparently have different names?


Yes, I want the "column" names only once in the output file. Columns with different names are a worst-case scenario: it should not happen, but it is possible. That is why, in the case of an anomaly (like "Class=ED" instead of "Name=XY" in line 4), I would like to keep the identifier 'Something=' together with the value, or even create a new column for this information at the end of the table and leave a blank in its place.

This is why I would like the column to be named "Name" even though the entry in row 4 suggests otherwise.

ID        Name        size_aa    start_type  Value
...
445       Class=ED    4653       fp          0.223
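
The rule described above (take the header from the first line, strip the key where it matches, keep the whole 'Something=value' where it does not) could be sketched in awk roughly as follows. This is only a sketch under assumptions: the function name `kv_to_table` and the sample file `test.txt` are made up for illustration, the input is assumed to be tab-separated, and the first line is assumed to carry the "correct" column names.

```shell
#!/usr/bin/env bash
# Hypothetical sample file reproducing the input from the question (tab-separated).
printf 'ID=1234_1\tName=First\tsize_aa=7890\tstart_type=none\tValue=0.123\n'   > test.txt
printf 'ID=1234_2\tName=Second\tsize_aa=7969\tstart_type=none\tValue=0.122\n' >> test.txt
printf 'ID=1233\tName=Third\tsize_aa=753\tstart_type=ft\tValue=0.223\n'       >> test.txt
printf 'ID=445\tClass=ED\tsize_aa=4653\tstart_type=fp\tValue=0.223\n'         >> test.txt

# kv_to_table FILE (hypothetical helper name)
# Prints one header line taken from the keys of the first record, then one
# line per record. A field whose key differs from the header of its column
# is printed unchanged (e.g. "Class=ED" stays in the "Name" column).
kv_to_table() {
  awk '
    BEGIN { FS = OFS = "\t" }
    NR == 1 {
      # Build and print the header from the keys of the first line.
      for (i = 1; i <= NF; i++) {
        hdr[i] = $i
        sub(/=.*/, "", hdr[i])
        printf "%s%s", hdr[i], (i < NF ? OFS : ORS)
      }
    }
    {
      for (i = 1; i <= NF; i++) {
        eq  = index($i, "=")            # position of the first "="
        key = substr($i, 1, eq - 1)
        val = substr($i, eq + 1)
        out = (key == hdr[i]) ? val : $i  # keep "key=value" on a mismatch
        printf "%s%s", out, (i < NF ? OFS : ORS)
      }
    }
  ' "$1"
}

kv_to_table test.txt
```

Creating a separate column for the mismatched entries instead would need a second pass over the file, because the set of extra keys is only known after all lines have been read.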
3.4 years ago

See if this helps. The 'Class=ED' line is a problem for a generic parser.

input:

$ cat test.txt
ID=1234_1       Name=First      size_aa=7890    start_type=none Value=0.123
ID=1234_2       Name=Second     size_aa=7969    start_type=none Value=0.122
ID=1233 Name=Third      size_aa=753     start_type=ft   Value=0.223
ID=445  Class=ED        size_aa=4653    start_type=fp   Value=0.223

output:

$  mlr --d2t --ifs '\t' cat test.txt
ID      Name    size_aa start_type      Value
1234_1  First   7890    none    0.123
1234_2  Second  7969    none    0.122
1233    Third   753     ft      0.223

ID      Class   size_aa start_type      Value
445     ED      4653    fp      0.223

Miller (the mlr command) is available in the Ubuntu/Debian and other GNU/Linux repositories.


This is great! Thank you! I will get back to my code in a few days. Then I will be able to give more detailed feedback and see if it works the same for me but I wanted to thank you already anyhow.

3.4 years ago

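A bash one-liner with sed can fix this:

cat <your file> | sed 's/[A-Za-z_]*=//g'

This removes everything up to and including the = sign from each field, retaining only the values. (The sed expression is quoted so that the shell does not try to expand the brackets and the asterisk.)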

EDIT:

To get the header names you can indeed run over the file twice, something like this for instance:

cat <( cat <your file> | sed 's/=[^[:space:]]*//g' | head -1 ) <( cat <your file> | sed 's/[A-Za-z_]*=//g' )

What this does: it first executes the commands between <( ) and then cats their results to the screen. The first sed removes each '=' together with the value that follows it (everything up to the next whitespace), so only the names remain. The above command keeps the first line as header info and ignores all other possible header names. You can of course make variations on this: e.g. use uniq instead of head -1 and you will get all possible header names at the top of the file.
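
As a rough illustration of the two passes, here is a sketch that writes them as small functions and joins them with a brace group instead of <( ), which also works in shells without process substitution. The function names and the file name `sample.tsv` are made up for this example, and values are assumed to contain no whitespace.

```shell
#!/usr/bin/env bash
# Hypothetical sample file reproducing the input from the question (tab-separated).
printf 'ID=1234_1\tName=First\tsize_aa=7890\tstart_type=none\tValue=0.123\n'   > sample.tsv
printf 'ID=1234_2\tName=Second\tsize_aa=7969\tstart_type=none\tValue=0.122\n' >> sample.tsv
printf 'ID=1233\tName=Third\tsize_aa=753\tstart_type=ft\tValue=0.223\n'       >> sample.tsv
printf 'ID=445\tClass=ED\tsize_aa=4653\tstart_type=fp\tValue=0.223\n'         >> sample.tsv

# Pass 1: drop every "=value" and keep only the first line (the header).
header_line() { sed 's/=[^[:space:]]*//g' "$1" | head -n 1; }

# Pass 2: drop every "key=" and keep only the values.
strip_keys() { sed 's/[A-Za-z_]*=//g' "$1"; }

# Variation: list every distinct header line that occurs in the file.
# sort -u is used here because uniq alone only collapses adjacent duplicates.
all_headers() { sed 's/=[^[:space:]]*//g' "$1" | sort -u; }

# Header first, then the data; the group's output can be redirected as a whole.
{ header_line sample.tsv; strip_keys sample.tsv; }
```

Note that this variant turns 'Class=ED' into plain 'ED' under the 'Name' header, so running `all_headers sample.tsv` first is a cheap way to check whether the file contains such deviating lines at all.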


Thank you! This is actually pretty similar to what I have done (I got it working shortly after posting this question), just way more beautiful, as I needed many more lines for the same result. I will get back to my code in a few days. Then I will implement either your suggested code or the version suggested by cpad0112, depending on whichever works best :) Anyhow, thank you for your response!


If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted. You can accept more than one if they work.


Thank you for this comment! I will get right to it.
