Question: Having some regex problems capturing strings with special chars. Could use some help.
Asked 3 days ago by a.j.wilson0000 (Johns Hopkins, Baltimore, USA):

I'm having a bit of trouble reformatting this messed-up run log. I want to remove the strings of characters that did not translate correctly from Linux terminal stdout into the log file and then replace those strings with a \t, a \n, or whitespace. I'm doing this for a large number of files, so I need a command-line solution.

Log sample:

The following malformed strings repeat for every entry in the log:

  • ^[[3J^[[H^[[2J^[[1;33m
  • ^[[0m^[[0;33m
  • ^[[0m^[[1;33m
  • ^[[0m|^H/^H-^H^H
  • ^[[1;37m
  • ^[[0m^[[0;37m
  • ^[[0m^[[1;37m
  • ^[[0m^[[0;37m
  • ^[[0m^[[0;37m^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^[[0m^[[0;37m
  • ^[[0m^[[1;32m
  • ^[[0m^[[0;32m

I've tried numerous GNU sed regexes to capture these with escaped special characters, but I keep getting 's/' unterminated errors (I think mainly due to the opening ^ in the strings?). Any pointers on how to go about doing this with sed or awk? Is there an easier way, perhaps with some sort of find-and-replace Python/Perl script?

This is my current regex:

sed 's/\^\[\[3J\^\[\[H\^\[\[2J\^\[\[1;33m//g; s/\^\[\[0m\^\[\[0;33m//g; s/\^\[\[0m\^\[\[1;33m//g; s/\^\[\[0m|\^H\/\^H\-\^H\^H//g; s/\^\[\[1;37m//g; s/\^\[\[0m\^\[\[0;37m//g; s/\^\[\[0m//g; s/\^H//g; s/\^\[\[1;32m//g; s/\^\[\[0;32m//g' run.log > run_clean.log
Tags: regex, sed, awk

I tried your command on a sample file and it worked for me.

Fatima-MacBook-Pro:~ Fatima$ cat tmp
^[[3J^[[H^[[2J^[[1;33m
^[[0m^[[0;33m
^[[0m^[[1;33m
^[[0m|^H/^H-^H^H
^[[1;37m
^[[0m^[[0;37m
^[[0m^[[1;37m
^[[0m^[[0;37m
^[[0m^[[0;37m^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^H^[[0m^[[0;37m
^[[0m^[[1;32m
^[[0m^[[0;32m

Fatima-MacBook-Pro:~ Fatima$ sed 's/\^\[\[3J\^\[\[H\^\[\[2J\^\[\[1;33m//g; s/\^\[\[0m\^\[\[0;33m//g; s/\^\[\[0m\^\[\[1;33m//g; s/\^\[\[0m|\^H\/\^H\-\^H\^H//g; s/\^\[\[1;37m//g; s/\^\[\[0m\^\[\[0;37m//g; s/\^\[\[0m//g; s/\^H//g; s/\^\[\[1;32m//g; s/\^\[\[0;32m//g' tmp

This link might help:

https://unix.stackexchange.com/questions/14684/removing-control-chars-including-console-codes-colours-from-script-output

Reply written 3 days ago by Fatima

Helpful to know it works for you and that my regex is at least correct. Something else is going wrong then I suppose.

Based on your suggestion about color codes, I think the answer might be that sed is a stream editor and these are terminal ANSI codes. If you cat the log file, the progress-bar representations and colors show up as shown below.

https://pasteboard.co/JvYUOyh.png

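So my sed patterns can't match the codes: like cat, sed reads the raw escape and backspace bytes in the file, and the ^[ and ^H I pasted are only how those bytes get displayed, not literal characters.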
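A rough way to test that idea, as a sketch rather than a confirmed fix: if the log holds real control bytes, then `^[` stands for a single ESC byte (octal 033) and `^H` for a single backspace (octal 010), and a pattern that escapes a literal caret will only match a file where those were typed out as plain characters. The snippet below assumes a POSIX shell and sed; the `strip_codes` name and the sample line are made up for illustration, and the pattern only covers sequences shaped like the ones listed in the question (ESC, `[`, optional digits and semicolons, one letter).

```shell
#!/bin/sh
# Hypothetical helper: strip ANSI CSI sequences (ESC [ ... letter) and backspaces.
# Assumption: the log contains real ESC (octal 033) and BS (octal 010) bytes,
# which viewers such as `cat -v` or vim render as ^[ and ^H.
esc=$(printf '\033')
bs=$(printf '\010')

strip_codes() {
    # First expression drops e.g. ESC[3J, ESC[H, ESC[0;37m; second drops backspaces.
    sed -e "s/${esc}\[[0-9;]*[A-Za-z]//g" -e "s/${bs}//g"
}

# Made-up sample line resembling the log (not taken from a real eKLIPse run).
sample=$(printf '\033[0m\033[1;33mReading input\033[0m|\010/\010-\010\010 done')

printf '%s\n' "$sample" | cat -v        # control bytes become visible as ^[ and ^H
printf '%s\n' "$sample" | strip_codes   # prints: Reading input|/- done
```

The spinner characters `|/-` are ordinary text and survive this pass, so they would need their own substitution if they should go too.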

Reply written 3 days ago by a.j.wilson0000

Is this a bioinformatics question?

Reply written 3 days ago by Joe

More of a raw-data skills question, sure. I'm working on a bioinformatics pipeline for mtDNA deletion calling using the eKLIPse deletion caller. So yes, it is related to bioinformatics in that I'm trying to clean up the eKLIPse logs.

Reply written 1 day ago by a.j.wilson0000
Answered 3 days ago by Jorge Amigo (Santiago de Compostela, Spain):

The regex could be simplified a little using Perl, in case it helps:

perl -pe 's/\^\[\[(2J|3J|H|[01](;3[237])?m)//g; s/\^H//g; s/\|\/-//' run.log > run_clean.log
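The pattern above escapes a literal caret, so it matches when the file really contains the two characters `^` and `[`. If the logs instead hold raw ESC and backspace bytes (an assumption worth checking with `cat -v` or `od -c`), a similar set of substitutions can target the bytes directly and be looped over many files, which the question asks for. This is only a sketch with POSIX sh and sed: the `clean_logs` function, the `.log` extension, and the `_clean.log` output suffix are hypothetical choices, and the sequence pattern is a generalization of the listed codes, not an exhaustive ANSI parser.

```shell
#!/bin/sh
# Sketch only. Assumptions: logs end in .log, cleaned copies get a
# hypothetical "_clean.log" suffix, and the files hold real ESC/BS bytes.
esc=$(printf '\033')   # the byte displayed as ^[
bs=$(printf '\010')    # the byte displayed as ^H

clean_logs() {
    # $1: directory containing the logs (hypothetical layout)
    for f in "$1"/*.log; do
        [ -e "$f" ] || continue                      # glob matched nothing
        case $f in *_clean.log) continue ;; esac     # skip output of earlier runs
        # Remove CSI sequences, then backspaces, then the first |/- spinner residue.
        sed -e "s/${esc}\[[0-9;]*[A-Za-z]//g" \
            -e "s/${bs}//g" \
            -e 's/|\/-//' "$f" > "${f%.log}_clean.log"
    done
}

# Usage: clean_logs /path/to/logs
```

Writing to a separate `_clean.log` file keeps the originals untouched, so a pattern that turns out to be too greedy can be corrected and rerun.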

Thanks, Jorge, I'll try it out.

Reply written 3 days ago by a.j.wilson0000