Question

sed regex code help wanted

5.7 years ago

bgold04 • 0

In an output file, I want to preserve these lines:

Prevotella_sp 
Leptospira_interrogans 
Leptospira_interrogans
Escherichia_coli

but remove the alphanumeric accession and the underscore in front of the genus_species in these lines:

ADWS01000032_Escherichia_coli 
EQ973222_Bacteroides_fragilis     
AEEI01000021_Prevotella_marshii     
AEXO01000076_Prevotella_denticola     
EQ973222_Bacteroides_fragilis     
ACIY01000543_Enterococcus_faecium     
ACIY01000542_Enterococcus_faecium

I have tried the following. The intention was to match at least two uppercase letters and two digits in a row, with wildcards on each side, up to the underscore, among other things, but to no avail.

sed 's/^[^*[A-Z]{2}[0-9]{2}*_]//' input.file > output.file

I appreciate any and all suggestions. I am still not very good with regex.

Bert Gold

regular expressions sed microbiome • 1.5k views

Updated 5.7 years ago by Ram 43k • written 5.7 years ago by bgold04 • 0

I noticed that the strings you want to keep each contain a single underscore, while the ones you want to change each contain two underscores. So it is easy to tell them apart without a regex. I'm not good at sed, though, so I don't know how to do it there.

In Python, you can split each string on the underscore into a list, then filter by the length of the result.

Edit: it seems I misunderstood your question. What you actually want is not a regex search but string manipulation.
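
The same split-and-count idea can be sketched in the shell with awk instead of Python. This is only a sketch: the function name strip_by_field_count is made up for illustration, and it assumes every line is either Genus_species (two fields) or ID_Genus_species (three fields). A name with extra underscores, such as a strain suffix, would lose its first field too.

```shell
# Hypothetical helper: split each line on "_" and drop the first field
# only when the line has three or more fields.
# Assumption: 2 fields = Genus_species, 3 fields = ID_Genus_species.
strip_by_field_count() {
    awk -F'_' 'NF >= 3 { sub(/^[^_]*_/, "") } { print }'
}

# Two-field lines pass through unchanged; three-field lines lose the ID.
printf '%s\n' 'Prevotella_sp' 'ADWS01000032_Escherichia_coli' | strip_by_field_count
```

On a real file this would be called as strip_by_field_count < input.file > output.file.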

Reply • 5.7 years ago by shoujun.gu ▴ 380

5.7 years ago

Damian Kao 16k

If all your lines have ID_genus_species only (3 items), then you can probably just use cut:

cut -f 2,3 -d '_' input.file

Sorry, not all the entries have 3 items; that's why I asked the question. Thanks for thinking about this, though. -- Bert

Reply • 5.7 years ago by bgold04 • 0

@cpad0112's solution should work fine.

Reply • 5.7 years ago by GenoMax 141k

bgold04, maybe you should post those entries (the ones without 3 items) along with the entries in the original post.

Reply • 5.7 years ago by cpad0112 21k

score 3 · Accepted Answer · 2018-08-16

output

$ sed 's/^\w\+[0-9]\+_//' test.txt 
Escherichia_coli 
Bacteroides_fragilis 
Prevotella_marshii 
Prevotella_denticola 
Bacteroides_fragilis 
Enterococcus_faecium 
Enterococcus_faecium

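in bash, with GNU grep (-P needs PCRE support; -o prints only the matched part, so any line without a digit-plus-underscore prefix is dropped from the output rather than preserved):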

$ grep -Po '(?<=[0-9]_).*' test.txt 
Escherichia_coli 
Bacteroides_fragilis 
Prevotella_marshii 
Prevotella_denticola 
Bacteroides_fragilis 
Enterococcus_faecium 
Enterococcus_faecium

input:

$ cat test.txt 
ADWS01000032_Escherichia_coli 
EQ973222_Bacteroides_fragilis 
AEEI01000021_Prevotella_marshii 
AEXO01000076_Prevotella_denticola 
EQ973222_Bacteroides_fragilis 
ACIY01000543_Enterococcus_faecium 
ACIY01000542_Enterococcus_faecium
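
The sed command above relies on GNU extensions (\w and \+). A possible variant for mixed input, where lines without an accession must be kept as they are, is sketched below using extended regular expressions. The function name strip_accession is made up for illustration, and the pattern assumes every accession is a run of uppercase letters followed by digits and one underscore, as in the sample lines; other ID formats would need a different pattern.

```shell
# Hypothetical wrapper around sed; not part of the original answer.
# Assumption: accessions look like uppercase letters + digits + "_"
# (e.g. ADWS01000032_). Lines that do not start that way are printed as is.
strip_accession() {
    LC_ALL=C sed -E 's/^[A-Z]+[0-9]+_//'
}

# Mixed input: the first two lines are preserved, the last two lose the prefix.
printf '%s\n' \
    'Prevotella_sp' \
    'Leptospira_interrogans' \
    'ADWS01000032_Escherichia_coli' \
    'EQ973222_Bacteroides_fragilis' | strip_accession
```

Unlike grep -o, sed prints non-matching lines unchanged, so Prevotella_sp and Leptospira_interrogans survive. On a file this would be called as strip_accession < input.file > output.file.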