Question: sed regex code help wanted
bgold04 wrote:

In an output file, I want to preserve these lines:

Prevotella_sp 
Leptospira_interrogans 
Leptospira_interrogans
Escherichia_coli

but remove the alphanumeric prefix and the underscore in front of the genus_species in these lines:

ADWS01000032_Escherichia_coli 
EQ973222_Bacteroides_fragilis     
AEEI01000021_Prevotella_marshii     
AEXO01000076_Prevotella_denticola     
EQ973222_Bacteroides_fragilis     
ACIY01000543_Enterococcus_faecium     
ACIY01000542_Enterococcus_faecium

Among other things, I have tried the command below, which was meant to match at least two uppercase letters and two digits in a row, with wildcards on each side up to the underscore, to no avail.

sed 's/^[^*[A-Z]{2}[0-9]{2}*_]//' input.file > output.file

I appreciate any and all suggestions. I am still not very good with regex.

Bert Gold


I noticed that the strings you want to keep each contain a single underscore, while the ones you want to change each contain two. That makes them easy to tell apart without a regex. I'm not good at sed, though, so I don't know how to do it there.

In Python, you could split each string on the underscore into a list, then filter by the length of the result.

Edit: it seems I misunderstood your question. What you actually want is not a regex search but string manipulation.

(reply by shoujun.gu)
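
The split-and-count idea in the comment above can also be sketched in the shell with awk instead of Python. This is only an illustration under a strong assumption: that any line with three or more underscore-separated fields starts with an ID, and any line with fewer does not. The name `strip_by_fields` is hypothetical.

```shell
# Hypothetical helper: drop the first underscore-separated field,
# but only on lines that have at least three fields.
strip_by_fields() {
  awk -F'_' '
    NF >= 3 { sub(/^[^_]*_/, "") }  # remove everything up to the first underscore
    { print }                       # print every line, changed or not
  ' "$@"
}

printf '%s\n' \
  'ADWS01000032_Escherichia_coli' \
  'Prevotella_sp' | strip_by_fields
```

If some unprefixed names have three or more fields of their own, this would wrongly trim them, so a pattern that checks what the prefix looks like is the safer choice.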
cpad0112 wrote:

Output:

$ sed 's/^\w\+[0-9]\+_//' test.txt 
Escherichia_coli 
Bacteroides_fragilis 
Prevotella_marshii 
Prevotella_denticola 
Bacteroides_fragilis 
Enterococcus_faecium 
Enterococcus_faecium

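With GNU grep (-P needs PCRE support; -o prints only the matched part, so lines without a digit-underscore prefix are not printed at all):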

$ grep -Po '(?<=[0-9]_).*' test.txt 
Escherichia_coli 
Bacteroides_fragilis 
Prevotella_marshii 
Prevotella_denticola 
Bacteroides_fragilis 
Enterococcus_faecium 
Enterococcus_faecium

Input:

$ cat test.txt 
ADWS01000032_Escherichia_coli 
EQ973222_Bacteroides_fragilis 
AEEI01000021_Prevotella_marshii 
AEXO01000076_Prevotella_denticola 
EQ973222_Bacteroides_fragilis 
ACIY01000543_Enterococcus_faecium 
ACIY01000542_Enterococcus_faecium
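
A note on why the command in the question matches nothing, plus a portable variant. In sed's default basic regex syntax, an unescaped `{2}` is literal text rather than a repeat count, and the leading `[^*[A-Z]` is read as one negated bracket expression, so the pattern ends up looking for a literal `{2}` that never occurs in the input. The sketch below is one possible rewrite, not the only one. It uses `sed -E` (extended regex) and assumes every prefix is two or more uppercase letters followed by two or more digits and an underscore. The helper name `strip_accession` is made up for illustration.

```shell
# Hypothetical helper: strip a leading ID such as ADWS01000032_ or EQ973222_.
strip_accession() {
  # -E enables extended regex, so {2,} is an interval without backslashes.
  # LC_ALL=C keeps [A-Z] limited to ASCII uppercase letters.
  # Pattern: start of line, 2+ uppercase letters, 2+ digits, one underscore.
  LC_ALL=C sed -E 's/^[A-Z]{2,}[0-9]{2,}_//' "$@"
}

printf '%s\n' \
  'ADWS01000032_Escherichia_coli' \
  'EQ973222_Bacteroides_fragilis' \
  'Prevotella_sp' \
  'Escherichia_coli' | strip_accession
```

Lines such as Prevotella_sp pass through unchanged because they do not start with uppercase letters followed by digits. Called with a file name, for example `strip_accession input.file > output.file`, it reads the file instead of standard input.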

Damian Kao wrote:

If all your lines have ID_genus_species only (3 items), then you can probably just use cut:

cut -f 2,3 -d '_' input.file

Sorry, not all the entries have 3 items; that's why I asked the question. Thanks for thinking about this, though. -- Bert

(reply by bgold04)

@cpad0112's solution should work fine.

(reply by genomax)

Maybe you should post those entries (the ones without 3 items) along with the entries in the original post, bgold04.

(reply by cpad0112)