How to remove a substring from each line using Python 3
2.6 years ago

Hi Folks,

I have these header lines:

>CP001830.1_cds_AEH77465.1_1 [locus_tag=SM11_chr0180]  [protein_id=AEH77465.1] [location=195246..195674]
>KI271598.1_cds_ERL64443.1_1  [locus_tag=L248_0985]  [protein_id=ERL64443.1] [location=complement(53545..53919)]
>CR931997.1_cds_CAI37700.1_1 [locus_tag=jk1527] [db_xref=EnsemblGenomes-Gn:jk1527,EnsemblGenomes-Tr:CAI37700,GOA:Q4JU07,InterPro:IPR001185,UniProtKB/TrEMBL:Q4JU07] [protein_id=CAI37700.1] [location=1801511..1801945]
>HE858529.1_cds_CCI62285.1_1 [locus_tag=SDSE_0788] [db_xref=EnsemblGenomes-Gn:SDSE_0788,EnsemblGenomes-Tr:CCI62285,GOA:K4Q7R5,InterPro:IPR001185,InterPro:IPR019823,UniProtKB/TrEMBL:K4Q7R5] [protein_id=CCI62285.1] [location=complement(732360..732734)]

Some lines contain a "[db_xref=Ensemb...]" field, which I want to remove.

I cannot simply delete everything after this field (e.g. with "sed"), because I need the rest of the line. I tried awk and sed. I also cannot "cut" or print [awk] by column, because the field is not present in every line.

So a script using a regular expression would probably be better, I guess.

However, I cannot figure it out... Could you please help?


What is unclear after reading the documentation?


Regular expression posted by @JC below should work with sed -r.


I don't see why sed can't do this? E.g.,

sed -e 's/\[db_xref=Ensemb[^]]*\]//g'
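For anyone who wants the same negated-character-class idea in Python rather than sed, here is a minimal sketch using `re.sub`. The helper name `strip_db_xref` and the shortened sample header are only illustrative, and the pattern assumes the tag never contains a nested "]".

```python
import re

# "[db_xref=Ensemb", then any run of characters that are not "]", then the closing "]".
DB_XREF = re.compile(r"\[db_xref=Ensemb[^\]]*\]")


def strip_db_xref(line):
    """Return the line with every [db_xref=Ensemb...] tag removed (hypothetical helper)."""
    return DB_XREF.sub("", line)


# Shortened, made-up variant of one of the headers from the question.
header = (
    ">CR931997.1_cds_CAI37700.1_1 [locus_tag=jk1527] "
    "[db_xref=EnsemblGenomes-Gn:jk1527,GOA:Q4JU07] "
    "[protein_id=CAI37700.1] [location=1801511..1801945]"
)
print(strip_db_xref(header))
```

Because `[^\]]*` cannot cross a "]", the match stops at the end of the db_xref tag and the protein_id and location fields are left alone.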


For me, it does not work.

2.6 years ago
JC 12k

Perl:

perl -pe 's/\[db_xref=Ensembl.+?\]//g' < input > output
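The "?" after ".+" is what keeps the match from running past the tag. A small Python sketch of the difference, using a shortened, made-up header (Python's `re` module treats ".+?" as non-greedy in the same way):

```python
import re

# Shortened, made-up header for illustration only.
line = "[locus_tag=jk1527] [db_xref=EnsemblGenomes-Gn:jk1527] [protein_id=CAI37700.1]"

# Non-greedy: ".+?" stops at the first "]" after the tag opens.
lazy = re.sub(r"\[db_xref=Ensembl.+?\]", "", line)

# Greedy: ".+" runs on to the last "]" in the line, so protein_id is removed too.
greedy = re.sub(r"\[db_xref=Ensembl.+\]", "", line)

print(repr(lazy))
print(repr(greedy))
```

If a tool treats the pattern as greedy, everything up to the final "]" on the line is deleted, which is exactly what the question wants to avoid.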


Hi JC,

Thanks a lot. You saved my day.

2.3 years ago
Wayne ▴ 670

Python 3:
In case someone ends up here given the 'Python 3' portion of the OP's question:

Without a regular expression:

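output = ""
for line in lines:  # lines: any iterable of header lines, e.g. an open file
    if "[db_xref=Ensembl" in line:
        # Keep the text before the tag and the text after the tag's closing "]".
        before, rest = line.split("[db_xref=Ensembl", 1)
        output += before + rest.split("]", 1)[1]
    else:
        output += line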

With regular expressions:

import re

output = ""
for line in lines:  # lines: any iterable of header lines, e.g. an open file
    output += re.sub(r"\[db_xref=Ensembl.+?\]", "", line)

Static view with full run through displayed.

Run and edit the code actively in your browser via MyBinder.org here.
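If the headers live in a file, the regular-expression version can be wrapped so it streams from one file to another. This is only a sketch: `clean_headers` and `clean_file` are hypothetical names, the paths are placeholders, and collapsing repeated spaces is an extra step that assumes double spaces in the headers carry no meaning.

```python
import re

DB_XREF = re.compile(r"\[db_xref=Ensembl.+?\]")


def clean_headers(lines):
    """Yield each line with the Ensembl db_xref tag removed and spacing tidied."""
    for line in lines:
        cleaned = DB_XREF.sub("", line)
        # Collapse the run of spaces left behind where the tag used to be.
        yield re.sub(r" {2,}", " ", cleaned)


def clean_file(in_path, out_path):
    """Stream in_path to out_path line by line (both paths are placeholders)."""
    with open(in_path) as src, open(out_path, "w") as dst:
        dst.writelines(clean_headers(src))
```

A call such as `clean_file("input.fasta", "output.fasta")` would then process the whole file without building one large string in memory.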
