Question: Help editing FASTA headers
2.5 years ago, mario.t.murakami (0) wrote:

Dear All,

I would like to edit the headers of a multi-FASTA file, but I am not very familiar with scripting.

Basically, I want to remove any text in parentheses (including the parentheses themselves) and remove the square brackets while keeping the text inside them. For instance, this header:

>gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]

should become:

>gi|745831934|gb|AJD39620.1| protein A Homo sapiens

Thanks in advance

2.5 years ago, venu (5.3k, Germany) wrote:
perl -pe 's/\(.*\)//' file.faa | sed 's/\[//' | sed 's/\]//'

or (Updated)

perl -pe 's/\(.*\)//' file.faa | perl -pe 's/\[//' | perl -pe 's/\]//'
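The one-liners above use a greedy `.*`, so a header with two parenthesised groups would lose everything between the first `(` and the last `)`. Each bracket substitution also removes only the first match on a line, and the space before the deleted group stays behind. A minimal sketch of a stricter variant follows, assuming a sed that supports `-E` (GNU and BSD sed both do). The file names `demo.faa` and `demo_clean.faa` and the second record are hypothetical, made up for illustration.

```shell
# Build a two-record demo file (hypothetical data; substitute your own file.faa).
printf '%s\n' \
  '>gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]' \
  'MKTAYIAKQR' \
  '>seq2 protein B (chromosome) (partial) [Mus musculus]' \
  'MSLLTEVETP' > demo.faa

# On header lines only (those starting with ">"):
#   1. delete every "(...)" group together with the spaces in front of it
#   2. delete every "[" and "]" but keep the text between them
sed -E '/^>/ { s/ *\([^)]*\)//g; s/[][]//g; }' demo.faa > demo_clean.faa

cat demo_clean.faa
```

Restricting the edit to lines that match `/^>/` leaves the sequence lines untouched, and `[^)]*` stops each match at the nearest closing parenthesis.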

 


Dear Venu,

Thanks for your help. When I ran the command you suggested, I got the following error:

sed: -e expression #1, char 1: unknown command: ` ' '

 

Please also tell me how to write the result to an output file instead of only printing it to the screen. Thanks again.

 

written 2.5 years ago by mario.t.murakami (0)

You can redirect the output to a new file like this:

$ perl -pe 's/\(.*\)//' file.faa | sed 's/\[//' | sed 's/\]//' > new_file_name

written 2.5 years ago by genomax (51k)
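One way to confirm that a redirected run really changed the headers is to count the header lines in the output that still contain a parenthesis or bracket. The sketch below is only an illustration. The file names `in.faa` and `out.faa` are hypothetical, and the three substitutions from the command above are folded into a single sed call so the example does not depend on perl.

```shell
# Hypothetical one-record input so the sketch runs on its own.
printf '%s\n' \
  '>gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]' \
  'MKTAYIAKQR' > in.faa

# Write to a NEW file. Redirecting to in.faa itself would truncate the
# input before sed gets a chance to read it.
# In sed's basic regex syntax an unescaped "(" or ")" is a literal parenthesis.
sed -e 's/(.*)//' -e 's/\[//' -e 's/\]//' in.faa > out.faa

# Count header lines that still contain "(", ")", "[" or "]".
# "|| true" keeps the status clean when grep finds nothing.
remaining=$(grep '^>' out.faa | grep -c '[][()]' || true)
echo "headers still containing brackets: $remaining"
```

A count of 0 means the edit reached the output file. A non-zero count means the substitutions were not applied, or that some headers contain more than one bracket pair.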

Yes, I did that, but the output file is just a copy of the original. PS: I tested it without the sed commands.

written 2.5 years ago by mario.t.murakami (0)

Are you saying that you are still getting the sed error you mentioned above? Are you using the command exactly as provided by @venu (with single quote characters)?

modified 2.5 years ago • written 2.5 years ago by genomax (51k)

I mean that the changes the command should make do not appear in the output file.

By the way, in the updated command from @venu, the second and third parts are not working. It now prints only this:

original: >gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]

printed: >gi|745831934|gb|AJD39620.1| protein A [Homo sapiens]

Could it be related to the fact that I am using Strawberry Perl on Windows?

Thanks

 

written 2.5 years ago by mario.t.murakami (0)

Evidently the Perl you are using on Windows does not behave the way we expect Perl to behave on Unix.

Is the file too large to make these changes by hand in a text editor on Windows?

written 2.5 years ago by genomax (51k)

Yes, there are 400 sequences. I will run the command on Linux, where it will probably work fine.

Thanks

written 2.5 years ago by mario.t.murakami (0)

It's working perfectly well for me. I don't know why you are getting an error. I am updating my answer. Redirect the output to a new file as @genomax said.

modified 2.5 years ago • written 2.5 years ago by venu (5.3k)

Thanks, both. On Linux it worked fine.

written 2.5 years ago by mario.t.murakami (0)