Help editing FASTA headers
8.2 years ago

Dear All,

I would like to edit the header of a multifasta file, but I am not so familiar with scripting.

Basically, I want to remove the information within parentheses (including the parentheses themselves) and remove the bracket symbols while keeping the information within them. For instance, the first header below should become the second:

>gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]
>gi|745831934|gb|AJD39620.1| protein A Homo sapiens

Thanks in advance

sequence alignment • 2.8k views
8.2 years ago
venu 7.1k
perl -pe 's/\(.*\)//' file.faa | sed 's/\[//' | sed 's/\]//'

or, using only perl (updated):

perl -pe 's/\(.*\)//' file.faa | perl -pe 's/\[//' | perl -pe 's/\]//'
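
A possible single-pass variant, offered as a sketch rather than a replacement for the commands above. `\(.*\)` is greedy, so on a header with two parenthesised groups it would also delete whatever sits between them, and `s/\[//` without the `g` flag removes only the first bracket on a line. The version below assumes a `sed` that supports `-E` (GNU or BSD), restricts the edits to header lines, and wraps them in a helper called `clean_fasta_headers`, a name made up for this example.

```shell
# Hypothetical helper: reads FASTA on stdin, writes cleaned FASTA to stdout.
clean_fasta_headers() {
  # Only header lines (starting with ">") are edited:
  #   1. drop every "(...)" group together with the spaces in front of it
  #   2. delete the "[" and "]" characters but keep the text between them
  sed -E '/^>/ { s/ *\([^)]*\)//g; s/[][]//g; }'
}

# Try it on the header from the question plus one sequence line.
printf '%s\n' \
  '>gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]' \
  'MKVLAAGIV' | clean_fasta_headers
```

On a real file the same helper would be called as `clean_fasta_headers < file.faa`. Sequence lines pass through unchanged because the substitutions are guarded by `/^>/`.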

Dear Venu,

Thanks for your help. When I ran your suggested command line, I got the following error:

sed: -e expression #1, char 1: unknown command: `''

Please tell me how to write the result to an output file instead of only printing it on the screen. Thanks again.


You can redirect the output to a new file like this:

$ perl -pe 's/\(.*\)//' file.faa | sed 's/\[//' | sed 's/\]//' > new_file_name
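
To check that the redirect really produced an edited file, something along these lines may help. It is only a sketch: the input is a one-record stand-in created in a scratch directory, the variable names are invented for the example, and the single `sed -E` call is a substitute for the perl/sed pipeline above (an assumption, not the command from the answer).

```shell
# Work in a scratch directory so no real file is overwritten.
workdir=$(mktemp -d)
input="$workdir/file.faa"          # stand-in for the real multi-FASTA file
output="$workdir/new_file_name"    # output name used in the reply above

printf '%s\n' \
  '>gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]' \
  'MKVLAAGIV' > "$input"

# ">" creates (or overwrites) the output file; the input is left untouched.
sed -E '/^>/ { s/ *\([^)]*\)//g; s/[][]//g; }' "$input" > "$output"

# Sanity checks: same number of records, and no ( ) [ ] left in any header.
records_in=$(grep -c '^>' "$input")
records_out=$(grep -c '^>' "$output")
leftover=$(grep '^>' "$output" | grep -c '[][()]' || true)
echo "records: $records_in -> $records_out, headers still containing brackets: $leftover"
```

If the two record counts differ, or the last number is not 0, the substitution did not run as intended and the output should not be trusted.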

Yes, I did that, but it just writes out the original file unchanged. PS: I tested it without the sed commands.


Are you saying that you are still getting the sed error you mentioned above? Are you using the command exactly as provided by @venu (with single quote characters)?


I mean that the changes the command should make are not written to the output file.

By the way, the second and third parts of the command line updated by @venu are not working. It now prints only this:

original: >gi|745831934|gb|AJD39620.1| protein A (plasmid) [Homo sapiens]
printed: >gi|745831934|gb|AJD39620.1| protein A [Homo sapiens]

Perhaps it is related to the fact that I am using Strawberry Perl on Windows?

thanks


Evidently the Perl you are using on Windows does not behave the way we expect Perl to behave on Unix.

Is the file too large to make this change in a text editor on Windows?


Yes, there are 400 sequences. I will run the script on Linux, where it will probably work fine.

thanks


It works perfectly fine for me. I don't know why you are getting an error. I am updating the answer. Redirect the output to a new file as @genomax said.


Thanks, both. On Linux it worked fine.
