How to remove spaces in the headers of FASTA files
6.7 years ago
Crystal ▴ 50

Hi,

I have a database with thousands of sequences.

The format of the header for each sequence is like:

VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)

Since there are several spaces in the header, when I use an alignment tool to blast my samples against this database, the only thing shown in my results is VFG0676.

Is there any way I can remove all the spaces in the header, so that my results show the full description?

I also want to extract the sequence headers from this database, but right now the extracted result only displays a list of VFGs, with no other information.

Can anyone help me with this?

Thanks

Crystal

next-gen sequence • 10k views
6.7 years ago
apelin20 ▴ 480
sed 's, ,_,g' -i FASTA_file

Let me know if you need any help.

Excuse me, does this command produce an output file?

The -i option edits in place. If you remove it then just redirect to a file.
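
A minimal sketch of the redirect approach, run on a made-up one-record file so nothing real is touched. The file names `demo.fa` and `demo_fixed.fa` and the sequence are assumptions for illustration; the header is the one from the question with the usual FASTA `>` added.

```shell
# Scratch directory for the demo files (hypothetical names).
tmp=$(mktemp -d)

# One-record FASTA file built from the header in the question.
printf '>VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)\nATGAATATA\n' > "$tmp/demo.fa"

# Without -i, sed prints the edited text to stdout; redirect it to a new file.
sed 's, ,_,g' "$tmp/demo.fa" > "$tmp/demo_fixed.fa"

# Show the first line of the untouched input and of the new output.
head -n 1 "$tmp/demo.fa"
head -n 1 "$tmp/demo_fixed.fa"
```

The original file is left as it was, which makes it easy to compare the two before committing to an in-place edit.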

Thank you. Then I tried to extract all the headers in that file, but now the format is 

VFG0676_lef_

and still didn't show the rest of the information.

I did go back and check the edited file, and the format of the headers is

VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)

So I don't know if the problem is due to the code I used to extract headers from the file.

Thanks
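
The extraction command is not shown in the thread, so it is hard to say where the truncation happens. One way to check, sketched here under that caveat, is to pull the header lines straight out of the file with grep, which does no parsing of its own. `fasta_headers` and `demo.fa` are hypothetical names.

```shell
# Print every FASTA header line in full, without the leading ">".
# Hypothetical helper; $1 is the path to a FASTA file.
fasta_headers() {
    grep '^>' "$1" | sed 's/^>//'
}

# Made-up one-record file using the edited header from this post.
tmp=$(mktemp -d)
printf '>VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)\nATGAATATA\n' > "$tmp/demo.fa"

fasta_headers "$tmp/demo.fa"
```

If this prints the whole line while another tool stops at `VFG0676_lef_`, the truncation is coming from that tool rather than from the file.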

Seems like the "-" is somehow problematic. After doing

sed 's, ,_,g' -i FASTA_file

Try

sed 's,-,,g' -i FASTA_file

This should remove the - from the header. Normally hyphens aren't a problem, but you can still remove them.
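
Note that a substitution without an address is applied to every line, sequence lines included, so removing `-` this way would also delete gap characters from aligned sequences. A sketch of limiting the edit to header lines with a `/^>/` address follows; `strip_header_dashes`, the file name, and the gapped sequence are made up for the example.

```shell
# Remove "-" only on lines that start with ">" (FASTA headers).
# Hypothetical helper; $1 is the path to a FASTA file. Output goes to stdout.
strip_header_dashes() {
    sed '/^>/ s/-//g' "$1"
}

# Made-up record whose sequence contains gap characters.
tmp=$(mktemp -d)
printf '>VFG0676_lef_-_anthrax_toxin\nATG--AAT\n' > "$tmp/gapped.fa"

# The header loses its "-"; the sequence line keeps its gaps.
strip_header_dashes "$tmp/gapped.fa"
```

The same `/^>/` prefix can be put in front of the other substitutions in this thread if the sequences might contain any of the characters being removed.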

Well, now the extracted headers are longer, but still not the full descriptions. :(

The format is VFG0676_lef_-_anthrax_toxin

Also, I forgot to mention that the bacteria name is in square brackets, like [bacteria_name].

Great... looks like the commas are a problem too... we shall slay them as well!

sed 's.,..g' -i FASTA_file
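
The `.` here is only the delimiter of the `s` command, not a regex wildcard: sed accepts almost any character in that position. A small check, using the header from this thread as assumed input, that the command behaves the same as the more familiar slash form:

```shell
# Header taken from earlier in the thread (assumed input for the demo).
header='>VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)'

# "." as the delimiter: pattern is ",", replacement is empty.
with_dot=$(printf '%s\n' "$header" | sed 's.,..g')

# Same substitution written with "/" as the delimiter.
with_slash=$(printf '%s\n' "$header" | sed 's/,//g')

printf '%s\n%s\n' "$with_dot" "$with_slash"
```

Both lines printed should be identical, with every comma removed and nothing else changed.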

I think I also need to remove "[]" and "()" in the headers, too.

Should I use code like:

sed 's,(),,g' -i FASTA_file
sed 's,[],,g' -i FASTA_file

OR 

sed 's,[,,g' -i FASTA_file
sed 's,],,g' -i FASTA_file
sed 's,(,,g' -i FASTA_file
sed 's,),,g' -i FASTA_file

PS: As a noob to this forum, I can only post five messages/day. 

Thanks

sed 's,(),,g' will remove the two-character string "()", not "(" and ")" individually. You could do sed 's/[()\[]//g;s/\]//g' to remove [, ], (, and ) in a single go.

BTW, there's probably a shorter way of doing that, but I can't get sed to allow [ and ] together in a list...
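
One way that should work, for what it's worth: in a POSIX bracket expression a `]` placed first in the list is taken literally, so all four characters fit in one list. A sketch, restricted to header lines; the helper name `strip_brackets` and the sample record are assumptions.

```shell
# Remove "[", "]", "(" and ")" from header lines only.
# In the list [][()], the first "]" is a literal member, followed by "[", "(", ")".
# Hypothetical helper; $1 is the path to a FASTA file. Output goes to stdout.
strip_brackets() {
    sed '/^>/ s/[][()]//g' "$1"
}

# Made-up record with both kinds of brackets in the header.
tmp=$(mktemp -d)
printf '>VFG0676_lef_[bacteria_name]_(VF0142)\nATGAATATA\n' > "$tmp/brackets.fa"

strip_brackets "$tmp/brackets.fa"
```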

If you are on OS X and want to edit in place, it is slightly different: sed -i '' 's/ /_/g' foo.fa
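
If the same command has to run on both GNU and BSD/macOS sed, attaching a backup suffix directly to `-i` is accepted by both, as far as I know. A sketch with a throwaway file (`foo.fa` here lives in a temp directory and its contents are made up):

```shell
# Throwaway copy so no real data is edited.
tmp=$(mktemp -d)
printf '>VFG0676 lef - anthrax toxin\nATGAATATA\n' > "$tmp/foo.fa"

# -i.bak edits foo.fa in place and keeps the original as foo.fa.bak.
sed -i.bak 's/ /_/g' "$tmp/foo.fa"

# Edited file first, then the untouched backup.
head -n 1 "$tmp/foo.fa"
head -n 1 "$tmp/foo.fa.bak"
```

The backup also gives a way back if the substitution turns out to be wrong; it can be deleted once the result has been checked.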

6.7 years ago

BBTools has a read reformatter which will replace all of the whitespace in headers with underscores:

reformat.sh in=reads.fasta out=fixed.fasta addunderscore
