Question: How to remove spaces in headers of FASTA files
Crystal wrote:

Hi,

I have a database with thousands of sequences.

The format of the header for each sequence is like:

VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)

Since there are several spaces in the header, when I use an alignment tool to blast my samples against this database, the only thing shown in my results is VFG0676.

Is there any way I can remove all the spaces in the header, so that my results show the full description?

I also want to extract the sequence headers from this database, but right now the extracted result only displays a list of VFG IDs, with no other information.

Can anyone help me with this?

Thanks

Crystal

written 4.2 years ago by Crystal

apelin20 wrote:
sed 's, ,_,g' -i FASTA_file

Let me know if you need any help.

written 4.2 years ago by apelin20
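
The substitution above is applied to every line of the file, not just the headers. A minimal sketch that restricts it to header lines, assuming headers start with `>` as in standard FASTA (the sequence line below is made up for illustration):

```shell
# Two-line example record: the header from the question with a leading ">",
# followed by an invented sequence line.
record='>VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)
ATGAAAGC'

# The /^>/ address limits the substitution to header lines;
# sequence lines pass through unchanged.
fixed=$(printf '%s\n' "$record" | sed '/^>/ s/ /_/g')
printf '%s\n' "$fixed"
```

On a real file the same expression would be used as `sed '/^>/ s/ /_/g' FASTA_file > new_file`, where `new_file` is a placeholder name.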

Excuse me, is there an output file when using that sed command?

written 4.2 years ago by Crystal

The -i option edits in place. If you remove it then just redirect to a file.

written 4.2 years ago by Devon Ryan
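
The two options described above (redirecting versus editing in place) can be sketched as follows. The file names `in.fa` and `out.fa` and the example record are hypothetical. The attached-suffix form `-i.bak` is used here because, to my knowledge, both GNU and BSD sed accept it, and it keeps a backup of the original:

```shell
# Scratch directory with a small example file, so the sketch is self-contained.
dir=$(mktemp -d)
printf '%s\n' '>seq1 some description' 'ACGT' > "$dir/in.fa"

# Without -i, sed prints to standard output; redirect it to a new file.
sed 's/ /_/g' "$dir/in.fa" > "$dir/out.fa"

# With -i.bak, sed edits in.fa in place and keeps the original as in.fa.bak.
sed -i.bak 's/ /_/g' "$dir/in.fa"
```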

Thank you. Then I tried to extract all the headers in that file, but now the format is

VFG0676_lef_

and it still doesn't show the rest of the information.

I did go back and check the edited file, and the format of the headers is

VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)

So I don't know if the problem is due to the code I used to extract headers from the file.

Thanks

written 4.2 years ago by Crystal
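
Since the edited file contains the full header, the truncation may come from the extraction step rather than from the file. A minimal sketch that lists complete headers with plain sed, assuming headers start with `>` (the sequence line is invented for illustration):

```shell
# Example record using the edited header reported above.
record='>VFG0676_lef_-_anthrax_toxin,_lef,_bacteria_name_(VF0142)
ATGAAAGC'

# -n suppresses normal output; the p flag prints only lines where the
# leading ">" was matched and removed, i.e. the header lines.
headers=$(printf '%s\n' "$record" | sed -n 's/^>//p')
printf '%s\n' "$headers"
```

Because this treats the header as ordinary text, no character in it is interpreted as a delimiter.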

Seems like the "-" is somehow problematic. After doing

sed 's, ,_,g' -i FASTA_file

Try

sed 's,-,,g' -i FASTA_file

This should remove the - from the header. Normally a - isn't a problem, but you can still remove it.

written 4.2 years ago by apelin20

Well, now the extracted headers are longer, but still not the full descriptions.  :(

The format is VFG0676_lef_-_anthrax_toxin

Also, I forgot to mention that the bacteria name is enclosed in [], like [bacteria_name].

written 4.2 years ago by Crystal

Great... looks like the commas are a problem too... we shall slay them as well!

sed 's.,..g' -i FASTA_file
written 4.2 years ago by apelin20
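
The separate passes used so far (spaces, "-", commas) can also be combined into one sed call. A sketch under the assumption that headers start with `>`, with an invented sequence line. It removes "-" and "," first and then collapses each run of spaces into a single underscore, so no doubled underscores are left behind:

```shell
record='>VFG0676 lef - anthrax toxin, lef, bacteria name (VF0142)
ATGAAAGC'

# First expression: delete "-" and "," on header lines
# (a leading "-" inside [...] is taken literally).
# Second expression: replace each run of one or more spaces with one "_".
cleaned=$(printf '%s\n' "$record" |
  sed -e '/^>/ s/[-,]//g' -e '/^>/ s/ \{1,\}/_/g')
printf '%s\n' "$cleaned"
```

Restricting both expressions to `/^>/` also keeps "-" characters in sequence lines, which matters if the file holds gapped alignments.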

I think I also need to remove "[]" and "()" from the headers.

Should I use code like:

sed 's,(),,g' -i FASTA_file
sed 's,[],,g' -i FASTA_file

OR 

sed 's,[,,g' -i FASTA_file
sed 's,],,g' -i FASTA_file
sed 's,(,,g' -i FASTA_file
sed 's,),,g' -i FASTA_file

PS: As a noob to this forum, I can only post five messages/day. 

Thanks

written 4.2 years ago by Crystal

sed 's,(),,g' will remove the literal string "()", not "(" and ")" individually. You could do sed 's/[()\[]//g;s/\]//g' to remove [, ], (, and ) in a single go.

BTW, there's probably a shorter way of doing that, but I can't get sed to allow [ and ] together in a list...

written 4.2 years ago by Devon Ryan
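
On the shorter form mentioned above: in a POSIX bracket expression, a `]` placed first in the list is taken literally, so all four characters fit in one list. A minimal sketch, using a hypothetical header that combines the `[bacteria_name]` and `(VF0142)` parts described in the thread:

```shell
header='>VFG0676_lef_[bacteria_name]_(VF0142)'

# "[][()]" is a single bracket expression: the leading "]" is literal,
# followed by "[", "(" and ")".
stripped=$(printf '%s\n' "$header" | sed 's/[][()]//g')
printf '%s\n' "$stripped"
```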

If you are on OS X and want to edit in place, it is slightly different: sed -i '' 's/ /_/g' foo.fa

written 4.2 years ago by Alex Reynolds

Brian Bushnell wrote:

BBTools has a read reformatter which will replace all of the whitespace in headers with underscores:

reformat.sh in=reads.fasta out=fixed.fasta addunderscore

written 4.2 years ago by Brian Bushnell