Question: Please help with removing spaces from fasta file
seta wrote:

Hi all,

I'm dealing with a FASTA file that has spaces at the end of its lines, which is causing problems. I haven't found a suitable way to remove them. Could you please tell me the appropriate command for removing them?

modified 4.9 years ago by Atu • written 4.9 years ago by seta

dschika wrote:
sed 's/ *$//g' in.fasta > out.fasta

This removes only spaces at the end of each line. To remove trailing tabs as well as spaces, use:

sed 's/\s*$//g' in.fasta > out.fasta
modified 11 months ago by _r_am • written 4.9 years ago by dschika

Note that with sed on Mac OS X you have to use [[:space:]] instead of \s:

sed "s/[[:space:]]*$//g" in.fasta > out.fasta
modified 11 months ago by _r_am • written 4.9 years ago by Jean-Karim Heriche
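
For readers who want to see the distinction outside sed, here is a small Python sketch of the two behaviours discussed above: stripping only trailing spaces versus stripping any trailing whitespace. The function names are made up for illustration, and Python's idea of whitespace is only roughly the same as sed's [[:space:]].

```python
def strip_trailing_spaces(line):
    """Drop trailing space characters only, roughly like sed 's/ *$//'."""
    return line.rstrip(" ")


def strip_trailing_whitespace(line):
    """Drop trailing spaces, tabs and other whitespace, roughly like sed 's/[[:space:]]*$//'."""
    return line.rstrip()


sample = "ACGTACGT \t "
print(repr(strip_trailing_spaces(sample)))      # the tab (and the space before it) survives
print(repr(strip_trailing_whitespace(sample)))  # only the bases remain
```

If the offending characters turn out to be tabs rather than spaces, only the second variant will clean the line.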

Jean-Karim Heriche wrote:

Not a bioinformatics question; you should try Stack Overflow for this, but here is a quick answer in Perl:

perl -i.bak -pe 's/\h+$//' sequences.fa
modified 11 months ago by _r_am • written 4.9 years ago by Jean-Karim Heriche

Thanks. I tried the command, but all the sequences in the file were removed, so grep -c ">" file1.fa returned 0.

modified 11 months ago by _r_am • written 4.9 years ago by seta

Try the following:

perl -i.bak -pe "s/\s+$/\n/;" sequences.fa

Note that this will remove all trailing whitespace characters from each line (including the newline) and replace them with a single newline.

modified 11 months ago by _r_am • written 4.9 years ago by harold.smith.tarheel

What's your Perl version? The \h character class was introduced in Perl 5.10.

written 4.9 years ago by Jean-Karim Heriche

It's v5.18.2.

modified 11 months ago by _r_am • written 4.9 years ago by seta

Not sure why it didn't work for you. I tested it with 5.12 and 5.20 and it worked fine.

written 4.9 years ago by Jean-Karim Heriche

Antonio R. Franco wrote:

Is the problem that you have a trailing space, or that the end-of-line character is missing?

If your data are tab-separated and you have a space only at the end of the line, you can do the following:

tr -d " " < file.fasta > newfile.fasta

But note that this will get rid of all spaces, including those in the middle of the line.

modified 11 months ago by _r_am • written 4.9 years ago by Antonio R. Franco
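
To make the caveat about tr concrete, here is a small Python sketch contrasting deletion of every space with deletion of trailing spaces only. The header string and the function names are invented for illustration.

```python
def delete_all_spaces(line):
    """Mimic tr -d " ": remove every space, wherever it occurs."""
    return line.replace(" ", "")


def delete_trailing_spaces(line):
    """Remove spaces only at the end of the line."""
    return line.rstrip(" ")


header = ">seq1 Homo sapiens example "   # hypothetical header with a description
print(delete_all_spaces(header))         # >seq1Homosapiensexample
print(delete_trailing_spaces(header))    # >seq1 Homo sapiens example
```

With the first approach the sequence ID and its description are fused together, which matters if downstream tools split the header on the first space.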

biocyberman wrote:

Oh my gawk!

All previous solutions would risk modifying your fasta header as well. This one will not.

gawk 'BEGIN{line=0}{ if ($0 !~ /^>/ && $0 ~ / +/) {gsub(/ +/, ""); line++} print}END{print line " lines with white spaces treated" > "/dev/stderr"}' myfasta.fa > output.fa

If you only want to remove the spaces at the end of the lines:

gawk 'BEGIN{line=0}{ if ($0 !~ /^>/ && $0 ~ / +$/) {gsub(/ +$/, ""); line++} print}END{print line " lines with white spaces treated" > "/dev/stderr"}' myfasta.fa > output.fa
modified 11 months ago by _r_am • written 4.9 years ago by biocyberman
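
The same header-aware idea can be sketched in Python. This is only an illustration, not a port of the gawk one-liner: the function name clean_fasta_lines and the sample records are assumptions, and it handles the trailing-space case only.

```python
def clean_fasta_lines(lines):
    """Strip trailing spaces from sequence lines; leave header lines (">...") untouched.

    Returns the cleaned lines and the number of lines that were changed.
    """
    cleaned = []
    changed = 0
    for line in lines:
        text = line.rstrip("\n")
        if not text.startswith(">"):
            stripped = text.rstrip(" ")
            if stripped != text:
                changed += 1
            text = stripped
        cleaned.append(text + "\n")
    return cleaned, changed


# Hypothetical records: the header keeps its trailing space, the sequence lines do not.
records = [">seq1 description \n", "ACGT  \n", "TTGA\n"]
out_lines, n_changed = clean_fasta_lines(records)
print(out_lines)
print(n_changed, "lines with white spaces treated")
```

To apply it to a real file, pass an open file handle instead of the list and write the returned lines to the output file.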

It's true that the solution with sed could also alter the fasta header.

But: have you ever been in a situation where removing whitespaces at the end (!) of the header would mess up something? I hope not ;)

written 4.9 years ago by dschika

To be fair, the probability of that is low :-)

modified 4.9 years ago • written 4.9 years ago by biocyberman

Yes, they could alter the header, but only by removing white space from the end of it (the $ anchors the match at the end of the line). The problem reported was with white space at the end of lines; whether it was limited to non-header lines wasn't specified.

written 4.9 years ago by Jean-Karim Heriche

I was just being paranoid and wanted to present a gawk-based solution :-)

modified 11 months ago by _r_am • written 4.9 years ago by biocyberman

Atu wrote:

Hi,

I think you could make use of the Python rstrip() string method. Just call it on each line while reading your FASTA file, and it will strip the trailing white space as you want.

with open('path_to_fasta_file') as handle:
    for line in handle:
        print(line.rstrip())

Copy the code into a file, say my_script.py, and run it, redirecting the output to a new file:

python my_script.py > out.fasta

There you go

modified 11 months ago by _r_am • written 4.9 years ago by Atu

Wouldn't this also strip the newline characters?

written 4.9 years ago by Tej Sowpati

Yes, any trailing whitespace will be removed (spaces plus the newline character), but a newline is added back by the print function, so the output FASTA should be well-formed.

Happy New Year!

written 4.9 years ago by Atu
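
The strip-then-restore behaviour described in this exchange can be made explicit by writing the newline back by hand instead of relying on print. A minimal sketch follows; the function name rstrip_lines is an assumption, and the in-memory handles stand in for real files.

```python
import io


def rstrip_lines(in_handle, out_handle):
    """Copy lines from in_handle to out_handle without trailing whitespace.

    rstrip() also removes the line ending, so exactly one newline is written back per line.
    """
    for line in in_handle:
        out_handle.write(line.rstrip() + "\n")


# Small in-memory demonstration; with real files, pass handles obtained from open().
source = io.StringIO(">seq1 \nACGT  \nTTGA\t\n")
target = io.StringIO()
rstrip_lines(source, target)
print(repr(target.getvalue()))  # '>seq1\nACGT\nTTGA\n'
```

Because every line gets exactly one newline back, the output also ends up with uniform line endings, at the cost of touching header lines as well.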