Question

Removing (stubborn) new line from Fasta file sequence?

1

15 months ago

Eliveri ▴ 350

I have a fasta file in this format:

>WP_003850266.1 toxin [Corynebacterium diphtheriae]
MSRKLFASILIGALLGIGAPPSAHAGADDV
EQVGTEEFIKRFGDGASRVVLSLPFAEGS
AVHHNT

I want it to look like this:

>WP_003850266.1 toxin [Corynebacterium diphtheriae]
MSRKLFASILIGALLGIGAPPSAHAGADDVEQVGTEEFIKRFGDGASRVVLSLPFAEGSAVHHNT

However, for this particular FASTA file, the newlines cannot be removed no matter what I try.

I have already tried:

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < test.fasta > output.fasta

But the newlines remain.

fasta • 738 views

updated 15 months ago by seidel 11k • written 15 months ago by Eliveri ▴ 350

1

Try bioawk or something similar, for example:

bioawk -c fastx '{print ">"$name"\n"$seq}' test.fasta > out.fasta

15 months ago by mohammadhassanj ▴ 260

1

15 months ago

Carambakaracho ★ 3.2k

I still love to solve these things with Perl oneliners.

perl -nwe 'if(s/^>/\n>/){s/\r?\n$/\n/;}else{s/\r?\n$//};print $_' test.fasta | tail -n +2

Explanation: if the line starts with >, replace that with a newline followed by > (\n>), then match an optional carriage return (\r?) plus the newline (\n) at the end of the line and replace them with a plain \n. Otherwise, match the optional carriage return and the newline and replace them with nothing. Then print $_, the current line. The tail is required because I didn't include a check for the first line, so the output now starts with an empty line.

Previously I was convinced Perl regex oneliners are much better than awk, as I never cared to learn awk. After more and more time without active Perl development, I have come to acknowledge Perl's picket fencing.

score 3 · Accepted Answer · 2023-01-19

Some lines in your file, but not all, end with a carriage return (\r or ^M):

tail -2 test.fasta | od -c
0000000    S   T   N   S   R   L   C   A   V   F   V   R   S   G   Q   P
0000020    V   I   G   A   C   T   S   P   Y   D   G   K   Y   W   S   M
0000040    Y   S   R   L   R   K   M   L   Y   L   I   Y   V   A   G   I
0000060    S   V   R   V   H   V   S   K   E   E   Q   Y   Y   D   Y   E
0000100    D   A   T   F   E   T  \r  \n   Y   A   L   T   G   I   S   I
0000120    C   N   P   G   S   S   L   C  \n
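
To see how many lines are affected without reading the od dump by eye, the carriage returns can be counted. This is a minimal sketch; the function name count_cr_lines is made up for illustration and is not part of the original answer.

```shell
# Hypothetical helper: count lines that end in a carriage return (\r).
# Reads the files given as arguments, or standard input if none are given.
# Usage: count_cr_lines test.fasta
count_cr_lines() {
  # n + 0 forces a numeric 0 when no line matched
  awk '/\r$/ { n++ } END { print n + 0 }' "$@"
}
```

A count greater than zero but smaller than the total number of lines would match the mixed line endings seen in the dump above.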

One easy solution is to simply preface your command with sed to replace the carriage returns with nothing:

sed -e 's/\r//g' test.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'

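The sed part can be read as: substitute/thispattern/forthatpattern/global, so s/\r//g replaces every carriage return with nothing. This is needed because awk strips only the \n from each line it reads, so any \r stays in the joined sequence.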
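
The same idea can also be done in a single awk pass, which additionally avoids the empty first line that the command above produces. This is a sketch rather than the answer's own method; the function name unwrap_fasta and the seq flag are assumptions introduced here.

```shell
# Hypothetical helper: strip carriage returns and join wrapped sequence lines.
# Reads the files given as arguments, or standard input if none are given.
# Usage: unwrap_fasta test.fasta > output.fasta
unwrap_fasta() {
  awk '
    { sub(/\r$/, "") }            # drop a trailing carriage return, if any
    /^>/ {
      if (seq) printf "\n"        # close the previous sequence line
      print                       # header line, followed by a newline
      seq = 0
      next
    }
    { printf "%s", $0; seq = 1 }  # sequence line: append without a newline
    END { if (seq) printf "\n" }  # terminate the last sequence
  ' "$@"
}
```

The seq flag records whether any sequence text has been written since the last header, so a newline is only emitted when there is an open sequence line to close.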