Removing (stubborn) new line from Fasta file sequence?
9 weeks ago
Eliveri ▴ 340

I have a fasta file in this format:

>WP_003850266.1 toxin [Corynebacterium diphtheriae]
EQVGTEEFIKRFGDGASRVVLSLPFAEGS
AVHHNT


I want it to appear like this:

>WP_003850266.1 toxin [Corynebacterium diphtheriae]
EQVGTEEFIKRFGDGASRVVLSLPFAEGSAVHHNT

However, for this particular fasta file, no matter what I try, the newlines cannot be removed.

awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}' < test.fasta > output.fasta


But the new lines remain ...


Try bioawk, for example, or something similar:

bioawk -cfastx '{print ">"$name"\n"$seq}' test.fasta > out.fasta

9 weeks ago
seidel 11k

Your file has some lines with carriage returns (\r or ^M), but not all:

tail -2 test.fasta | od -c
0000000    S   T   N   S   R   L   C   A   V   F   V   R   S   G   Q   P
0000020    V   I   G   A   C   T   S   P   Y   D   G   K   Y   W   S   M
0000040    Y   S   R   L   R   K   M   L   Y   L   I   Y   V   A   G   I
0000060    S   V   R   V   H   V   S   K   E   E   Q   Y   Y   D   Y   E
0000100    D   A   T   F   E   T  \r  \n   Y   A   L   T   G   I   S   I
0000120    C   N   P   G   S   S   L   C  \n
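
To check a whole file rather than reading od output by eye, one option is to count the lines that end in a carriage return. This is a minimal sketch, assuming a POSIX shell and grep; count_crlf_lines is a hypothetical helper name, not something from the original post:

```shell
# Print the number of lines in a file that end in a carriage return (CRLF endings).
# printf builds a literal CR so the pattern does not depend on grep escape support.
count_crlf_lines() {
    cr=$(printf '\r')
    grep -c "${cr}\$" "$1"
}
```

Running count_crlf_lines test.fasta and comparing the result with wc -l shows whether all, some, or none of the lines carry a carriage return.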


One easy solution is to simply preface your command with sed to replace the carriage returns with nothing:

sed -e 's/\r//g' test.fasta | awk '/^>/ {printf("\n%s\n",$0);next; } { printf("%s",$0);}  END {printf("\n");}'


The sed part can be read as: substitute/thispattern/forthatpattern/global.
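
The same idea can probably be folded into a single awk call, which also avoids the empty first line that the original awk command prints. This is only a sketch, assuming an awk that understands \r in regular expressions (POSIX awk does); linearize_fasta and the variable name open are made up for illustration:

```shell
# Join wrapped FASTA sequence lines into one line per record,
# stripping a trailing carriage return from every line first.
linearize_fasta() {
    awk '
        { sub(/\r$/, "") }            # drop a trailing CR, if present
        $0 == "" { next }             # skip empty lines
        /^>/ {                        # header line
            if (open) printf "\n"     # terminate the previous sequence
            open = 0
            print
            next
        }
        { printf "%s", $0; open = 1 } # sequence line: append without newline
        END { if (open) printf "\n" } # terminate the last sequence
    ' "$1"
}
```

It would be used as linearize_fasta test.fasta > output.fasta. Stripping the carriage return inside awk keeps everything in one process, while the sed prefix has the advantage of leaving the existing awk command untouched.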

9 weeks ago
Carambakaracho ★ 3.2k

I still love to solve these things with Perl oneliners.

perl -nwe 'if(s/^>/\n>/){s/\r?\n$/\n/;}else{s/\r?\n$//};print $_' test.fasta | tail -n +2


Explanation: if the line starts with >, substitute it with a newline and >: \n>, then match the optional carriage return \r? and newline \n and replace them with \n. Otherwise, match the optional carriage return \r? and newline \n and replace them with nothing. Then print the current line, $_. The tail is required because I didn't include a check for the first line, which is now an empty line.

Previously I was convinced that Perl regex one-liners were much better than awk, as I never cared to learn awk. After more and more time without active Perl development, I think I am coming to acknowledge Perl's picket fencing.