Make multi-fasta using perl
3
0
Entering edit mode
6.6 years ago
sun.nation ▴ 120

Header and sequence are coming in loop:

$header = "seq1";$sequence = "GTGGTCAGAAAAGTGGCCTCAGCCCATTTCCAGCATCCTATGGCTATCTAAGTAGGAAGACGATGAGCTATGGAA";

Sequence is single line without newline.

I want output multi-fasta like this below. Sequence should have newline after each 60 bases.

>seq1
GTGGTCAGAAAAGTGGCCTCAGCCCATTTCCAG
CATCCTATGGCTATCTAAGTAGGAAGACGATGA
GCTATGGAA

>seq2
AAGTGGCCTCAGCCCATTTCCAGCATCCTATGGC
TATCTAAGTAGGAAGACGATGAGCTATGGAA

Thanks

perl fasta multi-fasta • 2.7k views
1
Entering edit mode

This was asked before (Using Perl Text::Wrap For Wrapping Text) and the solution (pre-v5.10) is:

$sequence =~ s/(.{60})/$1\n/gs;

That formats the sequence, then you just need to join the id and sequence with a newline when you print.

0
Entering edit mode

Thanks a lot SES.

It was a fantastic solution with a line code. It worked good when I added this in my script.

0
Entering edit mode

Can you please give me some hint to create loop in perl? I need that at the end of my perl code.

Thanks Ram and Dylan

0
Entering edit mode

Look at the Text::Wrap module if you need a perl specific and easy method.

0
Entering edit mode

Thanks. I will try that

0
Entering edit mode

I think the following lines could be an alternative solution.

print ">$header\n";$seq_len = length($sequence); for ($i = 0; $i <$seq_len; $i + 60) {$segment = substr($sequence,$i, 60);

print "$segment\n"; } ADD REPLY 0 Entering edit mode Yes, the loop is a good basic alternative, but the RegEx is better, this being Perl. ADD REPLY 3 Entering edit mode The problem with the regex is that it is limited to a max sequence length of ~32k, which gives you trouble with contigs, etc. Here is a loop solution that should be faster than regex. my$single_line_seq = "ATTGC"x1000;
my $line_width = 80; print$_,"\n" for unpack("(A$line_width)*",$single_line_seq);
3
Entering edit mode
6.6 years ago
SES 8.5k

Here is a comparison of the two methods:

Here are the results:

$perl biostars140255.pl 2>/dev/null Benchmark: timing 10 iterations of regex, unpack... regex: 17 wallclock secs (15.48 usr + 1.44 sys = 16.92 CPU) @ 0.59/s (n=10) unpack: 66 wallclock secs (29.63 usr + 34.99 sys = 64.62 CPU) @ 0.15/s (n=10) s/iter unpack regex unpack 6.46 -- -74% regex 1.69 282% -- The benchmark shows that regex is about 6X faster than unpack. Also, there is no length limit (the input in the script is a 100 Mb string). Are you referring to a 32-bit system with those limits? ADD COMMENT 1 Entering edit mode How I see SES now: ADD REPLY 0 Entering edit mode I was looking for a mic drop gif but just gave up myself. ADD REPLY 0 Entering edit mode For your viewing pleasure: http://giphy.com/search/mic-drop/3 EDIT: Removing GIF so I don't encourage image uploads just for the sake of uploads. ADD REPLY 0 Entering edit mode Ha! I might have been a bit terse but I honestly wasn't sure until I did the benchmark so I wanted to share. Also, I use 5.20 in the script because there is an awesome optimization (called "copy on write") that is lightening fast for copying strings. Here is an example script: my$x = "A" x 1_000_000;
my $y =$x for 1...1_000_000;

That makes a long string and copies it a bunch of times in memory. Super useful for genome analysis. With Perl version 5.10.1 that takes 250.8 seconds on my machine. With Perl 5.20.2, that takes 0.15 seconds!

0
Entering edit mode

I had more than 200k in a single fasta, regex worked good. Thanks

0
Entering edit mode

Okay, my bad - I shouldn't have mentioned speed in the first place. unpack can be faster, especially if I'm not too lazy to put print outside the loop. I am also sure, one could optimize things further...

'unpack' => sub {
say STDERR join("\n", unpack("(A60)*", \$seq));
}

Rate  regex unpack
regex  1.74/s     --   -47%
unpack 3.27/s    88%     --


Regarding the limit of 32766, I got confused there. It only applies to a maximum size of the capture group and hasn't any effect here, unless you want your lines to be longer than 32 kbp ;)

0
Entering edit mode
6.6 years ago
Ram 35k

Either use a loop and introduce a new line characters after every 60 characters in the sequence, or use Picard's NormalizeFasta if you'd like a tried and tested method.

0
Entering edit mode

You can also try cat *.fasta | fold > all.fasta if you know that the headers are less than 60 bp.

0
Entering edit mode
6.6 years ago
Prakki Rama ★ 2.5k

fasta_formatter -i input.fasta -o output.fasta -w 60