Rearrange The Order Of Two Sequences According To Ids In 27000 Files
2
1
Entering edit mode
7.2 years ago
biolab ★ 1.3k

Hi everyone,

I have 27000 fasta files (out_1 to out_27000), each of which contains only two sequences. One sequence ID starts with >AT (Arabidopsis) while the other doesn't start with >AT (another species). For all files, I need to keep Arabidopsis sequence (>AT... ) on top followed by non-Arabidopsis sequence (non >AT...).

I tried to write a perl script, but failed (no error message, but output files are uncorrect). I am really a perl beginner. My script is attached here. Could anyone please help to correct my script or maybe a Linux command for batch running? Thank you very much!! Note: these files are in multi-line sequence format.

#!/usr/bin/perl

use strict;
use warnings;

my $i; for ($i=1; $i<=27000;$i++){

open my $file1, '<', "out_$i.afa";      #read files;
open my $file2, '>', "newout_$i.afa";   #output to new files;

my %h;                                  #the following stores sequences into a hash;
local $/ = '>'; while (<$file1>) {
chomp;
/(.+?)\n(.+)/s and $h{$1} = $2 or next; } my$k='';                                #the following runs hash to produce k1 k2 v1 v2;
my $k1=''; my$k2='';
my $v1=''; my$v2='';

foreach $k (keys %h){ if ($k =~/^\>AT/){
$k1=$k; $v1=$h{$k}; # k1 v1 are Arabidopsis; }else{$k2=$k;$v2 = $h{$k};      # k2 v2 are another species;
}
}

print $file2 ">$k1\n$v1\n>$k2\n$v2"; close$file1;
close $file2; }  perl • 2.6k views ADD COMMENT 2 Entering edit mode Why do you need this? ADD REPLY 0 Entering edit mode Pgibas, Thanks for your attention. I finally sorted out the problem myself. The condition $k =~/^\>AT/ should be changed to $k =~/^AT/. Actually these 27000 files are orthologous gene seqeunces of Arabidopsis thaliana and Arabidopsis lyrata. This is only an intermediate step. As the next steps, I need to merge all 27000 files into one file, which should be well ordered, i.e., each A.thaliana gene followed by its orthologous A.lyrata gene, and then subjected to miRNA target prediction. The input format is compulsory, that's why I need to make Athaliana sequence on the top followed by Alyrata sequence in each of these 27000 files. ADD REPLY 3 Entering edit mode 7.2 years ago Daniel ★ 3.8k There's an easier way to do this. Personally I would combine all the sequences into one file: cat * > all_sequences.fasta  and then (assuming that your sequences are on one line) pull out the sequences with grep '^>AT' -A 1 all_sequences.fasta > all_AT.fasta  NOTE: If your sequences are not all on one line (they have line breaks every 50 bases or so), I'd save this then run it first top make it into one line: #!/usr/bin/perl use 5.010;$filename = $ARGV[0]; chomp(@lines = <>); foreach$line (@lines){
if($line =~ /^>/){$line = "\n$line\n"; } if($line != /^$/){ print "$line";
}
}

1
Entering edit mode

If the OP did want to combine all AT from the 27000 files (multi-line sequences or not), you could also just:

perl -nE 'BEGIN{$/=">"}chomp;say">$_"if/^AT/' * > all_AT.fasta

0
Entering edit mode

Sorry, I realise I misread your post, I thought you just wanted to pull them out. Hope this might come in useful one day...

0
Entering edit mode

Mabeuf, Thanks a lot anyway! My post is a little bit ambigous. To me, your code is really a good stuff to learn.

1
Entering edit mode
7.2 years ago
Kenosis ★ 1.3k

The following take a little different approach:

use strict;
use warnings;
use autodie;

local ( $/,$" ) = ( '>', "\n" );

for my $i ( 1 .. 27000 ) { my @arr; open my$file1, '<', "out_$i.afa"; while (<$file1>) {
chomp;
$arr[ /^AT/ ? 0 : 1 ] = ">$_";
}

close $file1; open my$file2, '>', "newout_$i.afa"; print$file2 "@arr";
close $file2; }  • It uses autodie to handle open/close errors • The for my$i (1 .. 2700) { ... is more Perl-like than (and usually preferred over) the C-style loops
• It uses an array and a ternary operator to hold the sequences in their proper place.
• it sets \$" to \n, forcing newlines between array elements when an array is interpolated, i.e., when an array is placed within a string

Hope this helps!

1
Entering edit mode

Hi Kenosis, thanks a lot! Your approach is much better. It really helps.

0
Entering edit mode

You're most welcome!