Rearrange The Order Of Two Sequences According To Ids In 27000 Files
2
1
Entering edit mode
7.2 years ago
biolab ★ 1.3k

Hi everyone,

I have 27000 fasta files (out_1 to out_27000), each of which contains only two sequences. One sequence ID starts with >AT (Arabidopsis) while the other doesn't start with >AT (another species). For all files, I need to keep Arabidopsis sequence (>AT... ) on top followed by non-Arabidopsis sequence (non >AT...).

I tried to write a perl script, but failed (no error message, but output files are uncorrect). I am really a perl beginner. My script is attached here. Could anyone please help to correct my script or maybe a Linux command for batch running? Thank you very much!! Note: these files are in multi-line sequence format.

#!/usr/bin/perl

use strict;
use warnings;

my $i;
for ($i=1; $i<=27000; $i++){

open my $file1, '<', "out_$i.afa";      #read files;
open my $file2, '>', "newout_$i.afa";   #output to new files;

my %h;                                  #the following stores sequences into a hash;
local $/ = '>';

while (<$file1>) {
    chomp;
    /(.+?)\n(.+)/s and $h{$1} = $2 or next;
}

my $k='';                                #the following runs hash to produce k1 k2 v1 v2;
my $k1='';
my $k2='';
my $v1='';
my $v2='';

foreach $k (keys %h){
        if ($k =~/^\>AT/){
                $k1=$k; $v1=$h{$k};        # k1 v1 are Arabidopsis; 
        }else{
                $k2=$k; $v2 = $h{$k};      # k2 v2 are another species;
        }
}

print $file2 ">$k1\n$v1\n>$k2\n$v2";

close $file1;
close $file2;
}
perl • 2.6k views
ADD COMMENT
2
Entering edit mode

Why do you need this?

ADD REPLY
0
Entering edit mode

Pgibas, Thanks for your attention. I finally sorted out the problem myself. The condition $k =~/^\>AT/ should be changed to $k =~/^AT/. Actually these 27000 files are orthologous gene seqeunces of Arabidopsis thaliana and Arabidopsis lyrata. This is only an intermediate step. As the next steps, I need to merge all 27000 files into one file, which should be well ordered, i.e., each A.thaliana gene followed by its orthologous A.lyrata gene, and then subjected to miRNA target prediction. The input format is compulsory, that's why I need to make Athaliana sequence on the top followed by Alyrata sequence in each of these 27000 files.

ADD REPLY
3
Entering edit mode
7.2 years ago
Daniel ★ 3.8k

There's an easier way to do this. Personally I would combine all the sequences into one file:

cat * > all_sequences.fasta

and then (assuming that your sequences are on one line) pull out the sequences with

grep '^>AT' -A 1 all_sequences.fasta > all_AT.fasta

NOTE: If your sequences are not all on one line (they have line breaks every 50 bases or so), I'd save this then run it first top make it into one line:

#!/usr/bin/perl
use 5.010;

$filename = $ARGV[0];   
chomp(@lines = <>);

foreach $line (@lines){
        if($line =~ /^>/){
                $line = "\n$line\n";
        }
        if($line != /^$/){
                print "$line";
        }
}
ADD COMMENT
1
Entering edit mode

If the OP did want to combine all AT from the 27000 files (multi-line sequences or not), you could also just:

perl -nE 'BEGIN{$/=">"}chomp;say">$_"if/^AT/' * > all_AT.fasta

ADD REPLY
0
Entering edit mode

Sorry, I realise I misread your post, I thought you just wanted to pull them out. Hope this might come in useful one day...

ADD REPLY
0
Entering edit mode

Mabeuf, Thanks a lot anyway! My post is a little bit ambigous. To me, your code is really a good stuff to learn.

ADD REPLY
1
Entering edit mode
7.2 years ago
Kenosis ★ 1.3k

The following take a little different approach:

use strict;
use warnings;
use autodie;

local ( $/, $" ) = ( '>', "\n" );

for my $i ( 1 .. 27000 ) {
    my @arr;
    open my $file1, '<', "out_$i.afa";

    while (<$file1>) {
        chomp;
        $arr[ /^AT/ ? 0 : 1 ] = ">$_";
    }

    close $file1;

    open my $file2, '>', "newout_$i.afa";
    print $file2 "@arr";
    close $file2;
}
  • It uses autodie to handle open/close errors
  • The for my $i (1 .. 2700) { ... is more Perl-like than (and usually preferred over) the C-style loops
  • It uses an array and a ternary operator to hold the sequences in their proper place.
  • it sets $" to \n, forcing newlines between array elements when an array is interpolated, i.e., when an array is placed within a string

Hope this helps!

ADD COMMENT
1
Entering edit mode

Hi Kenosis, thanks a lot! Your approach is much better. It really helps.

ADD REPLY
0
Entering edit mode

You're most welcome!

ADD REPLY

Login before adding your answer.

Traffic: 1660 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6