Question: Rearrange The Order Of Two Sequences According To Ids In 27000 Files
1
gravatar for biolab
6.3 years ago by
biolab1.2k
biolab1.2k wrote:

Hi everyone,

I have 27000 fasta files (out_1 to out_27000), each of which contains only two sequences. One sequence ID starts with >AT (Arabidopsis) while the other doesn't start with >AT (another species). For all files, I need to keep Arabidopsis sequence (>AT... ) on top followed by non-Arabidopsis sequence (non >AT...).

I tried to write a perl script, but failed (no error message, but output files are uncorrect). I am really a perl beginner. My script is attached here. Could anyone please help to correct my script or maybe a Linux command for batch running? Thank you very much!! Note: these files are in multi-line sequence format.

#!/usr/bin/perl

use strict;
use warnings;

my $i;
for ($i=1; $i<=27000; $i++){

open my $file1, '<', "out_$i.afa";      #read files;
open my $file2, '>', "newout_$i.afa";   #output to new files;

my %h;                                  #the following stores sequences into a hash;
local $/ = '>';

while (<$file1>) {
    chomp;
    /(.+?)\n(.+)/s and $h{$1} = $2 or next;
}

my $k='';                                #the following runs hash to produce k1 k2 v1 v2;
my $k1='';
my $k2='';
my $v1='';
my $v2='';

foreach $k (keys %h){
        if ($k =~/^\>AT/){
                $k1=$k; $v1=$h{$k};        # k1 v1 are Arabidopsis; 
        }else{
                $k2=$k; $v2 = $h{$k};      # k2 v2 are another species;
        }
}

print $file2 ">$k1\n$v1\n>$k2\n$v2";

close $file1;
close $file2;
}
perl • 2.4k views
ADD COMMENTlink modified 6.3 years ago by Kenosis1.2k • written 6.3 years ago by biolab1.2k
2

Why do you need this?

ADD REPLYlink written 6.3 years ago by PoGibas4.8k

Pgibas, Thanks for your attention. I finally sorted out the problem myself. The condition $k =~/^\>AT/ should be changed to $k =~/^AT/. Actually these 27000 files are orthologous gene seqeunces of Arabidopsis thaliana and Arabidopsis lyrata. This is only an intermediate step. As the next steps, I need to merge all 27000 files into one file, which should be well ordered, i.e., each A.thaliana gene followed by its orthologous A.lyrata gene, and then subjected to miRNA target prediction. The input format is compulsory, that's why I need to make Athaliana sequence on the top followed by Alyrata sequence in each of these 27000 files.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by biolab1.2k
3
gravatar for Daniel
6.3 years ago by
Daniel3.8k
Cardiff University
Daniel3.8k wrote:

There's an easier way to do this. Personally I would combine all the sequences into one file:

cat * > all_sequences.fasta

and then (assuming that your sequences are on one line) pull out the sequences with

grep '^>AT' -A 1 all_sequences.fasta > all_AT.fasta

NOTE: If your sequences are not all on one line (they have line breaks every 50 bases or so), I'd save this then run it first top make it into one line:

#!/usr/bin/perl
use 5.010;

$filename = $ARGV[0];   
chomp(@lines = <>);

foreach $line (@lines){
        if($line =~ /^>/){
                $line = "\n$line\n";
        }
        if($line != /^$/){
                print "$line";
        }
}
ADD COMMENTlink written 6.3 years ago by Daniel3.8k
1

If the OP did want to combine all AT from the 27000 files (multi-line sequences or not), you could also just:

perl -nE 'BEGIN{$/=">"}chomp;say">$_"if/^AT/' * > all_AT.fasta

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by Kenosis1.2k

Sorry, I realise I misread your post, I thought you just wanted to pull them out. Hope this might come in useful one day...

ADD REPLYlink written 6.3 years ago by Daniel3.8k

Mabeuf, Thanks a lot anyway! My post is a little bit ambigous. To me, your code is really a good stuff to learn.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by biolab1.2k
1
gravatar for Kenosis
6.3 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

The following take a little different approach:

use strict;
use warnings;
use autodie;

local ( $/, $" ) = ( '>', "\n" );

for my $i ( 1 .. 27000 ) {
    my @arr;
    open my $file1, '<', "out_$i.afa";

    while (<$file1>) {
        chomp;
        $arr[ /^AT/ ? 0 : 1 ] = ">$_";
    }

    close $file1;

    open my $file2, '>', "newout_$i.afa";
    print $file2 "@arr";
    close $file2;
}
  • It uses autodie to handle open/close errors
  • The for my $i (1 .. 2700) { ... is more Perl-like than (and usually preferred over) the C-style loops
  • It uses an array and a ternary operator to hold the sequences in their proper place.
  • it sets $" to \n, forcing newlines between array elements when an array is interpolated, i.e., when an array is placed within a string

Hope this helps!

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Kenosis1.2k
1

Hi Kenosis, thanks a lot! Your approach is much better. It really helps.

ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by biolab1.2k

You're most welcome!

ADD REPLYlink written 6.3 years ago by Kenosis1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1255 users visited in the last hour