Question: Control concatenation order in a fasta file
0
gravatar for Damian_Civ
10 months ago by
Damian_Civ0
Damian_Civ0 wrote:

Hello, Biostars community I would like to receive your support with the next question.

I have developed this simple script to concatenate fasta sequences from different files in one single resulting file according to the same ID of each sequence respectively and It works, nevertheless, I would like to know if there is a way to control the order of the concatenation according to the name of the file.

For example

fie1_OTIB.fasta

>seq1
aaa
>seq2
ccc

file2_OTIA.fasta

>seq1
ttt
>seq2
ggg

file3_OTSS.fasta

>seq1
ggg
>seq2
ttt

file4_OTLS

>seq1
ccc
>seq2
aaa

For now, the script is going to give me this result in one fasta file

>seq1
TTTAAAGGGCCC

Order for the seq1 >> file2 file1 file3 file4

>seq2
GGGCCCTTTAAA

Order for the seq2 >> file2 file1 file3 file4

Here I would like to know if I could modify the script to order the concatenation according to my preference file due that, for now, it is taking a predetermined order, I would like in one case for example

>seq1
AAATTTGGGCCC

ORDER = file1 file2 file3 file4

>seq2
AAATTTGGGCCC

ORDER = file1 file2 file3 file4

Thanks in advance for your valuable support,

use Bio::SeqIO;

$input = "FOLDER_INPUT_FASTA_DIRECTORY";
my %seqs;


opendir DH, $input or die "Cannot open directory: $!";
open (outfile, ">","OUTPUT_DIRECTOTY");

while (my $file = readdir(DH)) {                                                    
    next if $file =~ /^\./;                                                         
    $new_file= $input."/".$file;
    open GB, "$new_file" or die $!;

    my $seqio = Bio::SeqIO->new(-file => $new_file, -format => "fasta");

    while(my $seq = $seqio->next_seq) { 
        my $seq_seq = $seq->seq();
        my $seq_id = $seq->id();

        $seqobj = Bio::Seq->new( -display_id => $seq_id, -seq => $seq_seq);
        $seq_id = $seqobj->id();
        $seqstr   = $seqobj->seq();
        $seqs{$seqobj->display_id} .= $seqobj->seq;   
    }
}

foreach my $key (keys %seqs) { 
    print outfile ">$key\n$seqs{$key}\n";
}
bioperl • 209 views
ADD COMMENTlink modified 10 months ago by RamRS30k • written 10 months ago by Damian_Civ0

do all the files have same number of sequences? Damian_Civ. Try seqkit concat. It concatenates as per ID and file name order.

ADD REPLYlink modified 10 months ago • written 10 months ago by cpad011214k

In your code, opendir will get the directory content in sorted order.

As you mentioned if the names of the files are exactly the same as fie1_OTIB.fasta, file2_OTIA.fasta, file3_OTSS.fasta, file4_OTLS then you will get output as your expecting but if the file names are without file(number)_ prefix then opendir will get the directory content in a different order like OTIA.fasta, OTIB.fasta, OTLS.fasta, OTSS.fasta and your output will be file2 file1 file4 file3.

If this is the problem then renaming the files in lexicographical order would be a dummy solution.

Thanks.

ADD REPLYlink written 10 months ago by Nitin Narwade450
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1892 users visited in the last hour