Question: Help With Perl Script: Store Fasta Sequences Into A Hash.
0
gravatar for biolab
4.8 years ago by
biolab1.0k
biolab1.0k wrote:

Hi everyone I am working on a fasta file. I want to format it to a hash(SeqID as the key and Sequence as the value). I write a script, but somewhere is wrong. Could you please indicate for me? Thank you very much!

#!/bin/perl
use strict;
use warnings;

open IN, $ARGV[0];
while (<>){

$_ =~ s/[\r\n]/\t/g;  ##replace all newlines with tabs

my @a;
my %h; 
@a = split (/\t/, $_);  ##change to array

my $i;                  ##change array to hash
for ($i=0, $i<=$#a/2, $i++){
    my $id = shift @a;
    my $seq = shift @a;
    $h{$id} = $seq;
}
}
close IN;
perl • 6.5k views
ADD COMMENTlink modified 4.8 years ago by Neilfws48k • written 4.8 years ago by biolab1.0k
5
gravatar for Varun Gupta
4.8 years ago by
Varun Gupta1.0k
United States
Varun Gupta1.0k wrote:
    #!usr/bin/perl
    use strict;
    use warnings;

    my %id2seq = ();
    my $id = '';
    open F,"test.fa",or die $!;
    while(<F>){
        chomp;
        if($_ =~ /^>(.+)/){
            $id = $1;
        }else{
            $id2seq{$id} .= $_;
        }
    }
close F;

Hope this helps. You can then use the foreach loop to loop over the keys of the hash and manipulate the sequences associated with each id accordingly.

Varun

ADD COMMENTlink written 4.8 years ago by Varun Gupta1.0k

Really helpful again! Thanks!

ADD REPLYlink written 4.8 years ago by biolab1.0k
5
gravatar for Neilfws
4.8 years ago by
Neilfws48k
Sydney, Australia
Neilfws48k wrote:

Bioperl provides libraries for sequence parsing so as you don't have to write them. Have a look at Bio::SeqIO.

use strict;
use Bio::SeqIO;

my %sequences;
my $seqio = Bio::SeqIO->new(-file => "myfastafile.fa", -format => "fasta");
while(my$seqobj = $seqio->next_seq) {
    my $id  = $seqobj->display_id;    # there's your key
    my $seq = $seqobj->seq;           # and there's your value
    $sequences{$id} = $seq;
}
ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Neilfws48k
1

-format => "fasta" is programmatically unnecessary here, since Bio::SeqIO 'knows' the format by the file's extension.

Why not just:

while ( my $seqobj = $seqio->next_seq ) {
    $sequences{ $seqobj->display_id } = $seqobj->seq;
}

since the object's methods are self-documenting?

ADD REPLYlink modified 4.8 years ago • written 4.8 years ago by Kenosis1.2k
1

Because being explicit is helpful for beginners.

ADD REPLYlink written 4.8 years ago by Neilfws48k
2
gravatar for Kenosis
4.8 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Since the sequences can be multi-line, you can set Perl's record separator to '>', so one fasta record at a time is read. Then, a regex can be used to capture the id and seq: the id is all before the first space and the seq is all after the first newline. At this point, you can add the id/seq pair as a key/val pair in a hash, where $1 is the id and $2 is the seq:

use strict;
use warnings;

my %h;
local $/ = '>';

while (<>) {
    chomp;
    /(\w+).+?\n(.+)/s and $h{$1} = $2 or next;
}

Hope this helps!

ADD COMMENTlink modified 4.8 years ago • written 4.8 years ago by Kenosis1.2k

It really helps! Thank you very much!

ADD REPLYlink written 4.8 years ago by biolab1.0k

You're most welcome!

ADD REPLYlink written 4.8 years ago by Kenosis1.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 731 users visited in the last hour