Question: Problem With Perl Script When Extracting Sequences According To The Ids
2
gravatar for biolab
5.3 years ago by
biolab1.1k
biolab1.1k wrote:

Dear All,

I have a Fasta file and a List file. I need to extract sequences according to the IDs in each line of the List file, and output to a file. An examplified output result is shown below (Files out_1 and out_2). I wrote a perl script to do this job (useage: perl extract_seq.pl fastafile listfile), but got error message : Use of uninitialized value in string eq at ... Could you please point out my error and give me some suggesttions? Thank you very much!!

Fasta file

>a1
tggtacgtgtcagtcaacgtgggggggggggggg
>a2
cgtacgtgacgtacactacgtttttttttttttt
>a3
cgctagacgtataaaaatatatattataaaaaaaa
>a4
cccgcgcggcgcgcggcccccccccccccccccccc
>a5
tggtttttttttttttttttttttttttttttttttttt
>a6
cgtacgggggggggggggggggggggggggggggggg
>a7
cgctagacgtataaaaaaaaaaaaaaaaaaaaaaaaa
>a8
cccgcgcattattattatatattatattttattatat

List file

a1    a3    a4
a5    a6    a8

Ideal output files:

out_1

>a1
tggtacgtgtcagtcaacgtgggggggggggggg
>a3
cgctagacgtataaaaatatatattataaaaaaaa
>a4
cccgcgcggcgcgcggcccccccccccccccccccc

out_2

>a5
tggtttttttttttttttttttttttttttttttttttt
>a6
cgtacgggggggggggggggggggggggggggggggg
>a8
cccgcgcattattattatatattatattttattatat

Problematic Perl script

#!/usr/bin/perl
use strict;
use warnings;


#Firstly store Fasta sequences into a hash;
open SEQ, $ARGV[0];
my $id; 
my %seq;  

while (<SEQ>) {
    chomp;
    if (/^>(.+)/){ 
        $id = $1;
        }else{$seq{$id} .= $_;
        }
}
close SEQ;

#Next open LIST file and generate output files according to IDs in each line;
open LIST, $ARGV[1];
my $i;
while (<LIST>){
        $i ++;
        my @a = (split /\t/, $_);
        open OUT, ">> out_$i";  
        foreach (sort keys %seq){
                if($a[0] eq $_){
                        print OUT ">$_\n$seq{$_}\n";
                }elsif($a[1] eq $_){
                        print OUT ">$_\n$seq{$_}\n";
                }elsif($a[2] eq $_){
                        print OUT ">$_\n$seq{$_}\n";
                }
        }  
        close OUT;
}
close LIST;
perl • 4.0k views
ADD COMMENTlink modified 5.3 years ago by Istvan Albert ♦♦ 80k • written 5.3 years ago by biolab1.1k
3

Don't write your own code for basic tasks that have been scripted by so many other people already. Just google your question and you will easily get working solutions

ADD REPLYlink written 5.3 years ago by Irsan6.8k
4

Exercising is always good. Why not to appreciate the effort made.

ADD REPLYlink written 5.3 years ago by Pavel Senin1.9k
1

OK, I agree. I didn't think of this being an exercise. But sometimes I see people spending 3 days working on a buggy script for a task that can easily be accomplished with well established widely used tools. I think it is just as important to know what is already available.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Irsan6.8k
1

Try writing simple bash loop: for each line split IDs & grep from FASTA file & redirect into specific file.

ADD REPLYlink written 5.3 years ago by PoGibas4.8k
1

This site kind of is becoming as site for programming help. Learn debugging your code line for line and try to find the error yourself. A first semester student could see the bug and fix it. This is no scientific question but a programmatic one. If you can't programm, then I suggest to learn it first and dont copy your code in here for someone to help debug your code. Sorry for being rude (and honest)

ADD REPLYlink written 5.3 years ago by my_name_is_simon10
5

while I don't disagree completely, I think in this case, it's a bioinformatics question and the OP has shown his code and what he has tried rather than just asking how to do it without any effort.

ADD REPLYlink written 5.3 years ago by brentp23k
3
gravatar for SES
5.3 years ago by
SES8.2k
Vancouver, BC
SES8.2k wrote:

First, there are some really good parts of your code.

I'm not going to write the code for you because you are close. The warning you are getting is because those array elements don't contain what you think they do, therefore you're trying to match an array element that was never initialized. Kudos for using use strict; and use warnings;, those will always tell you what is wrong.

Your approach seems fine, but try to think about what is more efficient, storing all the sequences and IDs vs just a list of IDs. One last thing, $_ is the default scalar, and you should not have to explicitly write that. The exception being with map/grep blocks. In my opinion, writing $_ is not expressive and it makes it hard to debug or even understand the code (it's also very ugly).


Here's the bad part:

 open SEQ, $ARGV[0];

Seeing this in code usually indicates someone either learned Perl 15 years ago or is just learning from some very dated materials. That line may be the single worst line of Perl you could write, partly because that typeglob has global scope. If you are not extremely careful, you will clobber your data very fast doing this. I encourage you pick up a modern Perl book and read about open (or on the web, or at the command line: perldoc perlfunc). You'll find the authors of the language routinely say they are trying to "rid the world" of this syntax (quoted from the latest camel book). Aside from possibly ruining your data, you are not checking if there is a file and if can be opened. This may be part of your uninitialized warnings.

I know people that do Perl training for a living and the only reason they would write this is to show you what kind or horrors might be encountered in the wild. In the Perl open tutorial this single argument version of open is demonstrated, followed by the statement, "Why is this here? Someone has to cater to the hysterical porpoises." Without being too technical, they are trying to say this is no longer acceptable, so don't do it. Remember this and it will save you loads of trouble. Don't feel bad though, we all wrote this kind of thing at one point.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by SES8.2k

"...$_ is the implicit filehandle..."

Did you mean $_ is the default scalar?

You've made good points here...

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Kenosis1.2k

Yes, the topic or default scalar. I'll change the wording, my point was I avoid writing it if I can.

ADD REPLYlink written 5.3 years ago by SES8.2k
2
gravatar for Kenosis
5.3 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

Perhaps the following will be helpful:

use strict;
use warnings;

my ( %list, %FHs, $id );

while (<>) {
    $list{$_} = "out_$." for split;
    last if eof;
}

local $/ = '>';
while (<>) {
    chomp;
    if ( ($id) = /(.+)/ and exists $list{$id} ) {
        open $FHs{ $list{$id} }, '>', $list{$id} or die $! unless defined $FHs{ $list{$id} };
        print { $FHs{ $list{$id} } } ">$_";
    }
}

Usage: perl extract_seq.pl listfile fastafile

The script first processes the list file, building a hash where the keys are the IDs and the associated values are the file names. Next, it sets Perl's record separator $/ to >, so the fasta file is read in chunks of fasta records. When it's processing the fasta file, it captures the record's ID and checks whether it exists as a key in %list. If there is such a key, it checks whether a file handle (held in %FHs) has been defined for it. If there's no file handle associated with the captured ID, then one's created. Finally, the fasta record is printed to the associated file.

ADD COMMENTlink modified 5.3 years ago • written 5.3 years ago by Kenosis1.2k
1
gravatar for brentp
5.3 years ago by
brentp23k
Salt Lake City, UT
brentp23k wrote:

You can accomplish what you're trying to do in perl. Usually, it's better not to read the sequences in to memory. Here's an example that reads the list file into memory and splits the output files according to the line the header appears on in the list file:

open(LIST, $ARGV[1]);

my $i = 0;
my %names;
my %FHS;
while(<LIST>){
    $i++;
    open($FHS{$i}, ">seqs_$i.fa");
    my @names = split(/\s+/, $_);
    $names{$i} = \@names;
}

open(SEQ, $ARGV[0]);
my $id;
while(<SEQ>){
    if (/^>(.+)/){
        $id = $1;
        next;
    } else {
        $seq = $_;
    }
    foreach my $k (keys %names){
        my $fh = $FHS{$k};
        if(grep { $_ eq $id } @{$names{$k}}){
            print $fh ">$id\n$seq";
        } 
    }
}
ADD COMMENTlink written 5.3 years ago by brentp23k
2
  • Always: use strict; use warnings;.
  • Use the three-argument form of open.
  • Your script doesn't handle multi-line sequences; the OP's does.
  • Your script creates n-1 empty files, where n is the number of blank lines in the list file.

How efficient is your algorithm? For each fasta sequence, you're effectively interating through all of the lines of the list file, examining all of the IDs in those lines.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Kenosis1.2k

cheers. that'll teach me to mind my manners when answering perl questions. :) though you'd have to have a pretty big list for the "algorithm" to matter.

ADD REPLYlink written 5.3 years ago by brentp23k

The major issue is that this code will fail silently, so you would never understand why it will not work. You don't have to worry about the first point with modern Perl (these pragmas are implicit with modern web frameworks, object systems, and by specifying a version >= 5.12 will import strictures and anything >= 5.014 (I think) will enable warnings), but often people are using decades old Perls without knowing it. These are therefore best practices. The issue with open() is major (I explain this in my answer), and the only reason it's allowed is because of backwards compatibility (people relying on decades old code).

ADD REPLYlink written 5.3 years ago by SES8.2k
1
gravatar for Pavel Senin
5.3 years ago by
Pavel Senin1.9k
Los Alamos, NM
Pavel Senin1.9k wrote:

try too change this line

my @a = (split /\t/, $_);

to this

my @a = ( split /\s+/, $_ );
ADD COMMENTlink written 5.3 years ago by Pavel Senin1.9k
1

Dear all, Thank you all for correcting and commenting on my script, whatever positively or negatively. I am a true perl beginner. I especially thank Kenosis, SES and brentp for writing new code for me. Cheers!

ADD REPLYlink written 5.3 years ago by biolab1.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 808 users visited in the last hour