Question: How To Scan Genbank Records And Extract Information From A File Using Perl
wendy0 wrote, 6.5 years ago:

Hi everyone,

I have a file containing many GenBank records. I want to scan each record and, if it contains certain keywords, print its accession number. I looked at an example and tried to modify it as follows:

#!/usr/bin/perl 
use strict;
use warnings;
use BeginPerlBioinfo;

my $annotation;
my %fields;
my @genbank =();
my $locus = '';
my $accession = '';
my $reference = '';
my @features = ();
@genbank = get_file_data ('all_2.txt');

for my $line (@genbank) {
    if ($line =~ /^LOCUS/) {
        $line =~ s/^LOCUS\s*//;
        $locus = $line;
        print $locus;
    } elsif ($line =~ /^ACCESSION/) {
        $line =~ s/^ACCESSION\s*//;
        $accession = $line;
    } elsif ($line =~ /^REFERENCE/) {
        $line =~ s/^REFERENCE\s*//m;
        $reference = $line;
        print $reference;
    } elsif ($line =~ /^FEATURES/) {
        %fields = parse_annotation($annotation);
        @features = parse_features($fields {'FEATURES'});
        foreach my $feature (@features) {
            my ($featurename) = ($feature =~ /^{5}(\S+)/);
            print $feature;
            if ($locus=~ /keywords/i) || ($reference=~ /keywords/i) || ($feature=~ /keywords/i){
                print $accession;
            }
        }
    }
}

sub parse_features {
    my ($features) = @_;
    my (@features) = ();
    while ($features =~/^{5}\S.*\n(^{21}\S.*\n)*/gm){
        my $feature = $&;
        push (@features, $feature);
    }
    return @features;
}
exit;

I can print the locus and the accession number, but for the reference and the features I can only print the first line. I also get these errors:


Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in m/ ^{5}(\S+)/ <-- HERE

Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in /^{5}\S.*\n(^{21}\S.*\n)* <-- HERE

I know it is related to the regular expressions, but I do not know how to solve it, as I am just a beginner. How can I fix this so that I can print the whole reference and feature sections instead of only their first lines? Thanks.

perl genbank • 4.7k views
modified 6.5 years ago by Malachi Griffith • written 6.5 years ago by wendy0

Use BioPerl! It already has excellent libraries for parsing GenBank files that extract all the features, so you don't have to worry about low-level text parsing (which is always a good thing to avoid): http://www.bioperl.org/wiki/HOWTO:SeqIO
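
As a rough sketch of what that could look like for the original task (print the accession of every record that mentions a keyword), assuming BioPerl is installed: the helper names contains_keyword and scan_genbank, the script name in the usage comment, and the choice of which fields to search are all made up for illustration and are not taken from the HOWTO.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return true if any of the given text fragments contains the keyword
# (case-insensitive, keyword treated as a literal string).
sub contains_keyword {
    my ($keyword, @texts) = @_;
    return scalar grep { defined($_) && /\Q$keyword\E/i } @texts;
}

# Print the accession number of every GenBank record in $file whose
# locus name, definition, reference titles, or features mention $keyword.
sub scan_genbank {
    my ($file, $keyword) = @_;
    require Bio::SeqIO;    # loaded here so the helper above works without BioPerl

    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    while (my $seq = $in->next_seq) {
        my @texts = ($seq->display_id, $seq->desc);

        # Titles of the REFERENCE entries
        for my $ref ($seq->annotation->get_Annotations('reference')) {
            push @texts, $ref->title;
        }

        # Feature keys (gene, CDS, ...) and all qualifier values
        for my $feat ($seq->get_SeqFeatures) {
            push @texts, $feat->primary_tag;
            for my $tag ($feat->get_all_tags) {
                push @texts, $feat->get_tag_values($tag);
            }
        }

        print $seq->accession_number, "\n"
            if contains_keyword($keyword, @texts);
    }
}

# Hypothetical usage: perl scan_genbank.pl all_2.txt kinase
scan_genbank(@ARGV) if @ARGV == 2;
```

Because Bio::SeqIO hands back whole records, multi-line REFERENCE and FEATURES entries are already assembled, which sidesteps the "only the first line" problem of a line-by-line loop.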

written 6.5 years ago by Michael Schubert

The {n} notation says how many times to match the preceding element. In your patterns the quantifiers come directly after the ^ anchor, which matches a position rather than a character (it is "zero-length"), hence the errors of that type. If you have some data to share that you're trying to match, I'm sure someone here will assist. In the meantime, try regex101.com, as it's a good place to hone your regex skills.
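
A minimal sketch of one likely fix, assuming the intended patterns were `^ {5}` and `^ {21}` (a literal space repeated 5 or 21 times, matching the GenBank layout where feature keys start in column 6 and qualifiers in column 22). The feature table below is made-up sample data, and a capture group stands in for $&.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical feature table (the text that follows the FEATURES header line).
my $table = join '',
    "     source          1..120\n",
    "                     /organism=\"Homo sapiens\"\n",
    "     gene            10..90\n",
    "                     /gene=\"ABC1\"\n",
    "                     /note=\"example keyword\"\n";

# Split a feature table into one multi-line string per feature.
sub parse_features {
    my ($text) = @_;
    my @found;
    # ' {5}' quantifies a literal space. A bare '^{5}' tries to quantify the
    # zero-length anchor '^', which is what triggers the error message.
    while ($text =~ /^( {5}\S.*\n(?:^ {21}\S.*\n)*)/gm) {
        push @found, $1;
    }
    return @found;
}

my @features = parse_features($table);
for my $feature (@features) {
    my ($name) = $feature =~ /^ {5}(\S+)/;
    print "$name\n" if $feature =~ /keyword/i;   # prints "gene"
}
```

Each element of @features holds the feature line plus all of its qualifier lines, so a keyword test sees the whole feature rather than only its first line.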

BTW - Avoid using $& as it's costly; use ${^MATCH} together with the /p modifier instead (available since Perl 5.10).
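
As a small illustration of that tip (the ACCESSION line is a made-up sample): on older Perls ${^MATCH} is only guaranteed to be set when the match uses the /p modifier, and a plain capture group is a common alternative that avoids both variables.

```perl
use strict;
use warnings;

my $line = "ACCESSION   AB000001\n";   # hypothetical sample line

# Option 1: /p makes the matched text available in ${^MATCH}.
my $matched = '';
if ($line =~ /^ACCESSION\s+\S+/p) {
    $matched = ${^MATCH};
}
print "$matched\n";

# Option 2: capture only the part that is needed.
my ($accession) = $line =~ /^ACCESSION\s+(\S+)/;
print "$accession\n";
```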

modified 6.5 years ago • written 6.5 years ago by Kenosis
Malachi Griffith (Washington University School of Medicine, St. Louis, USA) wrote, 6.5 years ago:

As mentioned by Michael, you should consider using BioPerl.

Here is some example code: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Here is a chapter providing an intro to BioPerl: Perl Programming for Bioinformatics - Chapter 9

Chapter 10 of Beginning Perl for Bioinformatics also has some relevant example code (although it does not use BioPerl).

modified 6.5 years ago • written 6.5 years ago by Malachi Griffith

+1 for the feature annotation link.

modified 6.5 years ago • written 6.5 years ago by SES