Question

How To Scan Genbank Records And Extract Information From A File Using Perl

0

11.2 years ago

wendy • 0

Hi everyone,

I have a file containing many GenBank records. I want to scan each GenBank record and, if the record contains certain keywords, print its accession number. I looked at an example and tried to modify it as follows:

#!/usr/bin/perl 
use strict;
use warnings;
use BeginPerlBioinfo;

my $annotation;
my %fields;
my @genbank =();
my $locus = '';
my $accession = '';
my $reference = '';
my @features = ();
@genbank = get_file_data ('all_2.txt');

for my $line (@genbank){
    if($line =~/^LOCUS/){
         $line =~ s/^LOCUS\s*//;
         $locus = $line ;
         print $locus ;
   }elsif ($line =~/^ACCESSION/){
         $line =~ s/^ACCESSION\s*//;
         $accession = $line ;
    }elsif ($line =~/^REFERENCE/){
         $line =~ s/^REFERENCE\s*//m;
         $reference = $line;
         print $reference;
   }elsif ($line =~/^FEATURES/){
         %fields = parse_annotation($annotation);
         @features = parse_features($fields {'FEATURES'});
         foreach my $feature (@features) {
             my ($featurename) = ($feature =~ /^{5}(\S+)/);
             print $feature;
             # the whole condition of an if must sit inside one pair of parentheses
             if (($locus =~ /keywords/i) || ($reference =~ /keywords/i) || ($feature =~ /keywords/i)) {
                 print $accession;
             }
         }
  }
}

sub parse_features {
    my ($features) = @_;
    my (@features) = ();
    while ($features =~ /^{5}\S.*\n(^{21}\S.*\n)*/gm) {
        my $feature = $&;
        push (@features, $feature);
    }
    return @features;
}
exit;

I can print the locus and the accession number, but for the reference and feature parts I can only print the first line. I also get these errors:

Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in m/^{5}(\S+)/ <-- HERE

Quantifier unexpected on zero-length expression in regex; marked by <-- HERE in m/^{5}\S.*\n(^{21}\S.*\n)*/ <-- HERE

I know it is related to the regular expressions, but I do not know how to solve it as I am just a beginner. How can I fix this so that I can print the whole reference and features sections instead of just their first lines? Thanks.

genbank perl • 6.1k views

ADD COMMENT • link updated 11.2 years ago by Malachi Griffith 19k • written 11.2 years ago by wendy • 0

1

Use BioPerl! It already has excellent libraries for parsing GenBank files that can extract all the features, so you don't have to worry about low-level text parsing (which is always a good thing to avoid): http://www.bioperl.org/wiki/HOWTO:SeqIO
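
To make the suggestion concrete, here is a minimal sketch of how the keyword scan might look with Bio::SeqIO, assuming BioPerl is installed. The subroutine name accessions_matching and the choice of which fields to search (the DEFINITION line, reference titles, and feature qualifier values) are my own assumptions for illustration, not something taken from the question.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::SeqIO;

# Hypothetical helper: return the accession number of every record in
# $file whose definition, reference titles, or feature qualifier values
# contain $keyword (case-insensitive, matched literally).
sub accessions_matching {
    my ($file, $keyword) = @_;
    my $in = Bio::SeqIO->new(-file => $file, -format => 'genbank');
    my @hits;
    while (my $seq = $in->next_seq) {
        # Gather the pieces of text we want to search in this record.
        my @text = ($seq->desc // '');
        for my $ref ($seq->annotation->get_Annotations('reference')) {
            push @text, $ref->title // '';
        }
        for my $feat ($seq->get_SeqFeatures) {
            for my $tag ($feat->get_all_tags) {
                push @text, $feat->get_tag_values($tag);
            }
        }
        push @hits, $seq->accession_number
            if grep { /\Q$keyword\E/i } @text;
    }
    return @hits;
}

# Assumed usage: perl scan.pl all_2.txt some_keyword
if (!caller && @ARGV == 2) {
    print "$_\n" for accessions_matching(@ARGV);
}
```

Because BioPerl joins wrapped lines when it parses a record, multi-line references and qualifiers are searched as a whole, which is exactly the part that is awkward to do by hand.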

ADD REPLY • link 11.2 years ago by Michael Schubert ★ 7.1k

0

The {n} notation says how many times to match the preceding item. In your patterns the only thing before those quantifiers is the ^ anchor, which matches a position rather than a character (it is "zero-length"), hence the errors of that type. If you have some data to share that you're trying to match, I'm sure someone here will assist. In the meantime, try regex101.com, as it's a good place to hone your regex skills.
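
As an illustration of the point about quantifiers, here is a sketch of parse_features with a literal space placed before each {n}, so the quantifier has something to repeat. This assumes the intent was "5 spaces, then a feature key" and "21 spaces, then a qualifier line", which is how the GenBank feature table is laid out; the space may simply have been lost when the example was copied. The sample table is made up for the demonstration.

```perl
use strict;
use warnings;

# Split a FEATURES table into one string per feature.
# A feature starts with exactly 5 spaces followed by its key;
# its continuation lines start with exactly 21 spaces.
sub parse_features {
    my ($features) = @_;
    # In list context, /g returns every captured block, so $& is not needed.
    my @features = $features =~ /(^ {5}\S.*\n(?:^ {21}\S.*\n)*)/gm;
    return @features;
}

# Made-up sample data; ' ' x N keeps the column counts explicit.
my $table = join '',
    (' ' x 5)  . "source" . (' ' x 10) . "1..12\n",
    (' ' x 21) . "/organism=\"Saccharomyces cerevisiae\"\n",
    (' ' x 5)  . "gene" . (' ' x 12) . "1..12\n",
    (' ' x 21) . "/gene=\"KIN1\"\n";

for my $feature (parse_features($table)) {
    my ($name) = $feature =~ /^ {5}(\S+)/;   # note the space before {5}
    print "feature: $name\n";
}
```

Each element returned by parse_features holds the complete feature, continuation lines included, so printing it shows more than the first line.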

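BTW - Avoid using $& as it's costly (on older Perls it slows down every regex match in the program); use ${^MATCH} instead, which needs Perl 5.10 or later and the /p modifier on the match.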
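
A small sketch of the ${^MATCH} form, assuming Perl 5.10 or later. The helper name first_match and the sample line are hypothetical and only serve the demonstration.

```perl
use strict;
use warnings;

# Hypothetical helper: return the text matched by $re in $text,
# or undef if there is no match. The /p modifier asks Perl to keep
# the matched text available in ${^MATCH}.
sub first_match {
    my ($text, $re) = @_;
    return $text =~ /$re/p ? ${^MATCH} : undef;
}

my $line  = "ACCESSION   AB000001\n";
my $match = first_match($line, qr/^ACCESSION\s+\S+/);
print "matched: $match\n" if defined $match;
```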

ADD REPLY • link 11.2 years ago by Kenosis ★ 1.3k

score 1 · Answer 1 · 2013-02-23

1

11.2 years ago

Malachi Griffith 19k

As Michael mentioned, you should consider using BioPerl.

Here is some example code: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

Here is a chapter providing an intro to BioPerl: Perl Programming for Bioinformatics - Chapter 9

Another book chapter, Chapter 10 of Beginning Perl for Bioinformatics, has some relevant example code (although not BioPerl in this case).
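
For comparison with the non-BioPerl route, here is a rough sketch in plain Perl that reads one whole record at a time instead of one line at a time. It is not the book's code; the subroutine name scan_records is made up, and it assumes every record ends with a line containing only // and that a simple search over the raw record text is good enough.

```perl
use strict;
use warnings;

# Hypothetical helper: read GenBank records from the filehandle $fh and
# return the accession of each record whose text contains $keyword.
sub scan_records {
    my ($fh, $keyword) = @_;
    local $/ = "\n//\n";    # read up to the end-of-record line
    my @hits;
    while (my $record = <$fh>) {
        # Skip anything that has no ACCESSION line (e.g. trailing blanks).
        next unless $record =~ /^ACCESSION\s+(\S+)/m;
        my $accession = $1;
        push @hits, $accession if $record =~ /\Q$keyword\E/i;
    }
    return @hits;
}

# Assumed usage: perl scan.pl all_2.txt some_keyword
if (!caller && @ARGV == 2) {
    my ($file, $keyword) = @ARGV;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    print "$_\n" for scan_records($fh, $keyword);
    close $fh;
}
```

Since the whole record is in one string, multi-line REFERENCE and FEATURES sections are searched in full. A keyword phrase that is wrapped across two lines would still be missed, and the search also covers the sequence itself, which is one reason a real parser is the safer choice.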