Question: BioPerl retrieves joined sequence incorrectly
gravatar for fr
6.3 years ago by
fr150 wrote:


I am parsing a Genebank flat file from NCBI (the .gbff.gz found in here). To do this I'm using Bio::SeqIO and I am finding a problem in the very first CDS which looks like this

source          1..4480937
                     /organism="Alteromonas macleodii str. 'Deep ecotype'"
                     /mol_type="genomic DNA"
                     /strain="Deep ecotype"
gene            join(4480899..4480937,1..1626)
CDS             join(4480899..4480937,1..1626)
                     /inference="EXISTENCE: similar to AA
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /product="chromosomal replication initiator protein DnaA"


However, when I look at the nucleotide sequence that is outputed by Bio::SeqIO, it is outputting ~4.5 million nucleotides whereas if you look at the highlighted area it clearly shows that the coding sequence has ~1600ish nucleotides. What am I doing wrong? 

Below you can find a snippet of the code that I am using. The reason why I'm using

next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));

is that I am only interested in getting protein sequences and their corresponding gene sequences.

Thank you for all help!


use strict;
use warnings;
use autodie;
use Data::Dump;
use File::Basename;
use Bio::SeqIO;
use Data::Dumper;

[$gbffpath and other non-relevant stuff hidden]

opendir(DIR, "$gbffpath");
my @files= grep {/\.gbff$/ } readdir(DIR);

my $k=1;
my ($i,$j);
foreach my $f (@files) {
    my $in = Bio::SeqIO -> new( -file => "<$gbffpath\\$f", -format => 'genbank');

    while (my $seq = $in->next_seq) {
        for my $feat ($seq -> get_SeqFeatures) {
            next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));
            if ($feat->primary_tag eq "CDS") {#use to print DNA

              print $geneout "\>";
              # construct fasta header based on information present for each CDS:
              print $geneout $feat->get_tag_values('locus_tag'),"\|" if ( $feat->has_tag('locus_tag') );    
              print $geneout $feat->get_tag_values('db_xref') ,"\|" if ( $feat->has_tag('db_xref') );    
              print $geneout $feat->get_tag_values('protein_id'),"\|" if ( $feat->has_tag('protein_id') );    
              print $geneout $feat->get_tag_values('gene'),"\|" if ( $feat->has_tag('gene') ); 
              print $geneout $feat->get_tag_values('product'),"\|" if ( $feat->has_tag('product') );
              print $protout $feat->get_tag_values('note'), if ($feat->has_tag('note') );
              # print sequence:
              print $geneout "\n",$feat->seq->seq,"\n";

    if (($i ne $j) or ($i==0) or ($j==0)){
      print "\t\t$f\t$i CDS\t$j prots\t\n" ;
      print "\t\t\>\>\>$f\t$i CDS\t$j prots\t\n";
      print "Extracted $i\/$j CDS\/proteins from file $f ($k/".($#files+1).")\n";
cds gene bioperl genome perl • 2.5k views
ADD COMMENTlink modified 6.3 years ago by _r_am32k • written 6.3 years ago by fr150

It looks like the entire segment (1..4480937) is being printed. Hmmm, interesting!

ADD REPLYlink written 6.3 years ago by _r_am32k

this is not interesting! it should work as I thought! Why can't computers do what I think? Why dear lord whyy?
(on topic, any suggestions? I'm *very* new to using this kind of things in Perl :( )

ADD REPLYlink written 6.3 years ago by fr150

The part where you go $feat->seq->seq looks kinda fishy to me, given that the feature has no seq component in the file format.


ADD REPLYlink modified 6.3 years ago • written 6.3 years ago by _r_am32k

Indeed! I confess that I am using an example that someone else gave me and I raised that question too. Nevertheless, I looked in the documentation of Bio::SeqIO in CPAN and they indicate that $feat->seq is the way to get a sequence. Do you have any suggestion?

ADD REPLYlink written 6.3 years ago by fr150

I think I can guess why. They might just be getting the MIN(ranges) and MAX(ranges) and going MIN -> MAX. In your case, unfortunately, that goes 1->END. Go for the spliced_seq() and join them.


ADD REPLYlink written 6.3 years ago by _r_am32k

Ah, nice it makes sense! I'll give it a try in a bit and will come back to let you know what I found! Thank you so much!

ADD REPLYlink written 6.3 years ago by fr150

Example: join(100..120, 130..140) might become 100..140 with their algo

join(END-10..END, 1..10) on the other hand, becomes 1..END

Again, this is just a guess, but try the spliced_seq(). It should definitely work.

ADD REPLYlink written 6.3 years ago by _r_am32k
gravatar for _r_am
6.3 years ago by
Baylor College of Medicine, Houston, TX
_r_am32k wrote:

Given how the seq() function is supposed to start from the beginning of the first range and go to the end of the last, you have an edge case here where your sequence fragments are at the ends. The method sees your 1..1626 and 4480899..4480937 and goes 1..4480937

I think you'd be better off using spliced_seq() and joining the result from that call.


ADD COMMENTlink written 6.3 years ago by _r_am32k

Using $feat->spliced_seq->seq instead of $feat->spliced_seq->seq solved the problem! Thanks!

ADD REPLYlink written 6.3 years ago by fr150

Great. Glad I could help. Istvan's solution is better, though. It comes from experience.

ADD REPLYlink written 6.3 years ago by _r_am32k
gravatar for Istvan Albert
6.3 years ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

Personally I suspect that the problem is that it lists the large values first (since it is a circular genome) instead of a natural, sorted order that most genomes would have.

You can use the readseq tool to perform the same task

readseq -format=fasta -feat=gene 

In general I would urge a lot of caution when whipping up small perl scripts to these tasks,  GeneBank files can be deceivingly complicated with the joins and complements.

ADD COMMENTlink modified 6.3 years ago • written 6.3 years ago by Istvan Albert ♦♦ 86k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1128 users visited in the last hour