Question: BioPerl retrieves joined sequence incorrectly
0
gravatar for rf
4.3 years ago by
rf100
rf100 wrote:

Hello!

I am parsing a Genebank flat file from NCBI (the .gbff.gz found in here). To do this I'm using Bio::SeqIO and I am finding a problem in the very first CDS which looks like this

source          1..4480937
                     /organism="Alteromonas macleodii str. 'Deep ecotype'"
                     /mol_type="genomic DNA"
                     /strain="Deep ecotype"
                     /db_xref="taxon:314275"
gene            join(4480899..4480937,1..1626)
                     /locus_tag="MADE_000001020675"
CDS             join(4480899..4480937,1..1626)
                     /locus_tag="MADE_000001020675"
                     /inference="EXISTENCE: similar to AA
                     sequence:RefSeq:YP_006797084.1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: Protein Homology."
                     /codon_start=1
                     /transl_table=11
                     /product="chromosomal replication initiator protein DnaA"
                     /protein_id="AGV54147.1"
                     /db_xref="GI:543176245"
                     /translation="MSLWN...TLSS"

 

However, when I look at the nucleotide sequence that is outputed by Bio::SeqIO, it is outputting ~4.5 million nucleotides whereas if you look at the highlighted area it clearly shows that the coding sequence has ~1600ish nucleotides. What am I doing wrong? 

Below you can find a snippet of the code that I am using. The reason why I'm using

next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));

is that I am only interested in getting protein sequences and their corresponding gene sequences.

Thank you for all help!

 

#!/usr/local/bin/perl
use strict;
use warnings;
use autodie;
use Data::Dump;
use File::Basename;
use Bio::SeqIO;
use Data::Dumper;

[$gbffpath and other non-relevant stuff hidden]

opendir(DIR, "$gbffpath");
my @files= grep {/\.gbff$/ } readdir(DIR);
closedir(DIR);

my $k=1;
my ($i,$j);
foreach my $f (@files) {
    $i=0;$j=0;
    my $in = Bio::SeqIO -> new( -file => "<$gbffpath\\$f", -format => 'genbank');

    while (my $seq = $in->next_seq) {
        for my $feat ($seq -> get_SeqFeatures) {
            next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));
            if ($feat->primary_tag eq "CDS") {#use to print DNA

              print $geneout "\>";
              # construct fasta header based on information present for each CDS:
              print $geneout $feat->get_tag_values('locus_tag'),"\|" if ( $feat->has_tag('locus_tag') );    
              print $geneout $feat->get_tag_values('db_xref') ,"\|" if ( $feat->has_tag('db_xref') );    
              print $geneout $feat->get_tag_values('protein_id'),"\|" if ( $feat->has_tag('protein_id') );    
              print $geneout $feat->get_tag_values('gene'),"\|" if ( $feat->has_tag('gene') ); 
              print $geneout $feat->get_tag_values('product'),"\|" if ( $feat->has_tag('product') );
              print $protout $feat->get_tag_values('note'), if ($feat->has_tag('note') );
               
              # print sequence:
              print $geneout "\n",$feat->seq->seq,"\n";
              $i++;
              }
            }
        }
    }

    if (($i ne $j) or ($i==0) or ($j==0)){
      print "\t\t$f\t$i CDS\t$j prots\t\n" ;
      print "\t\t\>\>\>$f\t$i CDS\t$j prots\t\n";
    }
    else{
      print "Extracted $i\/$j CDS\/proteins from file $f ($k/".($#files+1).")\n";
    }
    $k++;
}
cds gene bioperl genome perl • 1.9k views
ADD COMMENTlink modified 4.3 years ago by RamRS20k • written 4.3 years ago by rf100

It looks like the entire segment (1..4480937) is being printed. Hmmm, interesting!

ADD REPLYlink written 4.3 years ago by RamRS20k

this is not interesting! it should work as I thought! Why can't computers do what I think? Why dear lord whyy?
(on topic, any suggestions? I'm *very* new to using this kind of things in Perl :( )

ADD REPLYlink written 4.3 years ago by rf100

The part where you go $feat->seq->seq looks kinda fishy to me, given that the feature has no seq component in the file format.

 

ADD REPLYlink modified 4.3 years ago • written 4.3 years ago by RamRS20k

Indeed! I confess that I am using an example that someone else gave me and I raised that question too. Nevertheless, I looked in the documentation of Bio::SeqIO in CPAN and they indicate that $feat->seq is the way to get a sequence. Do you have any suggestion?

ADD REPLYlink written 4.3 years ago by rf100
1

I think I can guess why. They might just be getting the MIN(ranges) and MAX(ranges) and going MIN -> MAX. In your case, unfortunately, that goes 1->END. Go for the spliced_seq() and join them.

Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

ADD REPLYlink written 4.3 years ago by RamRS20k

Ah, nice it makes sense! I'll give it a try in a bit and will come back to let you know what I found! Thank you so much!

ADD REPLYlink written 4.3 years ago by rf100
1

Example: join(100..120, 130..140) might become 100..140 with their algo

join(END-10..END, 1..10) on the other hand, becomes 1..END

Again, this is just a guess, but try the spliced_seq(). It should definitely work.

ADD REPLYlink written 4.3 years ago by RamRS20k
3
gravatar for RamRS
4.3 years ago by
RamRS20k
Houston, TX
RamRS20k wrote:

Given how the seq() function is supposed to start from the beginning of the first range and go to the end of the last, you have an edge case here where your sequence fragments are at the ends. The method sees your 1..1626 and 4480899..4480937 and goes 1..4480937

I think you'd be better off using spliced_seq() and joining the result from that call.

Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation

ADD COMMENTlink written 4.3 years ago by RamRS20k

Using $feat->spliced_seq->seq instead of $feat->spliced_seq->seq solved the problem! Thanks!

ADD REPLYlink written 4.3 years ago by rf100

Great. Glad I could help. Istvan's solution is better, though. It comes from experience.

ADD REPLYlink written 4.3 years ago by RamRS20k
1
gravatar for Istvan Albert
4.3 years ago by
Istvan Albert ♦♦ 79k
University Park, USA
Istvan Albert ♦♦ 79k wrote:

Personally I suspect that the problem is that it lists the large values first (since it is a circular genome) instead of a natural, sorted order that most genomes would have.

You can use the readseq tool to perform the same task

readseq -format=fasta -feat=gene G.gb 

In general I would urge a lot of caution when whipping up small perl scripts to these tasks,  GeneBank files can be deceivingly complicated with the joins and complements.

ADD COMMENTlink modified 4.3 years ago • written 4.3 years ago by Istvan Albert ♦♦ 79k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 613 users visited in the last hour