Hello!
I am parsing a Genebank flat file from NCBI (the .gbff.gz found in here). To do this I'm using Bio::SeqIO and I am finding a problem in the very first CDS which looks like this
source 1..4480937
/organism="Alteromonas macleodii str. 'Deep ecotype'"
/mol_type="genomic DNA"
/strain="Deep ecotype"
/db_xref="taxon:314275"
gene join(4480899..4480937,1..1626)
/locus_tag="MADE_000001020675"
CDS join(4480899..4480937,1..1626)
/locus_tag="MADE_000001020675"
/inference="EXISTENCE: similar to AA
sequence:RefSeq:YP_006797084.1"
/note="Derived by automated computational analysis using
gene prediction method: Protein Homology."
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator protein DnaA"
/protein_id="AGV54147.1"
/db_xref="GI:543176245"
/translation="MSLWN...TLSS"
However, when I look at the nucleotide sequence that is outputed by Bio::SeqIO, it is outputting ~4.5 million nucleotides whereas if you look at the highlighted area it clearly shows that the coding sequence has ~1600ish nucleotides. What am I doing wrong?
Below you can find a snippet of the code that I am using. The reason why I'm using
next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation"));
is that I am only interested in getting protein sequences and their corresponding gene sequences.
Thank you for all help!
#!/usr/local/bin/perl use strict; use warnings; use autodie; use Data::Dump; use File::Basename; use Bio::SeqIO; use Data::Dumper; [$gbffpath and other non-relevant stuff hidden] opendir(DIR, "$gbffpath"); my @files= grep {/\.gbff$/ } readdir(DIR); closedir(DIR); my $k=1; my ($i,$j); foreach my $f (@files) { $i=0;$j=0; my $in = Bio::SeqIO -> new( -file => "<$gbffpath\\$f", -format => 'genbank'); while (my $seq = $in->next_seq) { for my $feat ($seq -> get_SeqFeatures) { next unless (($feat->primary_tag eq "CDS") and $feat->has_tag("translation")); if ($feat->primary_tag eq "CDS") {#use to print DNA print $geneout "\>"; # construct fasta header based on information present for each CDS: print $geneout $feat->get_tag_values('locus_tag'),"\|" if ( $feat->has_tag('locus_tag') ); print $geneout $feat->get_tag_values('db_xref') ,"\|" if ( $feat->has_tag('db_xref') ); print $geneout $feat->get_tag_values('protein_id'),"\|" if ( $feat->has_tag('protein_id') ); print $geneout $feat->get_tag_values('gene'),"\|" if ( $feat->has_tag('gene') ); print $geneout $feat->get_tag_values('product'),"\|" if ( $feat->has_tag('product') ); print $protout $feat->get_tag_values('note'), if ($feat->has_tag('note') ); # print sequence: print $geneout "\n",$feat->seq->seq,"\n"; $i++; } } } } if (($i ne $j) or ($i==0) or ($j==0)){ print "\t\t$f\t$i CDS\t$j prots\t\n" ; print "\t\t\>\>\>$f\t$i CDS\t$j prots\t\n"; } else{ print "Extracted $i\/$j CDS\/proteins from file $f ($k/".($#files+1).")\n"; } $k++; }
It looks like the entire segment (1..4480937) is being printed. Hmmm, interesting!
this is not interesting! it should work as I thought! Why can't computers do what I think? Why dear lord whyy?
(on topic, any suggestions? I'm *very* new to using this kind of things in Perl :( )
The part where you go
$feat->seq->seq
looks kinda fishy to me, given that the feature has noseq
component in the file format.Indeed! I confess that I am using an example that someone else gave me and I raised that question too. Nevertheless, I looked in the documentation of Bio::SeqIO in CPAN and they indicate that
$feat->seq
is the way to get a sequence. Do you have any suggestion?I think I can guess why. They might just be getting the MIN(ranges) and MAX(ranges) and going MIN -> MAX. In your case, unfortunately, that goes 1->END. Go for the spliced_seq() and join them.
Source: http://www.bioperl.org/wiki/HOWTO:Feature-Annotation
Ah, nice it makes sense! I'll give it a try in a bit and will come back to let you know what I found! Thank you so much!
Example:
join(100..120, 130..140)
might become100..140
with their algojoin(END-10..END, 1..10)
on the other hand, becomes1..END
Again, this is just a guess, but try the
spliced_seq()
. It should definitely work.