Question: How Can I Save Bioperl Sequence Nested Features In Genbank Or Embl Format?
4
gravatar for Ryan Thompson
4.0 years ago by
Ryan Thompson2.4k
TSRI, La Jolla, CA
Ryan Thompson2.4k wrote:

In BioPerl, a sequence object can have any number of features, and each of these can have subfeatures nested within them. For example, a feature may be a complete coding sequence of a gene, and its subfeatures might be individual exons that are concatenated to form the full coding sequence.

However, when I use BioPerl to write a sequence object to a file in genbank or embl format, only the top-level features are written to the file, not the sub-features nested within the top-level features. How can I store my subfeatures in sequence files? Should I just convert all my subfeatures into top-level features, and then reconstruct the tree structure next time I read in the sequence?

Example code:

#!/usr/bin/perl

use Bio::SeqIO;
use Bio::SeqFeature::Generic;

# Read in a sequence
my $sequence = Bio::SeqIO->newFh(
    -file => $ARGV[0],
)->getline();

# Add the main feature
my $main_feature = Bio::SeqFeature::Generic->new(
    -start => 1,
    -end => 100,
    -primary_tag => 'main_feature',
);
$sequence->add_SeqFeature($main_feature);

# Add some subfeatures
for my $i (1..10) {
    my $subfeature = Bio::SeqFeature::Generic->new(
        -start => ($i * 10 - 9),
        -end => ($i * 10),
        -primary_tag => "Sub-feature $i",
    );
    $main_feature->add_SeqFeature($subfeature);
}

# Write the sequence to the output file. Subfeatures are not saved.
Bio::SeqIO->new(
    -fh => *STDOUT{IO},
    -format => "genbank",
)->write_seq($sequence) or die "FAIL";

Run the example code on any sequence file. It will print a genbank sequence record to STDOUT. This record will contain "main_feature" but not any of the ten "Sub-feature $i". If you use something like Data::Dump to dump the perl structure corresponding to $sequence, you will see that it does contain the sub features. It simply does not output them.

Edit: I just realised that there's another aspect to my question. I'm trying to write two scripts: one to analyze the sequences and add the subfeatures, and a second script to read the sequences and report on the subfeatures. So an additional requirement is that I can easily read the resulting sequences back into BioPerl including the subfeatures.

ADD COMMENTlink modified 4.0 years ago by Chris Fields490 • written 4.0 years ago by Ryan Thompson2.4k

Creating a RichSeq object from scratch can be a little tricky. You're probably writing you subfeatures in a limited scope. But, without a piece of code (or the modules you're using) a correct answer will be difficult.

ADD REPLYlink written 4.0 years ago by Jarretinha2.9k

Ok, I'll post some example code.

ADD REPLYlink written 4.0 years ago by Ryan Thompson2.4k

What are all the features you are trying to print ? Can you please paste the full code to pastebin and provide a link ? I wont be able to test the code in its current format.

ADD REPLYlink written 4.0 years ago by Khader Shameer14k

I posted a full snippet of example code. You just need an arbitrary sequence file to run it on.

ADD REPLYlink written 4.0 years ago by Ryan Thompson2.4k

Another question: are you trying to convert between file formats or do you really need to annotate raw sequences. For the first option, SeqIO do the whole job (subfeatures included). For the second case, you will need my complete non-trivial answer.

ADD REPLYlink written 4.0 years ago by Jarretinha2.9k

I'm trying to find a sequence format to use as intermediate data storage between an annotator script and a reporter script.

ADD REPLYlink written 4.0 years ago by Ryan Thompson2.4k
4
gravatar for Neilfws
4.0 years ago by
Neilfws41k
Sydney, Australia
Neilfws41k wrote:

I think the problem may be that whilst Bioperl has a concept of features + subfeatures, a GenBank file does not: anything with a location is simply a feature, regardless of whether it lies within another feature. Sub-features are created by Bioperl when "rich" sequences are read and only for specific tags (such as gene/CDS/exon).

In your code, if you change the loop which generates the "subfeatures" so as the last line is:

$sequence->add_SeqFeature($subfeature);

i.e. subfeatures are added to the sequence object, not the main feature object, then the subfeatures will appear in the output (as features).

Where a feature in a GenBank file contains several lines - those are annotations, not features.

ADD COMMENTlink written 4.0 years ago by Neilfws41k

Ok, so is there a sequence format the does support subfeatures? Should I forget SeqIO and just use Data::Dump?

ADD REPLYlink written 4.0 years ago by Ryan Thompson2.4k

Also, this is the workaround I'm using right now.

ADD REPLYlink written 4.0 years ago by Ryan Thompson2.4k

I guess GFF is the closest to a representation of sub-features, with its "parent" attribute. There's a gff_string method in Bio::SeqFeature, so you'd just add "parent" + the ID of the main feature as a tag of the sub-feature.

ADD REPLYlink written 4.0 years ago by Neilfws41k

asciitree, chadoxml e gff do support nested references/features. There is no parser for asciitree, chadoxml is well supported in bioperl. Biopython can read/write gff (up to gff3 I guess) and bioperl can write gff. Chadoxml seems to have a bug in somes *nix distributions. Tried asciitree and worked perfectly with your code.

ADD REPLYlink written 4.0 years ago by Jarretinha2.9k

I'll return to the flatten/unflatten problem. I've generated fake genbank records some time ago from simulated sequences.

ADD REPLYlink written 4.0 years ago by Jarretinha2.9k

feel free to accept this as best answer then :-)

ADD REPLYlink written 4.0 years ago by Neilfws41k

Yeah. I did. I ended up implementing a custom "unflatten" subroutine in the second script.

ADD REPLYlink written 4.0 years ago by Ryan Thompson2.4k
2
gravatar for Chris Fields
4.0 years ago by
Chris Fields490
University of Illinois Urbana-Champaign
Chris Fields490 wrote:

Neil is correct, GenBank/EMBL do not layer features (e.g. everything is top-level), and BioPerl makes no assumptions based on that: WYSIWYG. There are a few scripts bp_genbank2gff3.pl) that help along these lines, in that it helps convert to a format that allows hierarchal features, gene > mRNA > CDS/exon/UTR and so on. The caveat is there is enough variability between records that finding a consistent way to do this is still difficult.

ADD COMMENTlink written 4.0 years ago by Chris Fields490
1
gravatar for Khader Shameer
4.0 years ago by
Rochester, MN
Khader Shameer14k wrote:

AFAIK you have to define the subfeatures explicitly in the script to generate the genbank or embl format. Have you referred to the Feature-Annotation HOWTO. If you can show the code, it will be better.

ADD COMMENTlink written 4.0 years ago by Khader Shameer14k

What I mean is, I have a script that reads in some sequences, annotates them with features and subfeatures, and then writes the sequences to an output file. But when I do that, the subfeatures are lost.

ADD REPLYlink written 4.0 years ago by Ryan Thompson2.4k
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 623 users visited in the last hour