Question: Parsing A Genbank Record With Bioperl
0
gravatar for robjohn7000
6.1 years ago by
robjohn700080
United Kingdom
robjohn700080 wrote:

Hi,

I would be happy if someone could help me get the following code to output the sequences (including the ids) of all records in the input file (http://biopython.org/DIST/docs/tutorial/examples/ls_orchid.gbk):

use Bio::SeqIO;
use Bio::Seq;
use Bio::DB::GenBank;

$seq_obj = Bio::SeqIO->new(-file => "ls_orchid.gbk" , '-format' => 'genbank');
$seqio_obj2 = Bio::SeqIO->new(-file => '>sequence.fasta', -format => 'genbank' );

while ( my $seq = $seq_obj->next_seq() ) {
print "Sequence ",$seq->id ($seq_obj)"\n";
#print $seq_obj->seq,"\n";
$seqio_obj2->write_seq($seq_obj);
}

I would like to see an output consisting of purely sequences (preferably with headers) as follows:

seq1:
ATTCGCTGCATGACAT..........
Seq2:
ACTGCGATGGATGGAT..

Thank you

ADD COMMENTlink modified 6.1 years ago by Kenosis1.2k • written 6.1 years ago by robjohn700080
2
gravatar for Kenosis
6.1 years ago by
Kenosis1.2k
Kenosis1.2k wrote:

You were very close:

use strict;
use warnings;
use Bio::SeqIO;

my $seq_obj = Bio::SeqIO->new( -file => "ls_orchid.gbk", '-format' => 'genbank' );

while ( my $seq = $seq_obj->next_seq() ) {
    print 'ID/Desc: ', $seq->id, ' ', $seq->desc, "\n";
    print 'Seq: ', $seq->seq, "\n";
}

Sample output using the data you've linked to:

ID/Desc: Z78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Seq: CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAACAGAATATATGATCGAGTGAATCTGGAGGACCTGTGGTAACTCAGCTCGTCGTGGCACTGCTTTTGTCGTGACCCTGCTTTGTTGTTGGGCCTCCTCAAGAGCTTTCATGGCAGGTTTGAACTTTAGTACGGTGCAGTTTGCGCCAAGTCATATAAAGCATCACTGATGAATGACATTATTGTCAGAAAAAATCAGAGGGGCAGTATGCTACTGAGCATGCCAGTGAATTTTTATGACTCTCGCAACGGATATCTTGGCTCTAACATCGATGAAGAACGCAGCTAAATGCGATAAGTGGTGTGAATTGCAGAATCCCGTGAACCATCGAGTCTTTGAACGCAAGTTGCGCTCGAGGCCATCAGGCTAAGGGCACGCCTGCCTGGGCGTCGTGTGTTGCGTCTCTCCTACCAATGCTTGCTTGGCATATCGCTAAGCTGGCATTATACGGATGTGAATGATTGGCCCCTTGTGCCTAGGTGCGGTGGGTCTAAGGATTGTTGCTTTGATGGGTAGGAATGTGGCACGAGGTGGAGAATGCTAACAGTCATAAGGCTGCTATTTGAATCCCCCATGTTGTTGTATTTTTTCGAACCTACACAAGAACCTAATTGAACCCCAATGGAGCTAAAATAACCATTGGGCAGTTGATTTCCATTCAGATGCGACCCCAGGTCAGGCGGGGCCACCCGCTGAGTTGAGGC

Hope this helps!

ADD COMMENTlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k

That worked brilliantly! But I ran into a problem when I used the code to a multi-record file that I concatenated by myself. The output contained only the ID/Desc and Seq: lines. I'm not sure the best way to upload the file, but the error message pointed to every line with '\\' in the file (e.g., line 853 below): ID/Desc: NZ_ACVQ01000001 Campylobacter showae RM3277 contig00122, whole genome shotgun sequence. Use of uninitialized value in print at script.pl line 10, <gen0> line 853. Seq:

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by robjohn700080

Am glad this worked for you. Am not sure I understand you correctly, but did you attempt to parse a file (that you created) with the above script and it threw an error?

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by Kenosis1.2k

Yes, It threw an error with my new file and poited to the EOL (\\). I simply used cat function in linux to concatenate the genbanlk files. Thanks.

Here's the link to my file (org.rar): http://wikisend.com/download/642792/org.rar

ADD REPLYlink modified 6.1 years ago • written 6.1 years ago by robjohn700080

Hi Kenosis, this is the first script in perl that I have tried that is successful in extracting info from genbank files. I am having a hard time understanding the tutorials from bioperl seqio. Do you know how to extract the references (really I just need a title) in a multi-file genbank file? Thank you!

 

ADD REPLYlink written 4.4 years ago by kbrann30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1660 users visited in the last hour