Question: How to write FULL genbank files with Biopython SeqIO module?
0
gravatar for russianconcussion
3.4 years ago by
United States
russianconcussion20 wrote:

Hi,

I am trying to split up the Synechococcus genbank files from NCBI Genbank into separate genbank files for each genome.  Some of the genomes have several genbank files because they are draft assemblies.  So, I import the SeqIO library from Bio, parse the conglomerated genbank files, put them into a dictionary of lists with their gb.name as the key, then iterate through the dictionary with SeqIO.write to write the genbank files.  However, the output does not include the fasta nucleotide information that is usually in a FULL genbank file, just the features.  I've looked through the BioPython Tutorial and previous questions on BioStars, but I am still wondering what am I doing wrong.  Any help?

Original data: https://www.dropbox.com/s/kfwothubsrh5vtc/synechococcus.gb?dl=0

from collections import defaultdict

from Bio import SeqIO

gbDict = defaultdict(list)

gbs = SeqIO.parse("synechococcus.gb", "genbank")

for gb in gbs:                                
    gbDict[gb.name].append(gb)

for gb in gbDict:                             
    SeqIO.write(gbDict[gb], ".".join([gb,"gb"]), "genbank")
genbank biopython • 2.1k views
ADD COMMENTlink modified 3.3 years ago by Peter5.8k • written 3.4 years ago by russianconcussion20
1
gravatar for Peter
3.3 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

The problem is your input data - some of the records simply don't have the sequence you want, just the features. Looking at your GenBank file, some of your records do have sequences in it - but there are others which do not, e.g. ``NZ_LFEK01000000`` only has a ``WGS`` line.

Depending on how you get the data from the NCBI, they will either give you full GenBank files (with sequence) or summary records without the sequence, but instead a ``CONTIG`` or ``WGS`` line saying how the sequence is constructed from other records.

Also, I would rewrite your code to avoid loading everything into RAM in a dictionary, and also to avoid the slight data loss from from reading and writing the records. Instead, use the Biopython indexing to pull out the raw records:

from Bio import SeqIO
d = SeqIO.index("synechococcus.gb", "genbank")
print("You have %i sequences" % len(d))
for key in d:
    with open(key + ".gb", "wb") as handle:
        handle.write(d.get_raw(key))
print("Done")

This will name the files using record.id rather than record.name which you may or may not prefer.

ADD COMMENTlink written 3.3 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1060 users visited in the last hour