Question: Creating Genbank files with Python using BioGenBankRecord
0
gravatar for toth.joe
7 months ago by
toth.joe20
toth.joe20 wrote:

I want to create my own Genbank files for import of sequence and annotations into a commercial DNA Software package.

I have a working version using Bio.SeqRecord but the SeqRecord format is limiting (I can't specify organism and some other attributes).

I am looking at Bio.GenBank.Record as a better implementation but I don't know enough python create a script using the source code Bio.GenBank.Record

Can someone point me in the right direction to get this test script to work?

#!/usr/bin/python

from Bio.GenBank.Record import Record
from Bio.Alphabet import IUPAC
from Bio.SeqFeature import SeqFeature, FeatureLocation
from Bio.SeqIO import InsdcIO

# Create a container with the data
container = Record(
    locus_line="SCU49845",
    size="5028 bp",
    residue_type="PROTEIN",
    data_file_division="PLN",
    accession="U49845",
    keywords="test",
    source="Saccharomyces cerevisiae (baker's yeast)",
    organism="Saccharomyces cerevisiae",
    features=['gene','687..3158','gene="AXL2"'],
    sequence="SSIYNGISTSGLDLNNGTIADMRQLGIVESYKLKRAVVSSASEAAEVLLRVDNIIRARPRTANRQHM"
   )

# Save as GenBank file
output_file = open('test.gb', 'w')
InsdcIO.write_record(container, output_file, 'genbank')
genbank biopython python • 621 views
ADD COMMENTlink modified 6 months ago • written 7 months ago by toth.joe20

You can specify the organism the SeqRecord for output in GenBank format via Bio.SeqIO.write(...) as the SOURCE and ORGANISM lines. You would need to populate seq_record.annotations['source'], and this is the approach I would recommend here. Relevant code: https://github.com/biopython/biopython/blob/biopython-173/Bio/SeqIO/InsdcIO.py#L920

ADD REPLYlink written 6 months ago by Peter5.8k
1
gravatar for toth.joe
6 months ago by
toth.joe20
toth.joe20 wrote:

Thanks for the information. I realized that I had a problem with the Genbank record format.

Here is the code that works:

from Bio.GenBank.Record import Record

 def main():  
/# open csv read entries row by row then insert in a dictionary  
    with open(args.sampleFile,'r') as f:  
        csvf=csv.DictReader(f)        
        for row in csvf:  
/# Create a container with the data using the dictionary  
            container = Record()  
            container.locus = row['Sample']  
            container.size = len(row['Seq'])  
            container.residue_type="PROTEIN"  
            container.data_file_division="PRI"  
            container.date = (datetime.date.today().strftime("%d-%b-%Y")) # today's date  
            container.definition = row['FullCloneName']  
            container.accession = [row['Vgene']]  
            container.comment = 'project xyz'  
            container.version = getpass.getuser()  
            container.keywords = [row['ProjectName']]  
/# Save as GenBank file  
            with open(row['ProjectName'] + '_' + row['FullCloneName'] + '.gb', 'w') as output_file:  
                output_file.write(str(container))
ADD COMMENTlink written 6 months ago by toth.joe20

You should be able to use Record(...) with arguments, but may of the terms in your first attempt are not available as arguments to the class initialiser:

https://github.com/biopython/biopython/blob/biopython-173/Bio/GenBank/Record.py#L164

ADD REPLYlink written 6 months ago by Peter5.8k
0
gravatar for Joe
6 months ago by
Joe14k
United Kingdom
Joe14k wrote:

I'm not super handy with the Genbank submodule as I've never tried to make one from scratch, but I'm fairly sure what you tried to do above doesn't work like you think it does.

A Genbank.Record describes the whole file (I think). What you're building there is the header of the genbank file, not a sequence record/feature. If you try to create that Record you would be thrown an error from Record.__init__() that there is no keyword argument for features. I'm pretty sure this is the case, because it wouldnt make much sense for every sequence feature in the genbank to also accept a data_file_division/locus_line etc.

If you read a genbank file in to BioPython via SeqIO and inspect its contents, you'll see all the sequence information is the .features attribute, and they are all objects of the form:

SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29330), strand=1), type='source'),
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='gene'), 
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='CDS')

Therefore, I'm pretty sure you'll need to populate your Record() with SeqFeature() objects, which in turn contain FeatureLocation objects and associated metadata/methods.

In my toy example, the data structure looks something a little like this, within the genbank Record(), the attributes for actual entries are as follows (not an exhaustive example).

genbank.features[0] -> this is the source object, i.e. the full genbank sequence:

>>> repr(gbk.features[0])
"SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(29330), strand=1), type='source')"

Next up are the first 'real' features:

>>> repr(gbk.features[1])
"SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='gene')"
>>> gbk.features[1].qualifiers
"OrderedDict([('locus_tag', ['pPAU_00002'])])"

A 'gene' doesn't have much in the way of information (in my example at least), just a locus tag and location information. The cds, however, has much more interesting stuff:

>>> gbk.features[2]
SeqFeature(FeatureLocation(ExactPosition(0), ExactPosition(1065), strand=1), type='CDS')
>>> gbk.features[2].qualifiers
OrderedDict([('gene', ['repB']), ('locus_tag', ['pPAU_00001']), ('inference', ['ab initio prediction:Prodigal:2.60', 'similar to AA sequence:UniProtKB:Q57154']), ('codon_start', ['1']), ('transl_table', ['11']), ('product', ['RepFIB replication protein A']), ('translation', ['MIENFNENNDMSDMFWEVEKGTGEVINLVPNTSNTVQPVVLMRLGLFVPTLKSTKRGHQGEMSSMDATAELRQLAIVKTEGYENIHITGARLDMDNDFKTWVGIIHSFAKHKVIGDAVTLSFVDFIKLCGIPSSRSSKRLRERLGASLRRIATNTLSFSSQNKSYHTHLVQSAYYDMVKDTVTIQADPKIFELYQFDRKVLLQLRAINELGRKESAQALYTYIESLPPSPAPISLARLRARLNLRSRVTTQNAIVRKAMEQLKGIGYLDYTEIKRGSSVYFIVHARRPKLKALKSSKSSFKRKKETQEESILTELTREELELLEIIRAEKIIKVTRNHRRKKQTLLTFAEDESQ'])])

Long story short, you need to build your file up from progressively smaller objects in order for the final genbank record to have all the information it needs, but you're incorrectly conflating a GenBank.Record() and a sequence SeqRecord/SeqFeature. To my knowledge, this will be pretty tedious - I don't know of a particularly easy way of doing this.

ADD COMMENTlink written 6 months ago by Joe14k
1

No, the SeqFeature class goes with the SeqRecord class and these are intended to be file format neutral as used in Bio.SeqIO, but the design was heavily influenced by GenBank files. The GenBank Record class uses its own GenBank specific Feature objects etc all defined in https://github.com/biopython/biopython/blob/master/Bio/GenBank/Record.py

ADD REPLYlink written 6 months ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 931 users visited in the last hour