Print gene locations from a .gbk file
2
1
Entering edit mode
8.9 years ago
kxd419 ▴ 10

Hello,

I am trying extract certain information from a gbk file I can extract the locus tag and the amino acid sequence however I am struggling to extract the gene location as it not in the same format in the file e.g.: /locus_tag="NCTC86_00002"

This is my script so far:

from Bio import GenBank
from Bio import SeqIO

gbk_filename = "HS.gb"
faa_filename = "HS_converted.faa"

input_handle = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank"):
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print "Done"
gbk python gene • 5.1k views
ADD COMMENT
0
Entering edit mode

So also if I wanted to print the gene annotation, not every CDS entry contains a /gene=""

Do I need to put in a if there is no /gene="" clause?

ADD REPLY
3
Entering edit mode
8.9 years ago
mgalactus ▴ 770

Hi,

the SeqFeature objects have a "location" attribute that contains the start/stop position of the feature.

from Bio import SeqIO

gbk_filename = "HS.gb"
faa_filename = "HS_converted.faa"

output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(gbk_filename, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS":
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))
            print('Start: %d, Stop: %d, Strand: %d'%(int(seq_feature.location.start),
                                                     int(seq_feature.location.end),
                                                     seq_feature.strand))

output_handle.close()
print "Done"

Hope this helps

ADD COMMENT
0
Entering edit mode

Hi mgalactus

Thank you so much for your help. Im brand new to python and am trying to learn my best.

When I run that script I get this error:

Traceback (most recent call last):

  File "gbkaafasta.py", line 18, in <module>
    print('Start: %d, Stop: %d, Strand: %d'%(int(seq_feature.locations.start),
AttributeError: 'SeqFeature' object has no attribute 'locations'

Thank you so much for your help

ADD REPLY
0
Entering edit mode

My fault, it should have been "location" and not "locations" (I've updated the reply)

ADD REPLY
0
Entering edit mode

You are fantastic! It works perfectly. Thank you so much

ADD REPLY
1
Entering edit mode
6.5 years ago
nhaituan ▴ 10

Thank you very much, here my modified codes:

from Bio import SeqIO
gbk_filename = "N16961_Ch1.gbk"
faa_filename = "N16961_Ch1.faa"
output_handle = open(faa_filename, "w")
for seq_record in SeqIO.parse(gbk_filename, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS":
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s Start: %s Stop: %s Strand: %s From %s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_feature.location.start,
                   seq_feature.location.end,
                   seq_feature.strand,
                   seq_record.name))

output_handle.close()
print "Done"
ADD COMMENT

Login before adding your answer.

Traffic: 1805 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6