get protein sequences from genbank
2
1
Entering edit mode
5.7 years ago
aproposdiver ▴ 10

I have this script below and it seems to be hard coded for files in the working directory. I basically know nothing about python but I want to make this script so it can read in a many genbank files instead of just be coded for one file at a time.

from Bio import GenBank
 
gbk_filename = "NC_005213.gbk"
faa_filename = "NC_005213_converted.faa"
 
input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")
 
for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))
 
output_handle.close()
input_handle.close()
print "Done"

 

 

 

biopython genbank • 6.4k views
ADD COMMENT
0
Entering edit mode

i am not familiar with scripts but instead i think you can retrieve Genbank Protein GI (from http://biodbnet.abcc.ncifcrf.gov/db/db2db.php) then in Batch Entrez you can retrieve the bulk protein sequence

ADD REPLY
0
Entering edit mode

Yes but the main function I am looking for is to get the sequences parsed into separate files based on the different gb accession for the overall contig.
 

ADD REPLY
0
Entering edit mode

don't worry friend, your problem will be solved soon...

ADD REPLY
3
Entering edit mode
5.7 years ago

This will convert any file specified at the command line.

​"""
Usage: converter.py foo.gbk

Produces a converted file: foo_converted.faa
"""
import sys, os
from Bio import GenBank, SeqIO

gbk_filename = sys.argv[1]
root_name = os.path.splitext(gbk_filename)[0]
faa_filename = root_name + "_converted.faa"

input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print "Done"
ADD COMMENT
0
Entering edit mode

Thanks for this. I tried it but got the error:

line 3
SyntaxError: Non-ASCII character '\xe2' in file /home/kpenn/Python.scripts.dir/Split.Genbank.to.faa.std.input.py on line 3, but no encoding declared; see http://www.python.org/peps/pep-0263.html for details

 

 

This is the script:

#!/usr/bin/python                                                                                                                                                                                                                                                       

# """                                                                                                                                                                                                                                                                   
Usage: converter.py foo.gbk

Produces a converted file: foo_converted.faa
#"""                                                                                                                                                                                                                                                                    
import sys, os
from Bio import GenBank, SeqIO

gbk_filename = sys.argv[1]
root_name = os.path.splitext(gbk_filename)[0]
faa_filename = root_name + "_converted.faa"

input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Dealing with GenBank record %s" % seq_record.id
    for seq_feature in seq_record.features :
        if seq_feature.type=="CDS" :
            assert len(seq_feature.qualifiers['translation'])==1
            output_handle.write(">%s from %s\n%s\n" % (
                   seq_feature.qualifiers['locus_tag'][0],
                   seq_record.name,
                   seq_feature.qualifiers['translation'][0]))

output_handle.close()
input_handle.close()
print "Done"

 

 

 

 

 

ADD REPLY
0
Entering edit mode

you must have added invalid characters to the script when you've edited it - remove those

ADD REPLY
0
Entering edit mode

By the way: Be aware of pseudogenes, those (most probably?) won't have a translation. And this would trigger the assert. But you can avoid them by: if 'pseudo' not in seq_feature.qualifiers:

ADD REPLY
0
Entering edit mode

Do you have the corrected script @mschmid?

ADD REPLY
0
Entering edit mode
5.5 years ago
aproposdiver ▴ 10

I cant seem to figure out how to modify this script to extract just the coding sequence in DNA format.  I am pretty sure I need to use extract but have not been able to modify this properly to get that.

 


 

ADD COMMENT
0
Entering edit mode

If you are willing to consider non-Python options, you can use e-utilities. I have answered a similar question to get amino acid sequences in FASTA format, given a chromosome accession.

To get protein coding, nucleotide sequences, you need to set the 'rettype' option to 'fasta_cds_na'

For chromosome accession NC_005213:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=NC_005213&rettype=fasta_cds_na&retmode=text

ADD REPLY

Login before adding your answer.

Traffic: 2388 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6