Parsing Genbank To Fasta Format In Biopython For Metagenomic Classification
1
0
Entering edit mode
12.1 years ago
Josh Herr 5.8k

Please forgive the newbie question, but I am indeed new to BioPython. I'm just simply trying to parse a large file in Genbank format to FASTA format and am using Bio.SeqIO in BioPython.

I'm looking to parse an output file with the Accession number and Taxon in the FASTA > header and then the Genbank Taxonomy instead of the nucleotide sequence. I am comfortable with parsing just the fasta title and sequence. What I am doing is constructing a file to train RDP classifier for a Eukaryote marker gene (one does not already exist for my marker).

The output I am looking for is:

X62988Emericellanidulans
Eukaryota; Fungi; Dikarya; Ascomycota; Pezizomycotina; Eurotiomycetes; Eurotiomycetidae; Eurotiales; Trichocomaceae; Emericella; Emericella nidulans.

or:

573145
Bacteria; Proteobacteria; Gammaproteobacteria; Enterobacteriales; Enterobacteriaceae; Escherichia; Escherichia sp.

This is what I have used for simple nucleotide parsing.

from Bio import SeqIO

gbk_filename = "genbank.gbk"
faa_filename = "fasta.fna"

input_handle  = open(gbk_input, "r")
output_handle = open(faa_output, "w")

for seq_record in SeqIO.parse(input_handle, "genbank") :
    print "Parsing GenBank record %s" % seq_record.id
    output_handle.write(">%s %s\n%s\n" % (
           seq_record.id,
           seq_record.description,
           seq_record.seq.tostring()))

output_handle.close()
input_handle.close()
print "Completed"

I know this is probably a simple fix, but I've searched for a long while and can't find an output in SeqIO for the taxonomy string, does anyone have any recommendations for modifying the above script?

Thanks so much for helping me out... I'm pretty new to the BioPython parsing here. My best to you all.

biopython parsing metagenomics fasta • 5.7k views
ADD COMMENT
4
Entering edit mode
12.1 years ago

The annotations method returns a dictionary that holds the various attributes in the genbank file. This:

seq_record.annotations['taxonomy']

will return an array of the taxonomy line. To get it in the format you want, you can

'; '.join(seq_record.annotations['taxonomy'])
ADD COMMENT

Login before adding your answer.

Traffic: 2630 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6