I have downloaded complete mitochondrial genomes for one species from genbank and have written them to a file called myseqs.gbk (in genbank format - like a flat file). I now want to parse these genomes and extract only the sequence data relating to the cox3 gene, and save them as fasta files. In my first ever experience with biopython I tried to adapt the script kindly posted on this site last year (by microbeatic), but I am getting the error:
Traceback (most recent call last):
File "parser2.py", line 21, in <module>
print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end])
NameError: name 'start' is not defined
I am sure my mistakes are obvious to the regulars on this site and I would be very thankful for some advice. The script is below. I am guessing that I am not using the correct qualifiers and thereby filtering everything out.
Thank you very much.
from Bio import SeqIO, SeqFeature import sys gbank=SeqIO.parse('myseqs.gbk','genbank') for genome in gbank: for gene in genome.features: if(gene.type =="CDS"): if 'gene' in gene.qualifiers: if 'cox3' in gene.qualifiers['gene'].lower(): start=gene.location.start.position end=gene.location.end.position if 'db_xref' in gene.qualifiers: gi= gi=str(gene.qualifiers['db_xref']) gi=gi.split(":") gi=gi.split("'") print ">GeneId|%s|COX3|%s\n%s" % (gi,genome.description,genome.seq[start:end]) else: print ">GeneId|NoGenID|COX3|%s\n%s" % (genome.description,genome.seq[start:end])