How To Extract Just 'Cds' From Genbank File Into Another Genbank File?
1
0
Entering edit mode
9.2 years ago
microbeatic ▴ 80

I have a genbank file with multiple genomes (concatenated using cat). I want to produce a subset of this file with only 'CDS' sequences. It seems very trivial, but i am having trouble with this. Here is the code which succesfully prints each CDS record, but fails to produce a genbank file with those CDS.

from Bio import SeqIO,SeqFeature
import sys

gbank=SeqIO.parse(open(sys.argv[1],"rU"),"genbank"

for genome in gbank:
print "looking in %s" %genome.id)
for gene in genome.features:
if gene.type == 'CDS':
CDS=gene
print CDS

output_handle=open("all_CDS.gbk","w")
SeqIO.write(CDS,output_handle,"genbank")
output_handle.close()


The code prints CDS, but it produces following error at the end.

AttributeError: 'int' object has no attribute 'name'

genbank biopython • 11k views
3
Entering edit mode
9.2 years ago
Peter 6.0k

The error you are getting is because you are not giving SeqIO.write a SeqRecord object, but a single SeqFeature object (the final CDS looked at in the loop).

You need to edit the SeqRecord object to remove the unwanted SeqFeature objects, and then save that, e.g.

from Bio import SeqIO
record = SeqIO.parse("original.gbk","genbank")
record.features = [f for f in record.features if f.type == "CDS"]
SeqIO.write(record, "only_cds.gbk", "genbank")


That should give you a simpler GenBank file where the feature table only contains the CDS features. I'm not sure why you'd do that, but that is what I think you are asking for.

0
Entering edit mode

Thanks @Peter. It worked. I needed to "slimdown" the genbank file which i was using to extract location information and add to a sequence header. I am pretty novice to biopython and python (a month), so i realize its highly inefficient.