Question

How To Extract Just 'Cds' From Genbank File Into Another Genbank File?

0

Entering edit mode

11.8 years ago

microbeatic ▴ 80

I have a genbank file with multiple genomes (concatenated using cat). I want to produce a subset of this file with only 'CDS' sequences. It seems very trivial, but i am having trouble with this. Here is the code which succesfully prints each CDS record, but fails to produce a genbank file with those CDS.

from Bio import SeqIO,SeqFeature
import sys

gbank=SeqIO.parse(open(sys.argv[1],"rU"),"genbank"

for genome in gbank:
    print "looking in %s" %genome.id)
    for gene in genome.features:
        if gene.type == 'CDS':
            CDS=gene
            print CDS

output_handle=open("all_CDS.gbk","w")
SeqIO.write(CDS,output_handle,"genbank")    
output_handle.close()

The code prints CDS, but it produces following error at the end.

AttributeError: 'int' object has no attribute 'name'

genbank biopython • 13k views

ADD COMMENT • link updated 11.8 years ago by Peter 6.0k • written 11.8 years ago by microbeatic ▴ 80

score 3 · Answer 1 · 2013-10-08

3

Entering edit mode

11.8 years ago

Peter 6.0k

The error you are getting is because you are not giving SeqIO.write a SeqRecord object, but a single SeqFeature object (the final CDS looked at in the loop).

You need to edit the SeqRecord object to remove the unwanted SeqFeature objects, and then save that, e.g.

from Bio import SeqIO
record = SeqIO.parse("original.gbk","genbank")
record.features = [f for f in record.features if f.type == "CDS"]
SeqIO.write(record, "only_cds.gbk", "genbank")

That should give you a simpler GenBank file where the feature table only contains the CDS features. I'm not sure why you'd do that, but that is what I think you are asking for.

ADD COMMENT • link 11.8 years ago by Peter 6.0k

0

Entering edit mode

Thanks @Peter. It worked. I needed to "slimdown" the genbank file which i was using to extract location information and add to a sequence header. I am pretty novice to biopython and python (a month), so i realize its highly inefficient.

ADD REPLY • link 11.8 years ago by microbeatic ▴ 80