Question

duplicating features in genbank file with biopython

0

9.4 years ago

s.vandenhurk ▴ 10

I have a lot of GenBank files with multiple genes in them. Some of these genes have a single start and stop position, e.g. 1000..1390, and some have multiple start and stop positions, e.g. join(1000..1390,1400..1790,1900..2275).

I want to duplicate the entire CDS for the genes with multiple start and stop positions and insert only 1 start and stop position for every duplicate.

So 1 CDS with 3 starts/stops should become 3 CDS features with 1 start/stop each.

Anyone got a clue on how to achieve this?

biopython genbank • 2.4k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.4 years ago by s.vandenhurk ▴ 10

0

Are you sure those "multiple start/stop" CDS features are not in fact indicating the intron/exon boundaries?

ADD REPLY • link 9.4 years ago by Whetting ★ 1.6k

score 0 · Answer 1 · 2014-12-01

0

9.4 years ago

Peter 6.0k

Unfortunately the answer is you shouldn't be doing this.

As per @Whetting's comment, the question as posed is not meaningful. Coding sequence (CDS) features like join(1000..1390,1400..1790,1900..2275) generally indicate splicing (intron/exon boundaries) or, in some cases, ribosomal slippage.

Each of these regions is not a CDS in itself. It may not be a multiple of three in length, and may not be in-frame. You therefore shouldn't replace this compound CDS feature with several separate "CDS" features.
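As a quick illustration of the length point, here is a small sketch using the example coordinates from the question (taken at face value, they may well be made up). GenBank coordinates are 1-based and inclusive, so each length is end - start + 1.

```python
# Parts of join(1000..1390,1400..1790,1900..2275), 1-based inclusive as in GenBank.
parts = [(1000, 1390), (1400, 1790), (1900, 2275)]

# Length of each part on its own.
lengths = [end - start + 1 for start, end in parts]

for (start, end), length in zip(parts, lengths):
    print(f"{start}..{end}: {length} bp, remainder {length % 3} when divided by 3")

# Length of the joined CDS as a whole.
total = sum(lengths)
print(f"joined CDS: {total} bp, remainder {total % 3} when divided by 3")
```

With these numbers, none of the three parts is a whole number of codons by itself, but the joined CDS is.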

What you might meaningfully do is replace every CDS feature with one or more exon features (each with simple coordinates, i.e. one start/stop), but then it wouldn't be a normal GenBank file any more.

ADD COMMENT • link 9.4 years ago by Peter 6.0k

0

Is there a way to do this with Biopython? And if so, where can I find a guide on how to do it? I don't mind that it wouldn't be a normal GenBank file, because in my case I'm the end user of these files.

ADD REPLY • link 9.4 years ago by s.vandenhurk ▴ 10

0

The CompoundLocation has a list of child locations (its parts), each a simple FeatureLocation object, which you can re-use as the location of a new SeqFeature for each part. See the built-in help (docstrings) for these objects, e.g. http://biopython.org/DIST/docs/api/Bio.SeqFeature.CompoundLocation-class.html
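A minimal sketch of that approach, assuming a reasonably recent Biopython where every location exposes a `parts` list. The helper name `split_compound_features`, its `source_type`/`new_type` parameters, and the demo record are made up for illustration; the demo builds a record in memory instead of reading a real file.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def split_compound_features(record, source_type="CDS", new_type="exon"):
    """Return a feature list where each multi-part `source_type` feature
    is replaced by one simple feature per part (hypothetical helper)."""
    new_features = []
    for feature in record.features:
        parts = feature.location.parts  # simple locations have a single part
        if feature.type == source_type and len(parts) > 1:
            # Copy the qualifiers, but drop the translation: it describes
            # the whole joined CDS, not any individual part.
            qualifiers = {
                key: value
                for key, value in feature.qualifiers.items()
                if key != "translation"
            }
            for part in parts:
                new_features.append(
                    SeqFeature(part, type=new_type, qualifiers=dict(qualifiers))
                )
        else:
            new_features.append(feature)
    return new_features


# Demo record mimicking join(1000..1390,1400..1790,1900..2275).
# Biopython uses 0-based, end-exclusive coordinates, hence 999 and so on.
record = SeqRecord(Seq("A" * 2500), id="demo")
compound = CompoundLocation(
    [
        FeatureLocation(999, 1390, strand=1),
        FeatureLocation(1399, 1790, strand=1),
        FeatureLocation(1899, 2275, strand=1),
    ]
)
record.features.append(SeqFeature(compound, type="CDS", qualifiers={"gene": ["demoA"]}))
record.features.append(
    SeqFeature(FeatureLocation(2299, 2400, strand=1), type="CDS", qualifiers={"gene": ["demoB"]})
)

record.features = split_compound_features(record)
for feature in record.features:
    print(feature.type, feature.location)
```

For real data, the same function could be applied to each record from `SeqIO.parse(..., "genbank")` and the result written back with `SeqIO.write(..., "genbank")`. Passing `new_type="CDS"` would give the duplicated CDS features asked for in the question, with the caveat above that the individual parts are not real coding sequences.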

ADD REPLY • link 9.4 years ago by Peter 6.0k