Question: duplicating features in genbank file with biopython
s.vandenhurk (Netherlands) wrote, 6.0 years ago:

I have a lot of GenBank files with multiple genes in them. Some of these genes have a single start and stop position, e.g. 1000..1390, and some have multiple start and stop positions, e.g. join(1000..1390,1400..1790,1900..2275).

I want to duplicate the entire CDS for each gene with multiple start and stop positions, and give every duplicate only one start and stop position.

So 1 CDS with 3 starts/stops should become 3 CDS features with 1 start/stop each.

Anyone got a clue on how to achieve this?


Are you sure those "multiple start stop CDS" are not in fact indicating the intron/exon boundaries?

written 6.0 years ago by Whetting
Peter (Scotland, UK) wrote, 6.0 years ago:

Unfortunately the answer is you shouldn't be doing this.

As per @Whetting's comment, this is a meaningless question. Coding sequence (CDS) features like join(1000..1390,1400..1790,1900..2275) generally indicate splicing (intron/exon boundaries) or, in some cases, ribosomal slippage.

Each of these regions is not a CDS in itself. It may not be a multiple of three in length, and may not be in-frame. You therefore shouldn't replace this complex CDS feature with several simple "CDS" features.
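To make the frame argument concrete, here is a quick check on the example location from the question. It is only a sketch: the coordinates are the illustrative ones quoted above, not taken from a real record, and the variable names are made up for this example.

```python
# Segments of join(1000..1390,1400..1790,1900..2275).
# GenBank coordinates are 1-based and inclusive, so length = end - start + 1.
segments = [(1000, 1390), (1400, 1790), (1900, 2275)]

lengths = [end - start + 1 for start, end in segments]
for (start, end), length in zip(segments, lengths):
    print(f"{start}..{end}: {length} bp, remainder {length % 3} when divided by 3")

# Only the joined sequence has to be a whole number of codons.
total = sum(lengths)
print(f"joined: {total} bp, remainder {total % 3}")
```

None of the three pieces is a multiple of three on its own, but the joined length is, which is why only the full join can be translated as a CDS.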

What you might meaningfully do is replace every CDS feature with one or more exon features (each with simple coordinates, i.e. one start/stop), but then it wouldn't be a normal GenBank file any more.


Is there a way to do this with Biopython? If so, where can I find a guide on how to do it? I don't mind that it wouldn't be a normal GenBank file, because in my case I'm the end user of these files.

written 6.0 years ago by s.vandenhurk

The CompoundLocation has a list of child locations, which are simple FeatureLocation objects. You should re-use each of these as the location of a new SeqFeature, one per part. See the built-in help (docstrings) for these objects, e.g. http://biopython.org/DIST/docs/api/Bio.SeqFeature.CompoundLocation-class.html

written 6.0 years ago by Peter
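A minimal sketch of that suggestion, assuming a reasonably recent Biopython. The helper name split_compound_features and the "exon" feature type are choices made for this example, not part of Biopython, and the CDS is built by hand from the coordinates in the question rather than read from a file.

```python
from Bio.SeqFeature import SeqFeature, FeatureLocation, CompoundLocation


def split_compound_features(features, new_type="exon"):
    """Return a new list where each multi-part CDS becomes one feature per part.

    Hypothetical helper for illustration; features with a single-part
    location and non-CDS features are passed through unchanged.
    """
    result = []
    for feature in features:
        parts = feature.location.parts  # one entry for a simple location
        if feature.type != "CDS" or len(parts) == 1:
            result.append(feature)
            continue
        for part in parts:
            # Re-use the child location and copy the qualifiers so that
            # each new feature can be edited independently.
            result.append(
                SeqFeature(part, type=new_type, qualifiers=dict(feature.qualifiers))
            )
    return result


# join(1000..1390,1400..1790,1900..2275) from the question.
# Biopython stores 0-based, end-exclusive positions, hence 999 and not 1000.
cds = SeqFeature(
    CompoundLocation(
        [
            FeatureLocation(999, 1390, strand=1),
            FeatureLocation(1399, 1790, strand=1),
            FeatureLocation(1899, 2275, strand=1),
        ]
    ),
    type="CDS",
    qualifiers={"gene": ["example"]},
)

exons = split_compound_features([cds])
for exon in exons:
    print(exon.type, exon.location)
```

For real files you would presumably loop over the records from Bio.SeqIO.parse(..., "genbank"), assign record.features = split_compound_features(record.features), and write the result back with Bio.SeqIO.write. Copied qualifiers such as a translation describe the whole CDS, not each piece, so you may want to drop them from the new features.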