Question: Extracting exons from CDS CompoundLocations of a genbank file
5.6 years ago by
lhirsch0 wrote:

Hi all,

Using Biopython, I'm dealing with a GenBank file in which CDS is the only annotated feature type. In order to extract the exon sequences across the whole genome, I'm trying to get their start and end positions from each feature's location attribute, but I can't seem to understand how CompoundLocation works.

For example:

 CompoundLocation([FeatureLocation(ExactPosition(368), ExactPosition(378), strand=1), FeatureLocation(ExactPosition(712), ExactPosition(1170), strand=1)], 'join')

Using record.features.location.[start|end].position, I only get the start position of the first exon (368) and the end position of the last exon (1170).

 Apparently the GenBank class has a function called _split_compound_loc() , but it only takes a list of the positions as an argument, which is exactly what I need in the first place.

 Is there a way to overcome these difficulties without having to parse the file manually? 

Many thanks

5.5 years ago by
Scotland, UK
Peter5.8k wrote:

Python methods starting with a single underscore are by convention private, and you are best off avoiding them.

If you want the CDS sequence (as explained in the documentation), use the .extract(...) method of the SeqFeature (or location object).
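As a minimal sketch of the .extract(...) approach, here is a toy example. The 16-base sequence and the coordinates are made up for illustration and are not taken from the file in the question:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# Toy parent sequence (hypothetical, stands in for record.seq).
genome = Seq("AAAACCCCGGGGTTTT")

# A two-exon CDS on the forward strand: [2:6] joined with [10:14].
location = CompoundLocation(
    [FeatureLocation(2, 6, strand=1), FeatureLocation(10, 14, strand=1)]
)
feature = SeqFeature(location, type="CDS")

# extract() slices every part out of the parent and joins them in order.
cds_seq = feature.extract(genome)
print(cds_seq)  # AACCGGTT
```

With a parsed record you would pass record.seq as the parent sequence instead of the toy Seq.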

If you want the individual exons from a CDS feature, then they would be the individual parts of the CompoundLocation, accessed via its .parts attribute (which is a list). For your example, this would be a list containing only FeatureLocation(ExactPosition(368), ExactPosition(378), strand=1) and FeatureLocation(ExactPosition(712), ExactPosition(1170), strand=1).
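A rough sketch of walking the parts, assuming a record whose only features are CDS entries. The helper name cds_exons and the toy record are hypothetical, made up for illustration:

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def cds_exons(record):
    """Yield (start, end, strand, sequence) for each part of each CDS feature.

    Hypothetical helper, not part of Biopython.
    """
    for feature in record.features:
        if feature.type != "CDS":
            continue
        # .parts is a list of simple locations; a single-exon CDS
        # has a plain location whose .parts holds just itself.
        for part in feature.location.parts:
            # part.extract() returns the reverse complement for strand -1.
            yield int(part.start), int(part.end), part.strand, part.extract(record.seq)


# Toy record (made-up sequence and coordinates).
location = CompoundLocation(
    [FeatureLocation(2, 6, strand=1), FeatureLocation(10, 14, strand=1)]
)
record = SeqRecord(
    Seq("AAAACCCCGGGGTTTT"),
    id="toy",
    features=[SeqFeature(location, type="CDS")],
)

exons = [(start, end, strand, str(seq)) for start, end, strand, seq in cds_exons(record)]
print(exons)  # [(2, 6, 1, 'AACC'), (10, 14, 1, 'GGTT')]
```

For a real file, the same helper could be applied to each record from SeqIO.parse(filename, "genbank"). Coordinates are zero-based and end-exclusive, like Python slices.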

I suggest reading the docstrings, either directly within Python using the help(...) command, on GitHub, or here:

