Extracting All CDS From An EMBL File
10.2 years ago
Pappu ★ 2.1k

I am trying to extract the DNA sequences corresponding to all CDS features using Biopython. However, the CDS location seems to be formatted differently in each EMBL file, which makes it difficult to parse, e.g.

join{[373615:374161](+), [0:174](+)}

[118940:>119261](-)

[<13907:>13991](+)

join{[426644:426858](+), [0:617](-)}

join{[5947..6076](+), [0..399](+)}

etc.

So I am wondering if there is any tool available for this purpose. Thank you.

biopython • 4.3k views

The INSDC member databases (EMBL-EBI EMBL-Bank, NCBI GenBank and DDBJ) all use the same feature format, which is described in The DDBJ/EMBL/GenBank Feature Table Definition. See section "3.4.3 Location examples" for a set of examples illustrating the various possibilities for the feature location.

10.2 years ago
Peter 6.0k

Biopython creates a SeqFeature object for each feature, including each CDS, with a location object that may be compound (the location has already been parsed for you!). The SeqFeature provides an .extract(...) method for precisely this task: getting the sequence that the feature describes. For examples, see:

Alternatively, there is the built-in help for the SeqFeature object.
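As a rough illustration of the .extract(...) approach, here is a minimal sketch. It builds a toy record in memory rather than reading a real EMBL file, so the sequence, the feature coordinates and the helper name extract_cds_sequences are all made up for the example.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord


def extract_cds_sequences(record):
    """Return the nucleotide sequence of every CDS feature in a record."""
    # extract_cds_sequences is a hypothetical helper, not part of Biopython.
    return [
        str(feature.extract(record.seq))
        for feature in record.features
        if feature.type == "CDS"
    ]


# Toy record standing in for one parsed EMBL entry (hypothetical data).
toy_record = SeqRecord(
    Seq("AAACCCGGGTTTACG"),
    id="toy",
    features=[
        # A two-part join on the forward strand: [0:3](+) and [9:12](+).
        SeqFeature(
            CompoundLocation(
                [
                    FeatureLocation(0, 3, strand=+1),
                    FeatureLocation(9, 12, strand=+1),
                ]
            ),
            type="CDS",
        ),
        # A simple reverse-strand location: [12:15](-).
        SeqFeature(FeatureLocation(12, 15, strand=-1), type="CDS"),
        # A non-CDS feature, which the helper skips.
        SeqFeature(FeatureLocation(3, 9, strand=+1), type="gene"),
    ],
)

print(extract_cds_sequences(toy_record))  # ['AAATTT', 'CGT']
```

For a real file, the same helper could be applied to each record returned by SeqIO.parse("my_file.embl", "embl") from Bio.SeqIO, where the filename is a placeholder. Joined and reverse-strand locations need no special handling, since .extract(...) concatenates the parts and reverse-complements where required.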


Thanks. I had just coded it myself, which took a few hours. The extract() method looks cool.

10.2 years ago
hpmcwill ★ 1.2k

While not part of Biopython, this may be a useful alternative: EMBOSS provides the extractfeat program, which extracts sequence data from a database entry based on a specific feature type (e.g. CDS).
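For anyone who wants to drive extractfeat from a script, the following is a hedged sketch that only assembles the command line and runs it if the program is found on the PATH. The helper names and file paths are hypothetical. The options used (-sequence, -type, -join, -outseq, -auto) are given as I understand the extractfeat documentation, so check them against your installed EMBOSS version.

```python
import shutil
import subprocess


def build_extractfeat_command(embl_path, fasta_path, feature_type="CDS"):
    """Build an argument list for EMBOSS extractfeat (does not run it)."""
    return [
        "extractfeat",
        "-sequence", embl_path,   # input entry or file
        "-type", feature_type,    # feature type to extract, e.g. CDS
        "-join",                  # return multi-part features as one sequence
        "-outseq", fasta_path,    # output sequence file
        "-auto",                  # suppress interactive prompts
    ]


def run_extractfeat(embl_path, fasta_path):
    """Run extractfeat if it is installed; return True if it was run."""
    if shutil.which("extractfeat") is None:
        return False  # EMBOSS is not available on this machine
    subprocess.run(build_extractfeat_command(embl_path, fasta_path), check=True)
    return True
```

Calling run_extractfeat("my_file.embl", "cds.fasta") with placeholder paths would write the extracted CDS sequences to cds.fasta when EMBOSS is installed, and simply return False otherwise.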


Biopython has EMBOSS bindings.
