Question: Failure to extract CDS from embl file using BioPython
11 months ago by
rjqmantaring0 wrote:

I'm pretty new to BioPython and I'm trying to use it to extract all of the CDS features from a .embl file. This is my code:


from Bio import SeqIO

# Print the protein ID and nucleotide sequence of every CDS feature
for rec in SeqIO.parse("file.embl", "embl"):
    if rec.features:
        for feature in rec.features:
            if feature.type == "CDS":
                print(feature.qualifiers["protein_id"])
                print(feature.location.extract(rec).seq)

When I run my code I get the following error:

Traceback (most recent call last):
File "", line 5, in <module>
 record ="file.embl", "embl")
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/", line 720, in read
 first = next(iterator)
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/", line 655, in parse
 for r in i:
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 489, in parse_records
 record = self.parse(handle, do_features)
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 473, in parse
 if self.feed(handle, consumer, do_features):
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 440, in feed
 self._feed_first_line(consumer, self.line)
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 661, in _feed_first_line
 raise ValueError('Did not recognise the ID line layout:\n' + line)
ValueError: Did not recognise the ID line layout:
ID                   file ; ; ; ; ; 29902 BP.

I can't seem to find any relevant documentation or forum post on that specific error message. Can anyone help me figure out what's going on?

Thanks in advance.


Is this the first line of your file?

ID                   file ; ; ; ; ; 29902 BP.

Extracting more features from EMBL files with Biopython

Problem With Parsing Genome File - Embl Format - With Biopython

modified 11 months ago • written 11 months ago by Fatima930

Yes. It's an EMBL file that was generated by transferring annotations from a GenBank file to an unannotated FASTA.

written 11 months ago by rjqmantaring0

The parser expects the ID line to contain 2, 3, or 6 semicolons; your header has 5, so it matches none of the layouts the parser recognises.

Here is the part of the script that generates the error:

def _feed_first_line(self, consumer, line):
    assert line[: self.HEADER_WIDTH].rstrip() == "ID"
    if line[self.HEADER_WIDTH :].count(";") == 6:
        # Looks like the semi colon separated style introduced in 2006
        self._feed_first_line_new(consumer, line)
    elif line[self.HEADER_WIDTH :].count(";") == 3:
        if line.rstrip().endswith(" SQ"):
            # EMBL-bank patent data
            self._feed_first_line_patents(consumer, line)
        else:
            # Looks like the pre 2006 style
            self._feed_first_line_old(consumer, line)
    elif line[self.HEADER_WIDTH :].count(";") == 2:
        # Looks like KIPO patent data
        self._feed_first_line_patents_kipo(consumer, line)
    else:
        raise ValueError("Did not recognise the ID line layout:\n" + line)
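
One possible workaround, as a rough sketch: since the check above takes the first branch when the ID line has 6 semicolons, you could pad the header with one more empty field and parse the patched copy instead. The helper names `pad_embl_id_line` and `fix_embl_file` and the output filename `file_fixed.embl` are made up for illustration, and I'm assuming the missing field can simply be left empty. Whether the parser is then happy with the empty fields may depend on your Biopython version, so treat this as a starting point rather than a guaranteed fix.

```python
def pad_embl_id_line(line, expected=6):
    """Pad an EMBL ID line with empty fields until it has `expected` semicolons.

    Lines that are not ID lines, or that already have enough semicolons,
    are returned unchanged.
    """
    if not line.startswith("ID ") or ";" not in line:
        return line
    missing = expected - line.count(";")
    if missing <= 0:
        return line
    # Insert the empty fields just before the last field (the sequence length)
    head, sep, tail = line.rpartition(";")
    return head + sep + " ;" * missing + tail


def fix_embl_file(src, dst):
    """Copy `src` to `dst`, padding any ID line along the way."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            fout.write(pad_embl_id_line(line))


# Hypothetical usage:
# fix_embl_file("file.embl", "file_fixed.embl")
```

After that, the original loop could be pointed at the patched copy, e.g. SeqIO.parse("file_fixed.embl", "embl"). Ideally the empty fields would be filled with real values (topology, molecule type and so on) if you know them.
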
written 11 months ago by Fatima930

Thanks, I'll look into this.

written 11 months ago by rjqmantaring0