Question: Failure to extract CDS from embl file using BioPython
8 months ago by
rjqmantaring0 wrote:

I'm pretty new to BioPython and I'm trying to use it to extract all of the CDS features from a .embl file. This is my code:


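from Bio import SeqIO

# Walk every record in the EMBL file and print each CDS feature's
# protein ID followed by its nucleotide sequence.
for rec in SeqIO.parse("file.embl", "embl"):
    if rec.features:
        for feature in rec.features:
            if feature.type == "CDS":
                print(feature.qualifiers["protein_id"])
                print(feature.location.extract(rec).seq)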

When I run my code I get the following error:

Traceback (most recent call last):
File "", line 5, in <module>
 record = SeqIO.read("file.embl", "embl")
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/", line 720, in read
 first = next(iterator)
File "/usr/lib/python2.7/dist-packages/Bio/SeqIO/", line 655, in parse
 for r in i:
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 489, in parse_records
 record = self.parse(handle, do_features)
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 473, in parse
 if self.feed(handle, consumer, do_features):
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 440, in feed
 self._feed_first_line(consumer, self.line)
File "/usr/lib/python2.7/dist-packages/Bio/GenBank/", line 661, in _feed_first_line
 raise ValueError('Did not recognise the ID line layout:\n' + line)
ValueError: Did not recognise the ID line layout:
ID                   file ; ; ; ; ; 29902 BP.

I can't seem to find any relevant documentation or forum post on that specific error message. Can anyone help me figure out what's going on?

Thanks in advance.


Is this the first line of your file?

ID                   file ; ; ; ; ; 29902 BP.

Extracting more features from EMBL files with Biopython

Problem With Parsing Genome File - Embl Format - With Biopython

written 8 months ago by Fatima830

Yes. It's an EMBL file that was generated by transferring annotations from a GenBank file to an unannotated FASTA.

written 8 months ago by rjqmantaring0

The ID line should contain 2, 3, or 6 semicolons after the ID prefix. Your header has 5, so it matches none of the layouts the parser recognises.
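A quick way to check an ID line before handing the file to Biopython is to count the semicolons yourself. The sketch below is only an illustration: `classify_id_line` is a hypothetical helper, not part of Biopython, and the prefix width of 5 is an assumption about what the EMBL scanner uses. The second test line is the style of example shown in the EMBL user manual.

```python
HEADER_WIDTH = 5  # assumption: width of the "ID" line-type prefix plus padding


def classify_id_line(line):
    """Return (semicolon count, layout name or None) for an EMBL ID line."""
    if line[:HEADER_WIDTH].rstrip() != "ID":
        raise ValueError("not an ID line: %r" % line)
    count = line[HEADER_WIDTH:].count(";")
    # Only these three counts are accepted by the check discussed in this thread.
    layouts = {
        6: "new style (2006 onwards)",
        3: "pre-2006 or patent style",
        2: "KIPO patent style",
    }
    return count, layouts.get(count)


bad = "ID                   file ; ; ; ; ; 29902 BP."
good = "ID   X56734; SV 1; linear; mRNA; STD; PLN; 1859 BP."

print(classify_id_line(bad))
print(classify_id_line(good))
```

A layout of `None` means the parser would most likely reject the line with the same ValueError as in the question.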

Here is the part of Biopython's EMBL scanner (the _feed_first_line method in Bio/GenBank) that raises the error:

def _feed_first_line(self, consumer, line):
    assert line[: self.HEADER_WIDTH].rstrip() == "ID"
    if line[self.HEADER_WIDTH :].count(";") == 6:
        # Looks like the semi colon separated style introduced in 2006
        self._feed_first_line_new(consumer, line)
    elif line[self.HEADER_WIDTH :].count(";") == 3:
        if line.rstrip().endswith(" SQ"):
            # EMBL-bank patent data
            self._feed_first_line_patents(consumer, line)
        else:
            # Looks like the pre 2006 style
            self._feed_first_line_old(consumer, line)
    elif line[self.HEADER_WIDTH :].count(";") == 2:
        # Looks like KIKO patent data
        self._feed_first_line_patents_kipo(consumer, line)
    else:
        # Any other semicolon count (such as 5) ends up here
        raise ValueError("Did not recognise the ID line layout:\n" + line)
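One possible workaround, given the check above, is to rewrite the ID line so that it has six semicolons before parsing. The following is a rough sketch rather than a tested fix: `pad_id_line` and `fix_embl_file` are hypothetical helpers, and it is an assumption that Biopython will accept blank fields in the new-style layout. You may need to fill in real values (version, topology, molecule type, and so on) instead.

```python
def pad_id_line(line, header_width=5, wanted_fields=7):
    """Pad a semicolon-separated ID line with blank fields up to wanted_fields."""
    body = line[header_width:].strip()
    fields = [field.strip() for field in body.split(";")]
    if len(fields) > wanted_fields:
        raise ValueError("too many fields in ID line: %r" % line)
    # Keep the name (first field) and the length (last field) where they are,
    # and insert the blank fields just before the length.
    missing = wanted_fields - len(fields)
    fields = fields[:-1] + [""] * missing + fields[-1:]
    return "ID".ljust(header_width) + "; ".join(fields)


def fix_embl_file(src, dst, header_width=5):
    """Copy src to dst, padding every ID line on the way."""
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            if line[:header_width].rstrip() == "ID":
                line = pad_id_line(line.rstrip("\n"), header_width) + "\n"
            fout.write(line)


bad = "ID                   file ; ; ; ; ; 29902 BP."
print(pad_id_line(bad))
```

After calling something like `fix_embl_file("file.embl", "file_fixed.embl")`, the patched copy can be passed to `SeqIO.parse` in place of the original. If the parser then fails further down the file, the tool that wrote the EMBL output is probably the better place to fix the problem.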
written 8 months ago by Fatima830

Thanks, I'll try looking into this.

written 8 months ago by rjqmantaring0