Parsing List Of Sequences In Embl Format
8.3 years ago
Pappu

I have downloaded a large set (~450 MB) of sequences in EMBL format from EBI. Now I want to build a Python dictionary that maps each sequence's identifier to its length.

from Bio import SeqIO
length = {}
handle = open('seq1.embl', 'r')
for record in SeqIO.parse(handle, "embl"):
    length[record.id[:-2]] = len(record.seq)  # drop the last two characters of the ID (e.g. a ".1" version suffix)
print(length)


However, this gives the following error message:

Traceback (most recent call last):
  File "length.py", line 4, in <module>
    for record in SeqIO.parse(handle, "embl"):
  File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 405, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 558, in parse_footer
    or self.line.strip() == '//', repr(self.line)
AssertionError: 'XX'


This code works perfectly on another, smaller set of sequences in EMBL format. I tried Biopython 1.60 and 1.60+; both give the same error.

Pappu, it seems that there is at least one erroneous record in your huge file. I would print out each record ID while iterating over the sequences. This will give you a clue as to where in the file Biopython crashes. Then open the file in your text editor and locate the record that causes the error.
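
A rough sketch of that idea, in case it helps: instead of printing every ID, remember the last record that parsed cleanly and report it when the parser raises. The helper name find_failing_record is made up for illustration; it only assumes that each record has an .id attribute, as Biopython's SeqRecord objects do.

```python
def find_failing_record(records):
    """Walk an iterator of records and report where parsing stops.

    Returns (last_good_id, n_parsed, error). error is None if the
    iterator was exhausted without raising.
    """
    last_good_id = None
    n_parsed = 0
    try:
        for record in records:
            last_good_id = record.id
            n_parsed += 1
    except (AssertionError, ValueError) as err:
        # The parser gave up on the record that follows last_good_id.
        return last_good_id, n_parsed, err
    return last_good_id, n_parsed, None


# Possible usage with Biopython (not run here):
#   from Bio import SeqIO
#   last_id, n, err = find_failing_record(SeqIO.parse("seq1.embl", "embl"))
#   print(last_id, n, repr(err))
```

The problematic record should be the one that comes right after last_good_id in the file, so searching for that ID in a text editor gets you close to it.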

Thanks. For example, I get the same error when I try to parse this EMBL file in Biopython: http://www.ebi.ac.uk/ena/data/view/CM000771&display=txt&expanded=true

8.3 years ago

I think the problem is with the CO lines in the file. The Biopython EMBL parser expects the SQ and CO lines to come at the end of each record. Since your CO line appears before all of your FT lines, the parser could not find the record terminator (//) where it expected one; it found an XX line instead and threw an error.

Thanks a lot. Removing the CO lines does the job.
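
For anyone hitting the same thing, here is one way the CO lines could be dropped before parsing. It is a minimal sketch, not an official workaround: strip_co_lines is a made-up helper name, and "seq1_noco.embl" is just an example output file. Note that it throws away the contig construction information stored in the CO lines, which is fine if you only need IDs and lengths.

```python
def strip_co_lines(lines):
    """Yield every line except EMBL CO lines (line code 'CO')."""
    for line in lines:
        if not line.startswith("CO "):
            yield line


# Possible usage: write a cleaned copy, then parse that copy instead.
#   with open("seq1.embl") as src, open("seq1_noco.embl", "w") as dst:
#       dst.writelines(strip_co_lines(src))
#   # then: SeqIO.parse("seq1_noco.embl", "embl")
```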