Parsing a List of Sequences in EMBL Format
8.3 years ago
Pappu ★ 1.9k

I have downloaded a big file (~450 MB) of sequences in EMBL format from EBI. Now I want to build a Python dictionary that maps each sequence identifier to the length of that sequence.

from Bio import SeqIO
length = {}
with open('seq1.embl') as handle:
    for record in SeqIO.parse(handle, "embl"):
        # record.id[:-2] drops the last two characters (a version suffix such as ".1")
        length[record.id[:-2]] = len(record.seq)
print(length)

However, this gives the following error message:

Traceback (most recent call last):
  File "length.py", line 4, in <module>
    for record in SeqIO.parse(handle, "embl"):
  File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 405, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 558, in parse_footer
    or self.line.strip() == '//', repr(self.line)
AssertionError: 'XX'

This code works perfectly for another, smaller list of sequences in EMBL format. I tried Biopython 1.60 and 1.60+; both give the same error.

python • 2.9k views

Pappu, it seems that there is at least one erroneous record in your huge file. I would print out each record ID while iterating over the sequences. This will tell you at which point in the file Biopython crashes. Then open the file in a text editor and look at the record that causes the error.
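A minimal sketch of that idea, assuming only that each parsed record has an `.id` attribute. The helper name `find_last_good_record` is made up for illustration and is not part of Biopython. It walks the record iterator and reports the last ID seen before the parser raised, so the broken record should be the one that follows it in the file.

```python
def find_last_good_record(records):
    """Iterate over parsed records and return (count, last_id, error).

    `records` is any iterator of objects with an `.id` attribute, for
    example the iterator returned by Bio.SeqIO.parse(handle, "embl").
    `error` is None if the whole iterator was consumed without a failure.
    """
    count = 0
    last_id = None
    try:
        for record in records:
            count += 1
            last_id = record.id
    except (AssertionError, ValueError) as err:
        # The parser failed on the record that comes after `last_id`.
        return count, last_id, err
    return count, last_id, None


# Possible usage with Biopython (file name taken from the question):
# from Bio import SeqIO
# with open("seq1.embl") as handle:
#     count, last_id, err = find_last_good_record(SeqIO.parse(handle, "embl"))
# print(count, last_id, repr(err))
```

Searching the file for the reported ID in a text editor then puts you just above the record that triggers the error.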


Thanks. For example, I get the same error when I try to parse this EMBL file with Biopython: http://www.ebi.ac.uk/ena/data/view/CM000771&display=txt&expanded=true

8.3 years ago

I think the problem is the CO lines in the file. The Biopython EMBL parser expects the SQ and CO lines to be at the end of the record. Since your CO line comes before all your FT lines, the parser could not find the terminator line (//) where it expected one; it found an XX line instead and raised the AssertionError.
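To check how many records are affected without going through the parser at all, a plain line scan can flag records in which a CO line is followed by FT lines. This is only a sketch: the function name `records_with_early_co` is hypothetical, and it assumes the usual EMBL layout where the two-letter line code sits in the first two columns and the accession is the first field of the ID line.

```python
def records_with_early_co(lines):
    """Return the IDs of records where a CO line appears before an FT line."""
    flagged = []
    current_id = None
    seen_co = False
    for line in lines:
        code = line[:2]
        if code == "ID":
            # "ID   CM000771; SV 1; ..." -> "CM000771"
            current_id = line[2:].split(";")[0].strip()
            seen_co = False
        elif code == "CO":
            seen_co = True
        elif code == "FT" and seen_co:
            # Record each affected ID only once.
            if not flagged or flagged[-1] != current_id:
                flagged.append(current_id)
        elif code == "//":
            # End of record: reset the per-record state.
            current_id = None
            seen_co = False
    return flagged


# Possible usage (file name taken from the question):
# with open("seq1.embl") as handle:
#     print(records_with_early_co(handle))
```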


Thanks a lot. Removing the CO lines from the file does the job.
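For a 450 MB file, editing by hand is impractical, so here is one way the CO lines could be dropped with a small filter. It is a sketch under the assumption that every CO line starts with the code `CO` in the first two columns followed by a space; `strip_co_lines` and the output file name are made up for illustration.

```python
def strip_co_lines(lines):
    """Yield every input line except EMBL CO lines."""
    for line in lines:
        if not line.startswith("CO "):
            yield line


# Possible usage: write a cleaned copy, then parse that copy instead.
# with open("seq1.embl") as src, open("seq1.no_co.embl", "w") as dst:
#     dst.writelines(strip_co_lines(src))
```

Writing a cleaned copy keeps the original download untouched, and the generator processes one line at a time, so the whole file never has to fit in memory.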
