Question: Parsing List Of Sequences In Embl Format
Pappu (1.9k) wrote 6.3 years ago:

I have downloaded a large file (~450 MB) of sequences in EMBL format from EBI. Now I want to build a Python dictionary that maps each sequence's identifier to its length.

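from Bio import SeqIO
length = {}
handle = open('seq1.embl', 'r')
for record in SeqIO.parse(handle, "embl"):
    # record.id[:-2] drops the last two characters, assumed to be a version suffix such as ".1"
    length[record.id[:-2]] = len(record.seq)
handle.close()
print(length)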

However, this fails with the following error message:

Traceback (most recent call last):
  File "length.py", line 4, in <module>
    for record in SeqIO.parse(handle, "embl"):
  File "/usr/local/lib/python2.7/dist-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 405, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/usr/local/lib/python2.7/dist-packages/Bio/GenBank/Scanner.py", line 558, in parse_footer
    or self.line.strip() == '//', repr(self.line)
AssertionError: 'XX'

The same code works perfectly on another, smaller file of sequences in EMBL format. I tried Biopython 1.60 and 1.60+; both give the same error.


Pappu, it seems that there is at least one erroneous record in your large file. I would print each record ID while iterating over the sequences; the last ID printed tells you roughly where in the file Biopython crashes. Then open the file in a text editor and locate the record that causes the error.

written 6.3 years ago by a.zielezinski (8.6k)
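
The debugging approach suggested in this comment could be sketched roughly as follows. This is only an illustration: `find_failure` is a hypothetical helper name, not part of Biopython, and it assumes nothing about the records beyond each having an `.id` attribute (as `SeqRecord` objects do).

```python
def find_failure(records):
    """Walk an iterator of records and report where parsing stops.

    Returns (count, last_id, error): the number of records that parsed
    cleanly, the id of the last good record, and the exception that
    ended the iteration (None if the iterator finished normally).
    """
    count, last_id = 0, None
    iterator = iter(records)
    while True:
        try:
            record = next(iterator)
        except StopIteration:
            # Reached the end of the input without any error.
            return count, last_id, None
        except Exception as exc:
            # For example, the AssertionError shown in the traceback.
            return count, last_id, exc
        count += 1
        last_id = record.id
        print(record.id)  # progress trace: the last id printed precedes the bad record
```

With Biopython it would be called as `find_failure(SeqIO.parse(handle, "embl"))`. The record that fails is the one that follows `last_id` in the file, so searching for that ID in a text editor should land just above the problem.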

Thanks. For example, I get the same error when I try to parse this EMBL file with Biopython: http://www.ebi.ac.uk/ena/data/view/CM000771&display=txt&expanded=true

written 6.3 years ago by Pappu (1.9k)
Damian Kao (15k, USA) wrote 6.3 years ago:

I think the problem is the CO lines in the file. The Biopython EMBL parser expects the SQ and CO lines to come at the end of a record. Since your CO line appears before all your FT lines, the parser could not find the terminator line (//); it found an XX line instead and raised an error.


Thanks a lot. Removing the CO lines does the job.

written 6.3 years ago by Pappu (1.9k)
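
For a 450 MB file, removing the CO lines by hand is impractical, so the workaround could be scripted along these lines. This is a minimal sketch, not the method anyone in the thread actually used: `strip_co_lines` is a hypothetical helper, the sample text is an invented toy record, and it assumes every CO line starts with the two-letter line code `CO` followed by a space. Note that it discards the contig information carried by those lines.

```python
import io

def strip_co_lines(src, dst):
    """Copy EMBL text from src to dst, skipping CO lines.

    src and dst are text file objects. Returns the number of lines dropped.
    """
    dropped = 0
    for line in src:
        if line.startswith("CO "):  # assumed shape of a CO line
            dropped += 1
            continue
        dst.write(line)
    return dropped

# Toy input, invented for illustration only (not a valid EMBL record).
sample = "ID   X\nCO   join(ABC.1:1..100)\nXX\n//\n"
out = io.StringIO()
dropped = strip_co_lines(io.StringIO(sample), out)
```

On a real file, the same function can be pointed at two open files, for example `seq1.embl` as `src` and a new file opened for writing as `dst`, and the filtered copy then passed to `SeqIO.parse`. Streaming line by line keeps memory use flat regardless of file size.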