Question

How To Parse 'Embl' Annotation File

0

10.5 years ago

fm271 ▴ 20

I have the following content in my EMBL annotation file. I am trying to parse it the same way as a GenBank file, but I repeatedly get an error. How can I read files like this using Biopython?

I am using the following document as my guide: http://biopython.org/DIST/docs/api/Bio.SeqIO-module.html

ID   NRP00000001; PRT; NR2; 1 SQ
XX
MF   10830627
PN   WO9954462
PR   GB19980008350 22-APR-1998
ED   28-OCT-1999 WO9954462 A2
XX
DR   EPOP:AX013047;
DE   Sequence 74 from Patent WO9954462. 
PN   WO9954462-A2/74, 28-OCT-1999
XX
FT   source          1..358
FT                   /organism="Mycobacterium leprae"
FT                   /mol_type="protein"
FT                   /db_xref="taxon:1769"
XX
SQ   Sequence 358 AA; 00001508eba3f78863a4f9cb2463810d; MD5;
//
ID   NRP00000002; PRT; NR2; 1 SQ
XX
MF   22767515
PN   WO0190366
PR   US20000206690P 24-MAY-2000
ED   29-NOV-2001 WO0190366 A2
XX
DR   EPOP:AX312021;
DE   Sequence 5006 from Patent WO0190366. 
PN   WO0190366-A2/5006, 29-NOV-2001
XX
FT   source          1..65
FT                   /organism="Homo sapiens"
FT                   /mol_type="protein"
FT                   /db_xref="taxon:9606"
XX
SQ   Sequence 65 AA; 0000eece8396364fe22b1bdd6821bd63; MD5;
//
ID   NRP00210944; PRT; NR2; 2 SQ
XX
MF   9921525
PN   WO03020945
PR   GB20010021439 05-SEP-2001
ED   13-MAR-2003 WO03020945 A2
XX
DR   EPOP:AX716885;
DE   Sequence 1 from Patent WO03020945. 
PN   WO03020945-A2/1, 13-MAR-2003
XX
DR   USPOP:ABY00072;
DE   Sequence 1 from patent US 7294486. 
PN   US7294486-A/1, 13-NOV-2007
PN   US2005130274 A1 16-JUN-2005
CC   First level of publication supplied by the EPO
XX
FT   source          1..25
FT                   /organism="Streptomyces cattleya"
FT                   /mol_type="protein"
FT                   /db_xref="taxon:29303"
XX
SQ   Sequence 25 AA; 000114cdf14c72e3b188040f9f35f5af; MD5;
//
ID   NRP00210945; PRT; NR2; 1 SQ
XX
MF   9954057
PN   WO2004078914
PR   GB20030004882 04-MAR-2003
ED   16-SEP-2004 WO2004078914 A2
XX
DR   EPOP:CQ871087;
DE   Sequence 7 from Patent WO2004078914. 
PN   WO2004078914-A2/7, 16-SEP-2004
XX
FT   source          1..25
FT                   /organism="unidentified"
FT                   /mol_type="protein"
FT                   /note="Sequence of unknown origin"
FT                   /db_xref="taxon:32644"
XX
SQ   Sequence 25 AA; 000114cdf14c72e3b188040f9f35f5af; MD5;
//

Reading gives me the following error:

>>> SeqIO.read(emblFile, "embl")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 599, in read
    first = iterator.next()
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 395, in feed
    self._feed_first_line(consumer, self.line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 585, in _feed_first_line
    self._feed_first_line_old(consumer, line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 610, in _feed_first_line_old
    self._feed_seq_length(consumer, fields[4])        
IndexError: list index out of range

Parsing gives me the following error:

>>> SeqIO.parse(emblFile, "embl").next()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/SeqIO/__init__.py", line 537, in parse
    for r in i:
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 445, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 428, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 395, in feed
    self._feed_first_line(consumer, self.line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 585, in _feed_first_line
    self._feed_first_line_old(consumer, line)
  File "/Library/Frameworks/EPD64.framework/Versions/7.3/lib/python2.7/site-packages/Bio/GenBank/Scanner.py", line 610, in _feed_first_line_old
    self._feed_seq_length(consumer, fields[4])        
IndexError: list index out of range

parsing annotation biopython • 7.6k views

ADD COMMENT • link updated 10.5 years ago by Peter 6.0k • written 10.5 years ago by fm271 ▴ 20

1

Maybe Biopython can't parse the sequence properly because the SQ line holds an MD5 hash instead of the actual amino acid sequence.

ADD REPLY • link 10.5 years ago by Damian Kao 16k

0

I also found that Biopython cannot do this. Is there another Python module that can?

ADD REPLY • link 10.5 years ago by fm271 ▴ 20

0

What information exactly do you need to extract?

ADD REPLY • link 10.5 years ago by Andrzej Zielezinski 11k

0

I need most of the information in the format above. I decided to write my own parser to extract these values. Thanks for the comment.

ADD REPLY • link 10.5 years ago by fm271 ▴ 20

score 3 · Answer 1 · 2013-10-24

3

10.5 years ago

Hamish ★ 3.2k

This is not EMBL-Bank format. It is instead the EMBL-like format used for the Patent NR databases (PMID:23396323). As such it will not work with standard EMBL-Bank format parsers.

As far as I am aware there are no parsers for this format outside of those used at EMBL-EBI to provide the data in SRS@EMBL-EBI, dbfetch/WSDbfetch and EBI Search/EB-eye. That said, the differences are mostly limited to additional line types, and the feature table format is the one used by the INSDC databases (DDBJ, ENA EMBL-Bank and GenBank), so it should be relatively easy to modify an existing EMBL-Bank entry format parser to handle these annotation-only records. If you also need the sequence data, you will have to get it from the accompanying FASTA sequence data files, or use the EMBL-EBI SRS@EMBL-EBI or dbfetch/WSDbfetch services to fetch the combined version of the entries, which includes both the annotation and the sequence.

Alternatively, if you only need a few bits of information, it may be easier to write a custom parser that targets just the required fields.
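A minimal sketch of such a custom parser, using only the Python standard library and the first two records from the question as sample input. The function name `parse_patent_nr` and the layout of the returned dictionaries are illustrative choices, not part of any existing library. It assumes each line starts with a two-letter code followed by three spaces, that records end with `//`, and that each feature qualifier fits on one line.

```python
import re
from collections import defaultdict

# Matches one-line feature qualifiers such as /organism="Homo sapiens"
QUALIFIER = re.compile(r'/(\w+)="([^"]*)"')
# Matches the SQ summary, e.g. "Sequence 358 AA; <md5 hex>; MD5;"
SQ_LINE = re.compile(r"Sequence (\d+) AA; ([0-9a-fA-F]+); MD5;")


def parse_patent_nr(lines):
    """Yield one dict per record from an iterable of text lines (hypothetical helper)."""
    record = None
    for raw in lines:
        line = raw.rstrip()
        if line == "//":  # record terminator
            if record is not None:
                yield record
            record = None
            continue
        code, value = line[:2], line[5:].strip()
        if code == "XX" or not code.strip():  # spacer or blank line
            continue
        if record is None:
            record = {"fields": defaultdict(list), "qualifiers": {}}
        if code == "ID":
            record["id"] = value.split(";")[0]
        elif code == "FT":
            match = QUALIFIER.search(value)
            if match:
                record["qualifiers"][match.group(1)] = match.group(2)
        elif code == "SQ":
            match = SQ_LINE.match(value)
            if match:
                record["length"] = int(match.group(1))
                record["md5"] = match.group(2)
        else:  # MF, PN, PR, ED, DR, DE, CC, ...: keep every occurrence
            record["fields"][code].append(value)
    if record is not None:  # tolerate a missing final "//"
        yield record


SAMPLE = """\
ID   NRP00000001; PRT; NR2; 1 SQ
XX
MF   10830627
PN   WO9954462
PR   GB19980008350 22-APR-1998
ED   28-OCT-1999 WO9954462 A2
XX
DR   EPOP:AX013047;
DE   Sequence 74 from Patent WO9954462.
PN   WO9954462-A2/74, 28-OCT-1999
XX
FT   source          1..358
FT                   /organism="Mycobacterium leprae"
FT                   /mol_type="protein"
FT                   /db_xref="taxon:1769"
XX
SQ   Sequence 358 AA; 00001508eba3f78863a4f9cb2463810d; MD5;
//
ID   NRP00000002; PRT; NR2; 1 SQ
XX
MF   22767515
PN   WO0190366
PR   US20000206690P 24-MAY-2000
ED   29-NOV-2001 WO0190366 A2
XX
DR   EPOP:AX312021;
DE   Sequence 5006 from Patent WO0190366.
PN   WO0190366-A2/5006, 29-NOV-2001
XX
FT   source          1..65
FT                   /organism="Homo sapiens"
FT                   /mol_type="protein"
FT                   /db_xref="taxon:9606"
XX
SQ   Sequence 65 AA; 0000eece8396364fe22b1bdd6821bd63; MD5;
//
"""

for rec in parse_patent_nr(SAMPLE.splitlines()):
    print(rec["id"], rec.get("length"), rec["qualifiers"].get("organism"))
```

Repeated line types (PN, DR and DE can occur several times per record) are kept as lists so nothing is overwritten. Qualifier values that wrap across several FT lines would need extra handling that this sketch does not attempt.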

Update: see Peter's answer below for details of support in Biopython.

ADD COMMENT • link 10.5 years ago by Hamish ★ 3.2k

0

Thanks, I will write my own parser.

ADD REPLY • link 10.5 years ago by fm271 ▴ 20

0

@fm271 The more community-minded response would be "thanks, I'll try to enhance the Biopython parser to cope with this EMBL-like format, and submit the code changes" and for bonus points "... as a pull request on GitHub including a unit test". Or, "I'll report this to the Biopython developers in case they don't see this question here."

But anyway, see my answer below ;)

ADD REPLY • link 10.5 years ago by Peter 6.0k

0

That's very helpful Hamish - as was this PDF explaining the EMBL-bank patent file fields: http://www.ebi.ac.uk/sites/ebi.ac.uk/files/groups/external_services/patentdata/Non-redundant_databases-user_manual_v3.pdf

ADD REPLY • link 10.5 years ago by Peter 6.0k

score 2 · Answer 2 · 2013-10-24

The next release (Biopython 1.63) should be able to parse these files as a variant of embl format (although not all the information would be captured, feedback welcome): https://github.com/biopython/biopython/commit/b06ab99c961a69356d7a20f3853bd851195930e6 & https://github.com/biopython/biopython/commit/559ffd7f83a52d8b1a8ee0802ebbb91c6dc9fa09

The brave can try this now by getting Biopython from GitHub and compiling it from source, or updating Biopython 1.62 with the two changed files (Bio/GenBank/Scanner.py and Bio/GenBank/__init__.py).

Note that Bio.SeqIO.read(…) is intended for files containing one and only one record. It should give an error on this file because the file contains more than one record. Use the Bio.SeqIO.parse(…) iterator instead.
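A short sketch of the difference between the two calls, assuming Biopython is installed. The helper name `load_all` is illustrative, and two tiny made-up FASTA records stand in for a multi-record file only to keep the demonstration compact. For the file in the question the format name would be "embl", on a Biopython release that includes the changes linked above.

```python
from io import StringIO

from Bio import SeqIO  # third-party dependency: Biopython


def load_all(handle, fmt="embl"):
    """Return every record in a multi-record file as a list (hypothetical helper)."""
    return list(SeqIO.parse(handle, fmt))


# Made-up stand-in for a file that holds more than one record.
TWO_RECORDS = ">seq1\nMKV\n>seq2\nMKT\n"

records = load_all(StringIO(TWO_RECORDS), "fasta")
print([rec.id for rec in records])

# SeqIO.read accepts exactly one record, so it raises ValueError here.
try:
    SeqIO.read(StringIO(TWO_RECORDS), "fasta")
except ValueError as err:
    print("SeqIO.read refused the input:", err)
```

On large files it is usually better to loop over `SeqIO.parse(...)` directly instead of building a list, so that only one record is held in memory at a time.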

Thanks Hamish for those links, this PDF was particularly helpful: http://www.ebi.ac.uk/sites/ebi.ac.uk/files/groups/external_services/patentdata/Non-redundant_databases-user_manual_v3.pdf