Question: Problem With Parsing Genome File - Embl Format - With Biopython
0
gravatar for User 1933
2.6 years ago by
User 1933160
User 1933160 wrote:

I am trying to parse Humicola genome "embl" format, and extract all its gene id. I quite new in this stuff, so may need some help in termology as well ! Seems in the file, there is more than one section for features (meaning it could be more than 1 genome).how do you call these part in flatfile ?!

Anyway, so for this purpose, I wrote a small python script and used BioPython,

embl = dict()
handle = open('MyGenome.embl', "rU")
for cnt, record in enumerate(SeqIO.parse(handle, "embl")) :
    print len(record.features)
    print cnt
handle.close()

but I am getting the following error in 10th record.features.(could be because its corrupted or there is bug withing my biopython (version 1.58) or may be you know an easier parse!)

here is the error

  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 432, in parse_records
    record = self.parse(handle, do_features)
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 415, in parse
    if self.feed(handle, consumer, do_features):
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 387, in feed
    self._feed_feature_table(consumer, self.parse_features(skip=False))
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 184, in parse_features
    features.append(self.parse_feature(feature_key, feature_lines))
  File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 281, in parse_feature
    assert len(qualifiers) > 0

Where I track genes that I am tracking, the one after last one in embl file is like

FT   exon            complement(494..1420)
FT                   /note="exon_id=CADAORAE00020594"

and after that, another section (genome) start,which is look like this

ID   XCFFLD351  Unannotated; unspecified; UNC; 494655 BP.
XX
AC   Z2573200;
XX
DT   01-OCT-2010 (Rel. 01, Created)
DT   01-OCT-2010 (Rel. 01, Last updated, Version 1)
XX
DE   Humicola xisolens
XX
KW   genomic.
XX
OS   Humicola xisolens O45MC-2
OC   Eukaryota; Fungi;
OC   .
XX
RN   [1]
RP   1-494655
RA   MXKA;
RT   Humicola xisolens
RL   alaki@dolaki.com;
XX
DR   seq; PC; O45PC.
CC   length=494655
XX
FH   Key             Location/Qualifiers
FH
FT   CDS             complement(join(178..370,439..529,588..678,1009..1060,
FT                   1584..1773,1831..1948,2838..2879))
FT                   /protein_id="PE04150001210"
FT                   /db_xref="PEDANT:PE04150001210"

Also, I am wondering how can I extract whole the part that causing error, and paste here ?!! I know its 10th record.feature! here is the file

ADD COMMENTlink modified 2.6 years ago by Brad Chapman8.1k • written 2.6 years ago by User 1933160
1

Could you provide the identifier of the record you're having trouble with?

ADD REPLYlink written 2.6 years ago by Brad Chapman8.1k

I update the question, actually I am wondering what is the best way to extract that part ?! I know its 10th record

ADD REPLYlink written 2.6 years ago by User 1933160

Where is 'MyGenome.embl' from? To look at the 10th record, use 'less MyGenome.embl' and then search for the 10th record (with /ID). Then you can manually look at it to see if it's complete; the error seems to indicate something is wrong with the qualifiers in the feature table (FT section).

ADD REPLYlink written 2.6 years ago by Brad Chapman8.1k

To look at the 10th record, use 'less MyGenome.embl' and then search for the 10th record (with /ID). Then you can manually examine it to see if it's complete; the error seems to indicate something is wrong with the qualifiers in the feature table (FT section).

ADD REPLYlink written 2.6 years ago by Brad Chapman8.1k

the file is almost 110M, but I put new info, which could give you some clue.Thanks

ADD REPLYlink written 2.6 years ago by User 1933160

How was this file generated? It looks pretty non-standard, which might be what is causing issues. If you can't spot the problem by looking at the record, you can add a 'print line' to '/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py' in 'parse_feature' after 'for line in iterator:'. It's hard to help without being able to reproduce this; it would help if you could make the file available.

ADD REPLYlink written 2.6 years ago by Brad Chapman8.1k

I provide a link to file

ADD REPLYlink written 2.6 years ago by User 1933160
3
gravatar for Brad Chapman
2.6 years ago by
Brad Chapman8.1k
Boston, MA
Brad Chapman8.1k wrote:

Thank you for providing the file. There were two issues that were giving the parser problems. Location lines were implicitly wrapped with trailing parentheses:

FT   CDS             complement(join(263268..263520,263628..264668,
FT                   264734..264952,265020..265319,265370..265444,265507..265541
FT                   ))
FT                   /protein_id="PE04150000285"
FT                   /db_xref="PEDANT:PE04150000285"

This caused the parser to misinterpret the '))' as a feature.

The second issue was the header line, which did not specify the type (DNA, RNA or protein):

ID   XCFFLD170  Unannotated; unspecified; UNC; 616531 BP.

I checked in fixes to handle both of these issues, along with tests, in this commit:

https://github.com/biopython/biopython/commit/588e2113b5b264f314c4e603e44ad71b41ef7ffc

You can update to the GitHub version, apply the fixes manually to Bio/GenBank/Scanner.py and Bio/GenBank/__init__.py or fix your EMBL files to avoid those two issues. Hope this helps.

ADD COMMENTlink written 2.6 years ago by Brad Chapman8.1k

Hi Brad, I have a similar problem, where biopython "could not determine alphabet for seqtype". The header I have is like this: ID Pf3D701_v3.embl; SV ; ; ; ; ; 640851 BP. How would you suggest working around this? Is it possible to modify init.py to take this possibility into account? Cheers, Nicolas

ADD REPLYlink written 22 months ago by Nicojo1.0k
Please log in to add an answer.

Help
Access
  • RSS
  • Stats
  • API

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.0.0
Traffic: 332 users visited in the last hour