I am trying to parse Humicola genome "embl" format, and extract all its gene id. I quite new in this stuff, so may need some help in termology as well ! Seems in the file, there is more than one section for features (meaning it could be more than 1 genome).how do you call these part in flatfile ?!
Anyway, so for this purpose, I wrote a small python script and used BioPython,
embl = dict() handle = open('MyGenome.embl', "rU") for cnt, record in enumerate(SeqIO.parse(handle, "embl")) : print len(record.features) print cnt handle.close()
but I am getting the following error in 10th record.features.(could be because its corrupted or there is bug withing my biopython (version 1.58) or may be you know an easier parse!)
here is the error
File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 432, in parse_records record = self.parse(handle, do_features) File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 415, in parse if self.feed(handle, consumer, do_features): File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 387, in feed self._feed_feature_table(consumer, self.parse_features(skip=False)) File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 184, in parse_features features.append(self.parse_feature(feature_key, feature_lines)) File "/usr/lib/pymodules/python2.6/Bio/GenBank/Scanner.py", line 281, in parse_feature assert len(qualifiers) > 0
Where I track genes that I am tracking, the one after last one in embl file is like
FT exon complement(494..1420) FT /note="exon_id=CADAORAE00020594"
and after that, another section (genome) start,which is look like this
ID XCFFLD351 Unannotated; unspecified; UNC; 494655 BP. XX AC Z2573200; XX DT 01-OCT-2010 (Rel. 01, Created) DT 01-OCT-2010 (Rel. 01, Last updated, Version 1) XX DE Humicola xisolens XX KW genomic. XX OS Humicola xisolens O45MC-2 OC Eukaryota; Fungi; OC . XX RN  RP 1-494655 RA MXKA; RT Humicola xisolens RL firstname.lastname@example.org; XX DR seq; PC; O45PC. CC length=494655 XX FH Key Location/Qualifiers FH FT CDS complement(join(178..370,439..529,588..678,1009..1060, FT 1584..1773,1831..1948,2838..2879)) FT /protein_id="PE04150001210" FT /db_xref="PEDANT:PE04150001210"
Also, I am wondering how can I extract whole the part that causing error, and paste here ?!! I know its 10th record.feature! here is the file
Thank you for providing the file. There were two issues that were giving the parser problems. Location lines were implicitly wrapped with trailing parentheses:
FT CDS complement(join(263268..263520,263628..264668, FT 264734..264952,265020..265319,265370..265444,265507..265541 FT )) FT /protein_id="PE04150000285" FT /db_xref="PEDANT:PE04150000285"
This caused the parser to misinterpret the '))' as a feature.
The second issue was the header line, which did not specify the type (DNA, RNA or protein):
ID XCFFLD170 Unannotated; unspecified; UNC; 616531 BP.
I checked in fixes to handle both of these issues, along with tests, in this commit:
You can update to the GitHub version, apply the fixes manually to
Bio/GenBank/__init__.py or fix your
EMBL files to avoid those two issues. Hope this helps.