Question: Biopython error parsing feature locations in GenBank file
Christelle (United Kingdom) wrote, 3.4 years ago:

Hi there,

I tried to parse an old GenBank file (URL below) to extract various features, which I would then like to write out in GFF format. Parsing the file with Biopython (version 1.64), using the SeqIO.parse method to access the records, produced the error shown below.

GenBank file: ftp://ftp.ensembl.org/pub/release-22/human-22.34d/data/flatfiles/genbank/Homo_sapiens.3000.dat.gz

Biopython error: /opt/apps/python/2.7.3/lib/python2.7/site-packages/Bio/GenBank/__init__.py:1108: BiopythonParserWarning: Couldn't parse feature location: 'AL358792.24.1.166931:3274..3461'
  % (location_line)))

I looked at the Bio/GenBank/__init__.py file and found many regular expressions that check the format of feature locations. These regexps appear to cover the format of the location I encountered, i.e. 'AL358792.24.1.166931:3274..3461' (see the example regexp for complex locations below, taken from __init__.py), so I am not sure why the code raises the BiopythonParserWarning.

Regexp in Bio/GenBank/__init__.py: _complex_location = r"([a-zA-z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?\:)?(%s|%s|%s|%s|%s)" % (_pair_location, _solo_location, _between_location, _within_location, _oneof_location)

Could anybody please help me solve this parsing issue?

Thank you very much for your help.

Tags: parser, genbank, biopython, format

Thank you, Peter, for investigating this and reporting the issue on the Biopython bug tracker. I very much appreciate your help with this.

(Reply written 3.4 years ago by Christelle)

You described a warning which I can reproduce (well, lots and lots of warnings about problematic locations, probably due to the period/dots in the sequence reference name), but what is the error? I don't see any exception and traceback in your question - or do you mean how can we fix the warning?

(Reply written 3.4 years ago by Peter)

You're right: strictly speaking no error is triggered, but a warning is raised because the problematic feature locations cannot be parsed. So yes, I meant how can we fix the BiopythonParserWarning issue, so that we can retrieve the locations for those "problematic" features. Thank you very much for your help.

(Reply written 3.4 years ago by Christelle)
Peter (Scotland, UK) wrote, 3.4 years ago:

I can't find anything in http://www.insdc.org/files/feature_table.html#3.4 to say these locations are invalid, nor in 3.4.12.2 Feature Location of ftp://ftp.ncbi.nih.gov/genbank/gbrel.txt - but on the other hand they don't explicitly describe the valid form of an external reference within a feature location. So this does look like a well defined corner case where the Biopython parser needs to be extended: 

https://github.com/biopython/biopython/issues/677

Update: This has been fixed by relaxing the regular expression to allow multiple dots within the external reference name of an INSDC Feature Location. The fix will be included in Biopython 1.67 onwards.
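To illustrate why the multiple dots matter, here is a minimal sketch using only the standard `re` module. The strict prefix is the reference-name part of the regexp quoted in the question (with the character class written as `a-zA-Z`). The relaxed prefix and the simplified `pair` pattern are my own assumptions for illustration; they are not the actual Biopython patch, which is in the commit linked from the issue.

```python
import re

location = "AL358792.24.1.166931:3274..3461"

# Reference-name prefix as quoted in the question: at most ONE ".suffix"
# is allowed before the colon.
strict_ref = r"[a-zA-Z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)?:"

# Hypothetical relaxed prefix (an assumption, NOT the real Biopython fix):
# any number of ".suffix" groups before the colon.
relaxed_ref = r"[a-zA-Z][a-zA-Z0-9_]*(\.[a-zA-Z0-9]+)*:"

# Simplified stand-in for the pair-location part: plain "start..end" only.
pair = r"\d+\.\.\d+"


def matches(pattern, text):
    """True if the whole of text matches pattern."""
    return re.match("(?:" + pattern + r")\Z", text) is not None


strict_ok = matches(strict_ref + pair, location)
relaxed_ok = matches(relaxed_ref + pair, location)
single_dot_ok = matches(strict_ref + pair, "AL358792.24:3274..3461")

print(strict_ok, relaxed_ok, single_dot_ok)
```

The strict prefix accepts a reference with a single dot such as `AL358792.24` but rejects the Ensembl-style name with three dots, which is consistent with the warnings reported above.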

Christelle: Until Biopython 1.67 is released, the way to get this fix is to install Biopython from source using the git repository. However, if compiling would be a hassle (e.g. on Windows), you would be safe updating to Biopython 1.66 and then manually updating just Bio/GenBank/__init__.py with the new regex from the commit linked in the Biopython issue.
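If neither option is practical, a stopgap is to parse the raw location strings yourself. The sketch below is not part of Biopython: `parse_external_pair` is a hypothetical helper that only handles the simple `REFERENCE:start..end` form seen in the warning message, and it returns the coordinates exactly as written in the file (1-based, inclusive).

```python
import re

# Hypothetical helper, not a Biopython API. It covers only locations of the
# form "REFERENCE:start..end"; joins, complements and fuzzy positions are
# deliberately not handled.
_REF_PAIR = re.compile(
    r"^(?P<ref>[A-Za-z][A-Za-z0-9_.]*):(?P<start>\d+)\.\.(?P<end>\d+)$"
)


def parse_external_pair(location):
    """Return (reference, start, end) as written in the file, or None."""
    m = _REF_PAIR.match(location.strip())
    if m is None:
        return None
    return m.group("ref"), int(m.group("start")), int(m.group("end"))


print(parse_external_pair("AL358792.24.1.166931:3274..3461"))
```

Anything the helper does not recognise comes back as `None`, so unexpected location forms can be logged rather than silently mis-parsed.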

Note you should update from Biopython 1.64 anyway due to a problem with compound locations from EMBL/ENSEMBL style joins which was fixed in Biopython 1.66.
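As a quick way to see which of the two fixes an installed release should contain, here is a small sketch based only on the version numbers mentioned above. The function names are made up for illustration. Note that a git checkout may report a development version string, so the comparison is only meaningful for released versions.

```python
import re


def version_tuple(version):
    """Turn a string like '1.66' into (1, 66), stopping at non-numeric parts."""
    parts = []
    for piece in version.split("."):
        m = re.match(r"\d+", piece)
        if m is None:
            break
        parts.append(int(m.group()))
    return tuple(parts)


def has_join_fix(version):
    """EMBL/ENSEMBL style join fix, per the answer: Biopython 1.66 onwards."""
    return version_tuple(version) >= (1, 66)


def has_multi_dot_fix(version):
    """Multi-dot external reference fix, per the answer: Biopython 1.67 onwards."""
    return version_tuple(version) >= (1, 67)


# With Biopython installed, the version string is available as Bio.__version__:
#   import Bio
#   print(has_multi_dot_fix(Bio.__version__))
for v in ("1.64", "1.66", "1.67"):
    print(v, has_join_fix(v), has_multi_dot_fix(v))
```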
