While building a parts library for internal use, I ran into the quirks of the GenBank format and found that almost no GenBank file is up to spec. I started building a tool to iron out those quirks and salvage only the usable parts of GenBank feature annotations for use elsewhere. It became a larger task than I had anticipated, and I thought other people might find it useful or wish to contribute, so I made it open source:
In summary:
- It is a Python package for use with Biopython.
- It maps GenBank feature keys (and in some cases qualifiers) to Sequence Ontology terms.
- It fixes or normalizes GenBank feature qualifiers (annotations) and discards qualifiers that cannot be fixed. This is customizable, so you can add your own salvaging code for specific qualifiers.
- The output is nice, predictable features that can be used elsewhere.
- Masochists can also use this package simply to clean up GenBank feature annotations into valid GenBank.
(A GFF3 exporter is also planned, but it may be a while.)
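To give a rough idea of the two core steps listed above, here is a minimal sketch in plain Python. It is not the package's actual API: the names `FEATURE_KEY_TO_SO`, `ALLOWED_QUALIFIERS`, `map_feature_key` and `clean_qualifiers` are hypothetical, the lookup table covers only a handful of feature keys, and the whitelist is an arbitrary assumption. With Biopython, the feature key and qualifiers would come from the `type` and `qualifiers` attributes of a `SeqFeature`.

```python
# Illustrative sketch only; names and tables are hypothetical,
# not taken from the package described in this post.

# Small sample of GenBank feature keys mapped to Sequence Ontology terms.
FEATURE_KEY_TO_SO = {
    "CDS": ("SO:0000316", "CDS"),
    "gene": ("SO:0000704", "gene"),
    "promoter": ("SO:0000167", "promoter"),
    "terminator": ("SO:0000141", "terminator"),
    "rep_origin": ("SO:0000296", "origin_of_replication"),
}

# Assumed whitelist of qualifiers worth keeping.
ALLOWED_QUALIFIERS = {"gene", "product", "note"}


def map_feature_key(key):
    """Return (SO accession, SO name) for a feature key, or None if unmapped."""
    return FEATURE_KEY_TO_SO.get(key)


def clean_qualifiers(qualifiers):
    """Keep whitelisted qualifiers, normalize their values, drop the rest."""
    cleaned = {}
    for name, values in qualifiers.items():
        if name not in ALLOWED_QUALIFIERS:
            continue  # discard qualifiers we do not know how to salvage
        if isinstance(values, str):
            values = [values]  # accept a bare string as a one-item list
        # Collapse line breaks and repeated spaces left over from line wrapping.
        normalized = [" ".join(value.split()) for value in values]
        normalized = [value for value in normalized if value]
        if normalized:
            cleaned[name] = normalized
    return cleaned


if __name__ == "__main__":
    raw = {
        "product": ["green  fluorescent\nprotein"],
        "ApEinfo_fwdcolor": ["#ff0000"],  # editor-specific, not in the whitelist
    }
    print(map_feature_key("CDS"))
    print(clean_qualifiers(raw))
```

A whitelist is only one possible policy. The customizable salvaging described above would more naturally be a per-qualifier hook that decides whether a value can be repaired or has to be dropped.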