I have the output of an InterProScan 5 RC7 run in XML format, but I have been unable to find a suitable parser for it. The BioPerl parser (http://search.cpan.org/~cjfields/BioPerl/Bio/SeqIO/interpro.pm) does not understand it; it seems to support only up to version 4. The error message is:
no element found at line 206, column 0, byte 14045 at /opt/local/lib/perl5/site_perl/5.12.3/darwin-thread-multi-2level/XML/Parser.pm line 187
Also, this seems to be a known issue: https://redmine.open-bio.org/issues/3452
I searched for a while but was not able to find:
- a parser in any Bio* library, in any language (apart from generic XML parsers, of course; please do not recommend generic XML parsing)
- an XSLT stylesheet to convert InterProScan 5 RC7 output to, e.g., the InterProScan 4 format
The schema definitions are here: http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5 There are two of them: one for RC1-6 and one for RC7.
If you are among the people responsible for this project, maybe you could also explain:
- why there was such a drastic change to a completely different format
- why no parser or conversion solution was provided
- why the BioPerl/Python communities were not informed
Edit: Thank you for proving me wrong; the developers have taken care of the conversion from the beginning. So we are just lacking native BioPerl/Python support.
This is how my file begins. I think this is also a mistake, because the schema is not referenced correctly: nothing points to a correct schema version.
<protein-matches xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5"> <protein> <sequence md5="alksjfojfjkjhaiuy948iued">MCCXXX ...
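To make the point about the missing schema reference concrete, here is a minimal diagnostic sketch (not a parser for the matches themselves) that reports which namespace the root element declares and whether an `xsi:schemaLocation` attribute is present. The function name `root_schema_info` is my own; it assumes only Python's standard `xml.etree.ElementTree`, and it stops after the root start tag, so it should not need the rest of the file.

```python
import io
import xml.etree.ElementTree as ET

# Standard XML Schema instance namespace, used for xsi:schemaLocation.
XSI = "http://www.w3.org/2001/XMLSchema-instance"

def root_schema_info(source):
    """Return (namespace, schemaLocation) of the root element.

    `source` is a filename or a binary file-like object. Only the first
    start event is consumed, so the remainder of the document is not parsed.
    """
    for _event, elem in ET.iterparse(source, events=("start",)):
        tag = elem.tag
        # ElementTree encodes namespaced tags as "{namespace}localname".
        namespace = tag[1:].split("}", 1)[0] if tag.startswith("{") else None
        location = elem.get("{%s}schemaLocation" % XSI)
        return namespace, location
    return None, None

# Hypothetical sample mimicking the start of my file.
sample = (
    b'<protein-matches '
    b'xmlns="http://www.ebi.ac.uk/interpro/resources/schemas/interproscan5">'
    b'<protein/></protein-matches>'
)
namespace, location = root_schema_info(io.BytesIO(sample))
print(namespace, location)
```

On a header like mine, the second value is `None`: the namespace alone does not say whether the RC1-6 or the RC7 schema applies.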