Firstly, +1 to the two posts above. I have learned a lot. Here is an outsider's view:
- For what kinds of biological data can XML be set as a standard file format?
XML is an ideal format for presenting complex data of small to medium size (e.g. <1GB). Compared with ad hoc text formats, it is much clearer and much less error-prone. I see it has the potential to become the standard format for phylogenetic trees, BLAST output, GenBank/EMBL, PDB and networks (where it already is). On the other hand, XML may not be appropriate for very large data. It is also a little overkill for simple, structureless data that can be represented in TAB-delimited formats.
- How has XML facilitated bioinformatics in general?
XML usually eases format parsing once you get used to XML. As Daniel said, it avoids a lot of pitfalls in parsing formats, especially those formats without a formal spec (e.g. the standard BLAST output and the output of most programs).
However, as an outsider, I also see the following factors that hamper the adoption of XML in Bioinformatics.
Memory. After googling a few XML parser benchmarks (e.g. this one), I have the impression that many streaming XML parsers may still use more memory than the size of the file itself. A few parsers (e.g. libxml2/PULL) implement streaming properly, but they are not the default parsers in Perl and Python.
Speed. Without any evidence, I tend to believe parsing XML is slower than parsing a plain text file. Parsing XML is almost certainly slower than parsing specialized binary formats, probably by a wide margin. This could be a concern for large data sets.
Other factors may be Unix unfriendliness and technical complexity, but perhaps once we get used to XML, these are not major concerns. I do not know.
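On the memory point above, here is a minimal sketch of one way to keep memory roughly bounded with Python's standard library: use `xml.etree.ElementTree.iterparse` and discard each record once it has been handled. The element names (`Hits`, `Hit`, `id`) and the function `count_records` are made up for illustration and are not the real BLAST XML schema. I have not benchmarked this, so take it as an illustration of the idea rather than a claim about actual memory usage.

```python
import io
import xml.etree.ElementTree as ET

def count_records(source, tag="Hit"):
    """Count <tag> elements, discarding each one after it has been seen."""
    n = 0
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == tag:
            n += 1
            # Drop the children processed so far so the in-memory tree
            # does not grow with the size of the file.
            root.clear()
    return n

# Toy input with hypothetical element names.
SAMPLE = b"<Hits><Hit><id>a</id></Hit><Hit><id>b</id></Hit><Hit><id>c</id></Hit></Hits>"
print(count_records(io.BytesIO(SAMPLE)))
```

Without the `root.clear()` call, `iterparse` still builds the whole tree in the background, which is probably one reason a "streaming" parse can end up using as much memory as a full parse.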
EDIT: Let me elaborate more on the last bullet.
Unix unfriendliness. I know there are tools to convert XML to a line-based format (I have used them). But when we want to open multiple XML files without creating temporary files, it becomes a little painful, though solvable.
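To make the "convert to a line-based format" idea concrete, here is a small sketch of such a converter in Python. It is not one of the existing tools I mentioned; the function `xml_to_tsv` and the element names (`Hit`, `id`, `score`) are assumptions for illustration only.

```python
import io
import xml.etree.ElementTree as ET

def xml_to_tsv(source, record_tag="Hit", fields=("id", "score")):
    """Yield one TAB-delimited line per <record_tag> element."""
    for _, elem in ET.iterparse(source):  # default: "end" events only
        if elem.tag == record_tag:
            # A missing child becomes an empty column.
            yield "\t".join(elem.findtext(f, default="") for f in fields)
            elem.clear()  # free the record's children once written

# Toy input with hypothetical element names.
SAMPLE = (b"<Hits>"
          b"<Hit><id>a</id><score>10</score></Hit>"
          b"<Hit><id>b</id><score>7</score></Hit>"
          b"</Hits>")

for line in xml_to_tsv(io.BytesIO(SAMPLE)):
    print(line)
```

In a real script one would presumably pass `sys.stdin.buffer` as `source`, so the output can be piped straight into `cut`, `sort` or `awk`. Combining several XML files in one pipeline still needs something like shell process substitution, which is the "painful, though solvable" part.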
Technical complexity. I certainly do not mean using a DOM parser is complex, but using a SAX/StAX/PULL parser is more complicated. Another thing I mean by "complexity" is that XML is overkill for very simple data.
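A small side-by-side sketch of what I mean by the difference in complexity, using only Python's standard library. The first function uses a tree (DOM-like) API via `xml.etree.ElementTree`; the second does the same job with `xml.sax`, where we have to track parser state by hand. The element names and the helper names (`ids_tree`, `ids_sax`, `IdHandler`) are made up for illustration.

```python
import xml.sax
import xml.etree.ElementTree as ET

# Toy input with hypothetical element names.
SAMPLE = b"<Hits><Hit><id>a</id></Hit><Hit><id>b</id></Hit></Hits>"

def ids_tree(data):
    """Tree style: load everything, then query it."""
    return [e.text for e in ET.fromstring(data).iter("id")]

class IdHandler(xml.sax.ContentHandler):
    """SAX style: callbacks plus hand-written state tracking."""
    def __init__(self):
        super().__init__()
        self.ids = []
        self._in_id = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == "id":
            self._in_id = True
            self._buf = []

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self._in_id:
            self._buf.append(content)

    def endElement(self, name):
        if name == "id":
            self.ids.append("".join(self._buf))
            self._in_id = False

def ids_sax(data):
    handler = IdHandler()
    xml.sax.parseString(data, handler)
    return handler.ids

print(ids_tree(SAMPLE))
print(ids_sax(SAMPLE))
```

Both functions return the same list, but the tree version is one line while the SAX version needs a class with three callbacks and a flag. For a single field this is manageable; for deeply nested records the state tracking grows quickly.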
modified 9.1 years ago
9.1 years ago by
lh3 ♦ 32k