4.2 years ago by
Seattle, WA USA
Part of the deal with JSON is buy-in. People need to be able to parse, chop, and filter it easily. Tools like jq make this doable on the command line, but the interface has a learning curve.
Further, these tools are not part of a standard Unix setup, whereas utilities like awk are readily available and stable, with interfaces that haven't changed in years.
Also, you need a JSON structure that is consistent. Line-based formats have a fixed set of columns separated by a delimiter, sometimes with a secondary delimiter inside a field to hold multiple records, so they tend to be more consistent.
JSON is more open-ended in terms of what you can put into it, and that means more ambiguity when parsing. Some approaches to resolving this include enforcing schemas, e.g., the use of JSON Schema in other scientific fields (http://json-schema.org/) to enforce structure and type validation.
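To make the idea of structure and type validation concrete, here is a minimal hand-rolled sketch in Python. It is not JSON Schema itself, just a plain check of required keys and types using the standard library. The field names (`chrom`, `start`, `end`, `name`) and the `validate_record` helper are assumptions made up for illustration.

```python
import json

# Hypothetical minimal "schema": required keys and the Python type each must have.
# These field names are illustrative, not taken from any real format.
REQUIRED_FIELDS = {"chrom": str, "start": int, "end": int, "name": str}


def validate_record(record, required=REQUIRED_FIELDS):
    """Return a list of problems found in one parsed JSON record.

    An empty list means the record has every required key with the expected type.
    """
    if not isinstance(record, dict):
        return ["record is not a JSON object"]
    problems = []
    for key, expected in required.items():
        if key not in record:
            problems.append(f"missing field: {key}")
            continue
        value = record[key]
        # bool is a subclass of int in Python, so reject it explicitly;
        # none of the expected types in this sketch is bool.
        if isinstance(value, bool) or not isinstance(value, expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return problems


good = json.loads('{"chrom": "chr1", "start": 100, "end": 200, "name": "geneA"}')
bad = json.loads('{"chrom": "chr1", "start": "100", "end": 200}')
print(validate_record(good))
print(validate_record(bad))
```

A real JSON Schema validator covers much more (nested objects, ranges, patterns), but even a check this small removes a lot of the ambiguity a downstream parser would otherwise have to handle.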
BSON (binary JSON) is an option for indexed lookups. One could imagine a future binary sequencing format that uses something like this, perhaps, but it isn't textual, so you'd need special tools to do queries and processing, much as non-standard tools are required for tabix.
The WashU regulatory browser uses a format for its annotations that mixes JSON and BED. The first three columns represent the position and size of a genomic interval, while the remainder is a JSON-like string (possibly strictly JSON, haven't looked in a while) that sets up key-value pairs for annotation attributes, like gene name, strand, etc. This hybrid approach gives the user the advantages of fast binary searches on sorted BED input, line parsing with common Unix tools, and the open-ended extensibility of a JSON object to describe attributes.
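As a rough sketch of how such a hybrid line could be read, the Python below splits off the first three tab-separated columns and parses the rest as JSON. It assumes the fourth column is strict JSON and that columns are tab-delimited; the example line, the attribute keys, and the `parse_hybrid_line` helper are hypothetical and not taken from the browser's documentation.

```python
import json


def parse_hybrid_line(line):
    """Split a BED-plus-JSON line into (chrom, start, end, attributes).

    Assumes tab-delimited columns and a strict-JSON fourth column.
    """
    # maxsplit=3 keeps any tabs inside the JSON text from being split further.
    chrom, start, end, attrs = line.rstrip("\n").split("\t", 3)
    return chrom, int(start), int(end), json.loads(attrs)


# Hypothetical example line; the attribute names are made up for illustration.
example = 'chr1\t1000\t5000\t{"name": "GENE1", "strand": "+"}\n'
chrom, start, end, attrs = parse_hybrid_line(example)
print(chrom, start, end, attrs["name"], attrs["strand"])
```

Because the interval columns stay plain text, awk and sort still work on the first three fields, and only code that needs the attributes has to touch a JSON parser.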