Should We Be Using Much More Json In Our Delimited Data Formats?
6
14
Entering edit mode
13.2 years ago
Casbon ★ 3.3k

Many bioinformatic files are delimited because delimited files are so useful for manipulation and exchange. However, in order to be flexible these delimited files contain a column which can contain a property list. For example, GFF3 says that the attributes column is:

A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.

VCF uses the GFF3 style for the INFO field and also has a FORMAT field which describes the colon separated calls. So it has two ad-hoc embeddings in one file format.

But another big problem is that the data is untyped, requiring constant int, float, string manipulations and checking for null values. For example, '.' is often used to repesent null values.

Would our lives be simpler if we just agreed that all these entries in a delimited file should be JSON? i.e. that each entry in a tab delimited format is valid JSON. This is not a 'just use JSON' statement, but saying use JSON as a sane, backward compatible, encoding for delimited data.

If we were to stick to this convention, we would get type information and property lists no problem. Here are some examples of valid JSON:

  • 1 - the number one
  • "foo" - the string 'foo'
  • [1, "a"] - a list with the number one followed by the string 'a'
  • {"eggs": false, "spam": true} - a property list (or map, dictionary or whatever you call them) with 'spam' set to true and 'eggs' set to false
  • 1e-10 - a small number

So I might have a file that looks like this:

name  Age info
"Foo" 32  {'k': 'v', 'k2': 1e-10}

A simple parser would then look like this:

for line in file:
    yield map(json.loads, line.split('\t'))

(Although clearly we need to encode tabs carefully when writing the file).

I can't believe how many crappy, half specificied formats I've dealt with. So my question is: is this a good idea and should we try and convince GFF/VCF/etc to embrace this in their specs?

EDIT: make it clearer this is not just use JSON, but a new kind of delimited JSON.

gff vcf • 8.0k views
ADD COMMENT
2
Entering edit mode

I think your supposed to +1 the question ;)

ADD REPLY
2
Entering edit mode

So, instead of sticking to well-known formats like CSV and JSON, you are proposing to mix those two together? That immediately means that you cannot use a CSV or JSON library on its own anymore, but that you have to mingle them together instead. I would say use JSON if you have to deal with structured data that will not fit CSV easily, and instead of introducing delimiters, just use a JSON array. To read-in that file in a program, you then only need only line and your whole data structure is in memory.

ADD REPLY
0
Entering edit mode

Yes, please. JSON and YAML are awesome formats; plenty of parsers, human readable and concise. Not sure how to answer your question, so how about just a +1.

ADD REPLY
0
Entering edit mode

I'll take it that's a no ;)

The point is that its backward compatible. I mean csv libraries dont handle type, so you need to layer that over the top anyway.

The reason people don't just use JSON or XML is that: you want to be able to use AWK you want to be able to use excel * you want to be able to use grep

i.e. you need a delimited file. I am trying to suggest a decent encoding for a delimited file.

ADD REPLY
0
Entering edit mode

I get your point, but I would still not mix up formats. If you need a type for your data, stick to CSV/TSV and add a 'datatype' prefix/postfix. For example: Hello^STRING, because then you can still grep/awk easily, you can access the datatype smoothly too, and just pick a symbol as a separator ("^" here) that does not appear as datavalue. If you insist on using Excel, you can even have one field data, next field its datatype, next field data again, then next field type, etc.

ADD REPLY
0
Entering edit mode

So, you're saying it's better to use a prefix type? That doesn't even demonstrate backward compatibility and would produce a load of nonsense loading into R/excel

ADD REPLY
0
Entering edit mode

That is why I said you can also have even fields denoting data and odd fields denoting type. That would be an alternative that also works with R and Excel.

ADD REPLY
0
Entering edit mode

Adds a load of noise and has no exising parsers, though.

ADD REPLY
3
Entering edit mode
13.2 years ago

I would agree that for attribute representation the json encoding would be far superior to the current GFF syntax.

Now there is one overall drawback of formats, most tools start to assume that you always need to parse the entire line to find out what it is stored in it. This can be greatly counterproductive - it is akin to GFF parsers that insist of parsing and interpreting the entire (possibly very lengthy) attribute even when all one needs is to select all starts for a given chromosome. Thus you end up with code that is orders of magnitude slower than a three line python code you could write yourself.

ADD COMMENT
0
Entering edit mode

It would be easy to write a parser that doesn't need to parse the whole file. You take a "conventional" (non streaming) JSON parser, and write a parser that, IF the root element is an array, read each element of the array as a string, then feed it to a conventional (non streaming) parser. This such a parser would only be a few lines of code (since it reuses another JSON parser), and you get a streaming/gradual parser.

ADD REPLY
2
Entering edit mode
13.2 years ago

I would argue that delimited files only appear useful for manipulation and exchange unless they express no more than the simplest possible data.

I don't like to see compound values in tab-delimited files because compound values indicate that the data are hierarchical. I think that it's cleaner and simpler to use a uniform hierarchical representation in such cases. Having to special-case the top level actively discourages the natural recursive approach. A regularly structured data stream and an event-driven parser fills me with peace and joy. The alternatives lead to rage and hate.

ADD COMMENT
0
Entering edit mode

I agree that compound values in delimited files suck, but it looks like all the formats that we have rely on them.

ADD REPLY
2
Entering edit mode
12.3 years ago
Erik Garrison ★ 2.4k

I attempted to design a JSON schema for encoding the same class of data as resides in VCF, https://github.com/ekg/jvcf.

Reading it now with several years of experience to reflect on, I don't see it as a completely correct design. Hopefully, sharing it here can motivate thought on exactly the point that the OP raises.

ADD COMMENT
1
Entering edit mode
13.2 years ago
Mitch Skinner ▴ 660

I'm a big fan of JSON, but I'm not sure that we need a new hybrid format.

You say that plain JSON isn't enough because you can't grep it or use awk, but I think that's just an argument to create a tool that can process JSON in a streaming way with a simple mechanism for pulling out subsets (grep for JSON) and doing basic manipulations on them (awk for JSON).

One of the things I really like about JSON is that you no longer have to worry about underspecified encoding issues (like: what happens when your data contains embedded tabs and newlines?). But a hybrid like you propose would require a separate escaping mechanism, and then you're starting to edge toward a whole new format of your own; I'm not sure that a well-defined hybrid would be any easier to handle than JSON itself. For example, your example parsing code is incomplete because it doesn't unescape embedded tabs/newlines.

Plus, if we had an awk/grep-alike for JSON, then (since it would understand JSON syntax) it could be easier to use than using awk and grep on this hybrid you're proposing (I'm trying to imagine grepping for nested JSON structure on such a hybrid, and at first blush it seems complex).

ADD COMMENT
0
Entering edit mode

I'm not really proposing a new data format, but observing that we have all these delimited exchange formats that have no spec for coding. Why not just agree on the JSON spec for cell coding, since it is so unobtrusive?

ADD REPLY
0
Entering edit mode
13.2 years ago
Neilfws 49k

I like JSON a lot - as a data interchange format, for fetching data from a RESTful web service, parsing into a hash, storing in a database. I'm not convinced that it makes a good general, text document format.

There are those who say "JSON for interchange, XML for documents." And those who argue for one, the other or all points inbetween. Try "JSON versus XML" as a Google search query, for the arguments (and some entertainment).

ADD COMMENT
0
Entering edit mode

I'm making a more nuanced point than just use JSON. i.e. using JSON embedded in delimited files.

ADD REPLY
0
Entering edit mode

Right. Well, why not, it could work. Although in a way, my point about JSON as text versus JSON as an "ephemeral" interchange between server and client still hold.

ADD REPLY
0
Entering edit mode
13.2 years ago
lh3 33k

Using Json brings a lot of troubles to other programming languages. The standard C/C++ library does not come with a Json parser; neither Perl (also true for Ruby?). If you use a 3rd-party parser, you get a library dependency; if you implement it by yourself, it is not trivial. I see this is the major problem.

Also, quotation marks will make AWK less effective. With awk, you need to test:

awk '$1=="\"Foo\""'

As to VCF, I agree VCF should use one delimiter rather than both ; and :, but doing that only makes VCF look cleaner. No real effect.

EDIT: responses to the comments:

  1. Python is not perfect. I buy that it is better than Perl overall, but there are areas where Perl does better especially in Bioinformatics. Also, by design python is not the most elegant language. It is tens of times slower than a couple of other scripting languages, too.
  2. For C/C++ programming, library dependency is a big concern. You use C/C++ because you care about efficiency, but my experiences keep telling me: never trust the coding quality of 3rd-party libraries. Look at the XML parsers. There are many, but only libxml seems to implement the right way. Who knows the quality of these 12+6 json libraries? I bet most do not strive for efficiency. Another problem is licensing.
  3. Most scientific C/C++ software do not use encoding libraries because that leads to nothing but distraction.
ADD COMMENT
2
Entering edit mode

The best way to prove JSON is better than GFF/VCF is to get a JSON-based GFF/VCF widely accepted by the community. Replacing GFF may be too late, but replacing VCF is possible. Nonetheless, you should beware that nearly all widely used formats were designed by the most influential people in Bioinformatics. This at least means something...

ADD REPLY
1
Entering edit mode

12 libs for C, 6 for C++, 1 for perl http://www.json.org/

But look, people are using encodings so why not encourage one with a spec?

ADD REPLY
1
Entering edit mode

Plus if you're using perl you have a bigger problem than a lack of a JSON parser ;)

ADD REPLY
1
Entering edit mode

Influential doesn't always meen correct. The most influential people I've met in bioinformatics tend to be biologists first, and software engineers a VERY distant second.

ADD REPLY
1
Entering edit mode

@desaila: we are discussing engineering problems here and of course I was mainly talking in the engineering aspect. It is true that the most influential people in general have a very strong biology background. You comment reminds me of my only negatively voted answer. I am still right: VCF is becoming the standard and JSON is largely ignored in NGS data processing. Also, I wrote more C/C++ programs than >99% of people in this forum. I am fairly certain of my judgment of what is a good C/C++ program.

ADD REPLY
0
Entering edit mode

There is a JSON gem for Ruby: gem install json

Just saying. :)

ADD REPLY
0
Entering edit mode

Only kidding about perl, but... it seems like you hate dependencies, but wouldn't you prefer it if people stuck to a widely accepted spec for encoding values? The GFF3/VCF one IS worse than JSON.

ADD REPLY
0
Entering edit mode

@lh3 Why not using JSON at least for the INFO column in the VCF ?

ADD REPLY
0
Entering edit mode

@Pierre an unnecessary library dependency alone will push me away.

ADD REPLY
0
Entering edit mode

@Pierre an unnecessary library dependency alone is enough to push me away. Sorry, probably I am overacting, but this is how I write programs.

ADD REPLY
0
Entering edit mode

I do understand your point of view concerning the dependencies. But, for the INFO column, instead of writing "DP=3;AF1=0.5716" the tools generating VCF, including yours, could write '{"DP":3,"AF":0.5716}'. This would allow to put some structured data for the annotation (gene, more than one mRNA }. Then for the analysis/parsing, people would use what they want. But may be biostar is not the best place to discuss this.

ADD REPLY
0
Entering edit mode

... and ok, it's not easy to parse JSON with awk, etc...

ADD REPLY
0
Entering edit mode

... but could we add an option in bcftools to print JSON for INFO ? would you accept a pull request ?

ADD REPLY

Login before adding your answer.

Traffic: 1508 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6