Question: Should We Be Using Much More JSON In Our Delimited Data Formats?
14 votes • Casbon (3.2k) wrote 8.1 years ago:

Many bioinformatic files are delimited because delimited files are so useful for manipulation and exchange. However, in order to stay flexible, these delimited formats often include a column that holds a property list. For example, GFF3 says that the attributes column is:

A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.
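
For illustration, a typical attributes column under those rules might look like this (the feature names here are hypothetical; note the %3B escaping a literal semicolon inside a value):

ID=gene00001;Name=EDEN;Note=putative%3B%20kinase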

VCF uses the GFF3 style for the INFO field and also has a FORMAT field which describes the colon-separated calls. So it has two ad-hoc embeddings in one file format.

But another big problem is that the data is untyped, requiring constant int/float/string conversions and checks for null values. For example, '.' is often used to represent null values.

Would our lives be simpler if we just agreed that all these entries in a delimited file should be JSON, i.e. that each entry in a tab-delimited format is valid JSON? This is not a 'just use JSON' statement; it is saying: use JSON as a sane, backward-compatible encoding for delimited data.

If we were to stick to this convention, we would get type information and property lists for free. Here are some examples of valid JSON:

  • 1 - the number one
  • "foo" - the string 'foo'
  • [1, "a"] - a list with the number one followed by the string 'a'
  • {"eggs": false, "spam": true} - a property list (or map, dictionary or whatever you call them) with 'spam' set to true and 'eggs' set to false
  • 1e-10 - a small number

So I might have a file that looks like this:

name  age info
"Foo" 32  {"k": "v", "k2": 1e-10}

A simple parser (the body of a generator) would then look like this:

import json

for line in file:
    yield [json.loads(cell) for cell in line.split('\t')]

(Although clearly we need to encode tabs carefully when writing the file).
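
In fact a standard JSON encoder already does that escaping for us: literal tabs and newlines inside strings come out as \t and \n. A minimal writer sketch (write_record is a made-up name):

import json

def write_record(out, cells):
    # json.dumps escapes raw tabs/newlines inside strings,
    # so no emitted cell can contain a literal tab or newline
    out.write('\t'.join(json.dumps(cell) for cell in cells) + '\n')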

I can't believe how many crappy, half-specified formats I've dealt with. So my question is: is this a good idea, and should we try to convince GFF/VCF/etc. to embrace this in their specs?

EDIT: made it clearer that this is not 'just use JSON', but a new kind of delimited JSON.

vcf gff • 5.1k views
modified 8.1 years ago by Erik Garrison (2.1k) • written 8.1 years ago by Casbon (3.2k)
+2

I think you're supposed to +1 the question ;)

written 8.1 years ago by Casbon (3.2k)
+2

So, instead of sticking to well-known formats like CSV and JSON, you are proposing to mix the two together? That immediately means you cannot use a CSV or JSON library on its own anymore; you have to mingle them together instead. I would say: use JSON if you have to deal with structured data that will not fit CSV easily, and instead of introducing delimiters, just use a JSON array. To read that file into a program, you then need only one line, and your whole data structure is in memory.
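
For illustration, that really is one line of reading code; a minimal sketch, assuming a hypothetical data.json that holds the whole table as a top-level JSON array:

import json

# 'data.json' is a made-up file name; the whole table lands in memory
with open('data.json') as f:
    records = json.load(f)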

written 8.1 years ago by Joachim (2.8k)

Yes, please. JSON and YAML are awesome formats; plenty of parsers, human readable and concise. Not sure how to answer your question, so how about just a +1.

written 8.1 years ago by Brad Chapman (9.3k)

I'll take it that's a no ;)

The point is that it's backward compatible. I mean, CSV libraries don't handle types, so you need to layer that over the top anyway.

The reason people don't just use JSON or XML is that:

  • you want to be able to use awk
  • you want to be able to use Excel
  • you want to be able to use grep

i.e. you need a delimited file. I am trying to suggest a decent encoding for a delimited file.

written 8.1 years ago by Casbon (3.2k)

I get your point, but I would still not mix formats. If you need a type for your data, stick to CSV/TSV and add a 'datatype' prefix/postfix, for example Hello^STRING: you can still grep/awk easily, you can access the datatype smoothly, and you just pick a separator symbol ("^" here) that does not appear as a data value. If you insist on using Excel, you can even have one field of data, the next field its datatype, the next field data again, then the next field its type, and so on.

written 8.1 years ago by Joachim (2.8k)

So you're saying it's better to use a prefix type? That doesn't even demonstrate backward compatibility, and it would produce a load of nonsense when loaded into R/Excel.

written 8.1 years ago by Casbon (3.2k)

That is why I said you can also have even fields denoting data and odd fields denoting type. That would be an alternative that also works with R and Excel.

written 8.1 years ago by Joachim (2.8k)

It adds a load of noise and has no existing parsers, though.

written 8.1 years ago by Casbon (3.2k)
3 votes • Istvan Albert ♦♦ (79k), University Park, USA, wrote 8.1 years ago:

I would agree that for attribute representation the JSON encoding would be far superior to the current GFF syntax.

There is one general drawback, though: most tools start to assume that you always need to parse the entire line to find out what is stored in it. This can be greatly counterproductive; it is akin to GFF parsers that insist on parsing and interpreting the entire (possibly very lengthy) attribute column even when all one needs is to select all starts for a given chromosome. Thus you end up with code that is orders of magnitude slower than the three lines of Python you could write yourself.
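
For the record, the kind of three-line selective parser meant here is easy to write; a minimal sketch, assuming GFF-style column order (starts_for is a made-up name):

def starts_for(file, chrom):
    # split off only the leading columns; the possibly huge trailing
    # attribute column stays unsplit and is never decoded
    for line in file:
        fields = line.split('\t', 4)
        if fields[0] == chrom:
            yield int(fields[3])  # 'start' is the fourth GFF column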

modified 8.1 years ago • written 8.1 years ago by Istvan Albert ♦♦ (79k)

It would be easy to write a parser that doesn't need to parse the whole file. Take a "conventional" (non-streaming) JSON parser and wrap it: if the root element is an array, read each element of the array as a string and feed it to the conventional parser. Such a parser would be only a few lines of code (since it reuses another JSON parser), and you get a streaming/gradual parser.
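
A rough sketch of that wrapper, using the Python standard library's JSONDecoder.raw_decode, which decodes one value starting at a given offset (this assumes the text is already in memory, so it is gradual rather than truly streaming):

import json

def iter_array(text):
    # yield the elements of a top-level JSON array one at a time
    decoder = json.JSONDecoder()
    pos = text.index('[') + 1
    while True:
        while pos < len(text) and text[pos] in ' \t\r\n,':
            pos += 1  # skip whitespace and element separators
        if pos >= len(text) or text[pos] == ']':
            return
        element, pos = decoder.raw_decode(text, pos)
        yield element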

written 3.3 years ago by maxime.levesque (0)
2 votes • iw9oel_ad (6.0k) wrote 8.1 years ago:

I would argue that delimited files merely appear useful for manipulation and exchange; they are only genuinely so when they express no more than the simplest possible data.

I don't like to see compound values in tab-delimited files, because compound values indicate that the data are hierarchical. I think it's cleaner and simpler to use a uniform hierarchical representation in such cases. Having to special-case the top level actively discourages the natural recursive approach. A regularly structured data stream and an event-driven parser fill me with peace and joy. The alternatives lead to rage and hate.

modified 8.1 years ago • written 8.1 years ago by iw9oel_ad (6.0k)

I agree that compound values in delimited files suck, but it looks like all the formats that we have rely on them.

written 8.1 years ago by Casbon (3.2k)
2 votes • Erik Garrison (2.1k), Somerville, MA, wrote 7.2 years ago:

I attempted to design a JSON schema for encoding the same class of data as resides in VCF: https://github.com/ekg/jvcf.

Reading it now with several years of experience to reflect on, I don't see it as a completely correct design. Hopefully, sharing it here can motivate thought on exactly the point that the OP raises.

written 7.2 years ago by Erik Garrison (2.1k)
1 vote • Mitch Skinner (660), Emeryville, CA, wrote 8.1 years ago:

I'm a big fan of JSON, but I'm not sure that we need a new hybrid format.

You say that plain JSON isn't enough because you can't grep it or use awk, but I think that's just an argument to create a tool that can process JSON in a streaming way with a simple mechanism for pulling out subsets (grep for JSON) and doing basic manipulations on them (awk for JSON).
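
As a sketch of what such a "grep for JSON" could look like (everything here is made up for illustration; it assumes one JSON value per line, and the example predicate assumes each value is an object):

import json
import sys

def jgrep(lines, predicate):
    # yield raw lines whose decoded JSON value satisfies the predicate
    for line in lines:
        try:
            if predicate(json.loads(line)):
                yield line
        except ValueError:  # not valid JSON: skip the line
            continue

# e.g. keep records whose "chrom" field is "chr1"
for hit in jgrep(sys.stdin, lambda rec: rec.get("chrom") == "chr1"):
    sys.stdout.write(hit)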

One of the things I really like about JSON is that you no longer have to worry about underspecified encoding issues (like: what happens when your data contains embedded tabs and newlines?). But a hybrid like you propose would require a separate escaping mechanism, and then you're starting to edge toward a whole new format of your own; I'm not sure that a well-defined hybrid would be any easier to handle than JSON itself. For example, your example parsing code is incomplete because it doesn't unescape embedded tabs/newlines.

Plus, if we had an awk/grep-alike for JSON then, since it would understand JSON syntax, it could be easier to use than awk and grep on the hybrid you're proposing (I'm trying to imagine grepping for nested JSON structure in such a hybrid, and at first blush it seems complex).

written 8.1 years ago by Mitch Skinner (660)

I'm not really proposing a new data format, but observing that we have all these delimited exchange formats with no spec for encoding the cells. Why not just agree on the JSON spec for cell encoding, since it is so unobtrusive?

written 8.1 years ago by Casbon (3.2k)
0 votes • Neilfws (48k), Sydney, Australia, wrote 8.1 years ago:

I like JSON a lot - as a data interchange format, for fetching data from a RESTful web service, parsing into a hash, storing in a database. I'm not convinced that it makes a good general text document format.

There are those who say "JSON for interchange, XML for documents", and those who argue for one, the other, or all points in between. Try "JSON versus XML" as a Google search query for the arguments (and some entertainment).

written 8.1 years ago by Neilfws (48k)

I'm making a more nuanced point than 'just use JSON': I mean using JSON embedded in delimited files.

written 8.1 years ago by Casbon (3.2k)

Right. Well, why not, it could work. Although in a way my point about JSON as text versus JSON as an "ephemeral" interchange between server and client still holds.

written 8.1 years ago by Neilfws (48k)
0 votes • lh3 (31k), United States, wrote 8.1 years ago:

Using JSON brings a lot of trouble to other programming languages. The standard C/C++ library does not come with a JSON parser; neither does Perl (also true for Ruby?). If you use a third-party parser, you get a library dependency; if you implement it yourself, it is not trivial. I see this as the major problem.

Also, quotation marks will make AWK less effective. With awk, you need to test:

awk '$1=="\"Foo\""'

As to VCF, I agree VCF should use one delimiter rather than both ";" and ":", but doing that only makes VCF look cleaner. No real effect.

EDIT: responses to the comments:

  1. Python is not perfect. I buy that it is better than Perl overall, but there are areas where Perl does better, especially in bioinformatics. Also, by design Python is not the most elegant language, and it is tens of times slower than a couple of other scripting languages.

  2. For C/C++ programming, library dependency is a big concern. You use C/C++ because you care about efficiency, but my experience keeps telling me: never trust the coding quality of third-party libraries. Look at the XML parsers: there are many, but only libxml seems to implement things the right way. Who knows the quality of those 12+6 JSON libraries? I bet most do not strive for efficiency. Another problem is licensing.

  3. Most scientific C/C++ software does not use encoding libraries, because that leads to nothing but distraction.

modified 8.1 years ago • written 8.1 years ago by lh3 (31k)
+2

The best way to prove JSON is better than GFF/VCF is to get a JSON-based GFF/VCF widely accepted by the community. Replacing GFF may be too late, but replacing VCF is possible. Nonetheless, you should be aware that nearly all widely used formats were designed by the most influential people in bioinformatics. That at least means something...

written 8.1 years ago by lh3 (31k)
+1

12 libraries for C, 6 for C++, 1 for Perl: http://www.json.org/

But look, people are using ad-hoc encodings anyway, so why not encourage one with a spec?

written 8.1 years ago by Casbon (3.2k)
+1

Plus, if you're using Perl you have a bigger problem than a lack of a JSON parser ;)

written 8.1 years ago by Casbon (3.2k)
+1

Influential doesn't always mean correct. The most influential people I've met in bioinformatics tend to be biologists first and software engineers a VERY distant second.

written 7.6 years ago by mylons (130)
+1

@desaila: we are discussing engineering problems here, and of course I was mainly talking about the engineering aspect. It is true that the most influential people generally have a very strong biology background. Your comment reminds me of my only negatively voted answer. I am still right: VCF is becoming the standard and JSON is largely ignored in NGS data processing. Also, I have written more C/C++ programs than >99% of the people in this forum. I am fairly certain of my judgment of what makes a good C/C++ program.

written 7.6 years ago by lh3 (31k)

There is a JSON gem for Ruby: gem install json

Just saying. :)

written 8.1 years ago by Joachim (2.8k)

Only kidding about Perl, but... it seems like you hate dependencies, so wouldn't you prefer it if people stuck to a widely accepted spec for encoding values? The GFF3/VCF one IS worse than JSON.

written 8.1 years ago by Casbon (3.2k)

@lh3 Why not use JSON at least for the INFO column in VCF?

written 7.2 years ago by Pierre Lindenbaum (118k)

@Pierre an unnecessary library dependency alone is enough to push me away. Sorry, perhaps I am overreacting, but this is how I write programs.

written 7.2 years ago by lh3 (31k)

I do understand your point of view concerning the dependencies. But for the INFO column, instead of writing "DP=3;AF1=0.5716", the tools generating VCF, including yours, could write '{"DP":3,"AF1":0.5716}'. This would make it possible to put structured data in the annotations (a gene, more than one mRNA, ...). Then for analysis/parsing, people could use whatever they want. But maybe Biostar is not the best place to discuss this.
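
A minimal sketch of that conversion (info_to_json is a made-up name; values are kept as strings rather than typed, and bare flag entries such as "DB" become true):

import json

def info_to_json(info):
    # convert a semicolon-delimited VCF INFO string into a JSON object
    fields = {}
    for entry in info.split(';'):
        key, sep, value = entry.partition('=')
        fields[key] = value if sep else True
    return json.dumps(fields)

print(info_to_json("DP=3;AF1=0.5716"))  # {"DP": "3", "AF1": "0.5716"}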

written 7.2 years ago by Pierre Lindenbaum (118k)

... and ok, it's not easy to parse JSON with awk, etc...

written 7.2 years ago by Pierre Lindenbaum (118k)

... but could we add an option in bcftools to print JSON for INFO? Would you accept a pull request?

written 7.2 years ago by Pierre Lindenbaum (118k)