Many bioinformatic files are delimited because delimited files are so useful for manipulation and exchange. However, in order to stay flexible, these formats usually include a column that can hold a free-form property list. For example, GFF3 says that the attributes column is:
> A list of feature attributes in the format tag=value. Multiple tag=value pairs are separated by semicolons. URL escaping rules are used for tags or values containing the following characters: ",=;". Spaces are allowed in this field, but tabs must be replaced with the %09 URL escape.
VCF uses the GFF3 style for the INFO field and also has a FORMAT field which describes the colon-separated calls, so it has two ad-hoc embeddings in one file format.
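To make the pain concrete, here is roughly what every consumer has to write today just to read a GFF3 attributes column (a minimal sketch; the function name and example string are mine, and a real parser would also have to handle multi-valued tags):

```python
from urllib.parse import unquote

def parse_gff3_attributes(field):
    # split 'tag=value;tag=value' pairs and undo the URL escaping
    attrs = {}
    for pair in field.split(';'):
        tag, _, value = pair.partition('=')
        attrs[unquote(tag)] = unquote(value)
    return attrs

print(parse_gff3_attributes('ID=gene1;Name=foo%2Cbar'))
# {'ID': 'gene1', 'Name': 'foo,bar'}
```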
But another big problem is that the data is untyped, requiring constant int/float/string conversions and checks for null values. For example, '.' is often used to represent null.
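Everyone who consumes such a file ends up writing guess-the-type boilerplate like this (the function is hypothetical, but the shape will be familiar):

```python
def parse_cell(field):
    if field == '.':          # ad-hoc null marker
        return None
    try:
        return int(field)
    except ValueError:
        pass
    try:
        return float(field)
    except ValueError:
        return field          # fall back to string
```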
Would our lives be simpler if we just agreed that all these entries in a delimited file should be JSON? i.e. that each entry in a tab-delimited format is valid JSON. This is not a 'just use JSON' statement; I am saying use JSON as a sane, backward-compatible encoding for the entries of delimited data.
If we were to stick to this convention, we would get type information and property lists no problem. Here are some examples of valid JSON:
- 1 - the number one
- "foo" - the string 'foo'
- [1, "a"] - a list with the number one followed by the string 'a'
- {"eggs": false, "spam": true} - a property list (or map, dictionary or whatever you call them) with 'spam' set to true and 'eggs' set to false
- 1e-10 - a small number
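Each of those parses to a properly typed value with any stock JSON library; in Python, for example:

```python
import json

for text in ['1', '"foo"', '[1, "a"]',
             '{"eggs": false, "spam": true}', '1e-10']:
    value = json.loads(text)
    print(repr(value), type(value).__name__)
# 1 int
# 'foo' str
# [1, 'a'] list
# {'eggs': False, 'spam': True} dict
# 1e-10 float
```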
So I might have a file that looks like this:
```
name    age     info
"Foo"   32      {"k": "v", "k2": 1e-10}
```
A simple parser would then look like this:
```python
import json

def rows(file):
    for line in file:
        # strip the newline so the last cell parses cleanly
        yield [json.loads(cell) for cell in line.rstrip('\n').split('\t')]
```
(Although clearly we need to encode tabs carefully when writing the file).
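In fact the writing side mostly takes care of itself: JSON string escaping already turns a literal tab into \t, so an encoded cell can never contain a real tab and the delimiter stays unambiguous. A sketch of the writer (write_row is just my name for it):

```python
import json
import sys

def write_row(out, values):
    # json.dumps escapes control characters, so no cell can
    # ever contain a literal tab or newline
    out.write('\t'.join(json.dumps(v) for v in values) + '\n')

write_row(sys.stdout, ["Foo", 32, {"k": "v\twith tab", "k2": 1e-10}])
```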
I can't believe how many crappy, half-specified formats I've dealt with. So my question is: is this a good idea, and should we try to convince GFF/VCF/etc. to embrace this in their specs?
EDIT: made it clearer that this is not 'just use JSON', but a new kind of delimited JSON.
I think you're supposed to +1 the question ;)
So, instead of sticking to well-known formats like CSV and JSON, you are proposing to mix the two together? That immediately means that you cannot use a CSV or JSON library on its own anymore; you have to mingle them together instead. I would say use JSON if you have to deal with structured data that will not fit CSV easily, and instead of introducing delimiters, just use a JSON array. To read that file into a program, you then need only one line and your whole data structure is in memory.
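For example (assuming the whole dataset is stored as a single JSON array in a file called data.json):

```python
import json

with open('data.json') as f:
    data = json.load(f)   # the whole structure is now in memory
```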
Yes, please. JSON and YAML are awesome formats; plenty of parsers, human readable and concise. Not sure how to answer your question, so how about just a +1.
I'll take it that's a no ;)
The point is that it's backward compatible. I mean, CSV libraries don't handle types, so you need to layer that over the top anyway.
The reason people don't just use JSON or XML is that:

- you want to be able to use AWK
- you want to be able to use Excel
- you want to be able to use grep
i.e. you need a delimited file. I am trying to suggest a decent encoding for a delimited file.
I get your point, but I would still not mix up formats. If you need a type for your data, stick to CSV/TSV and add a 'datatype' prefix/postfix, for example Hello^STRING. Then you can still grep/awk easily, you can access the datatype smoothly too, and you just pick a separator symbol ("^" here) that does not appear as a data value. If you insist on using Excel, you can even alternate fields: one field of data, the next its datatype, the next data again, and so on.
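A sketch of what reading that postfix scheme might look like (the type names and the cast table here are my own invention, just to illustrate):

```python
CASTS = {'STRING': str, 'INT': int, 'FLOAT': float}

def parse_typed_cell(cell):
    # split 'value^TYPE' on the last separator and cast accordingly
    value, _, typename = cell.rpartition('^')
    return CASTS[typename](value)

print(parse_typed_cell('Hello^STRING'))  # 'Hello'
print(parse_typed_cell('42^INT'))        # 42
```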
So, you're saying it's better to use a prefix type? That isn't even backward compatible, and it would produce a load of nonsense when loaded into R/Excel.
That is why I said you can also have even fields denoting data and odd fields denoting type. That would be an alternative that also works with R and Excel.
Adds a load of noise and has no existing parsers, though.