On Common Misuse of File Formats

A small rant about text file format holy wars. I believe that most of the file formats that take part in these holy wars are rather good, and that most of the frustration comes from their misuse.

  • XML is a DOCUMENT language

  • JSON is a SERIALIZATION language

  • TOML and YAML are CONFIGURATION languages. The TOML vs YAML holy war is out of scope for this post

XML is an awesome format for extensible documents. You cannot do (X)HTML or SVG with JSON, and its namespacing features allow you to embed one type of document into another. However, it's a terrible serialization format, which is why XMPP seems so weird: at any given moment during a session, what has been sent is not a well-formed XML document. Just think of the <stream:stream> at the beginning that is not closed until the session ends. Terrible. XML also has no place in RPC-like protocols. SOAP, for example, is another terrible application of XML.
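Both points can be sketched with Python's standard xml.etree.ElementTree. This is only an illustration: the sample document and the is_well_formed helper are made up for this example, and the XMPP fragment is reduced to the opening tag alone.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"
SVG_NS = "http://www.w3.org/2000/svg"

# An XHTML document with an SVG drawing embedded via a second namespace.
document = f"""<html xmlns="{XHTML_NS}">
  <body>
    <p>A circle:</p>
    <svg xmlns="{SVG_NS}" width="20" height="20">
      <circle cx="10" cy="10" r="8"/>
    </svg>
  </body>
</html>"""

root = ET.fromstring(document)

# Every tag carries its namespace, so the two vocabularies never collide.
tags = [element.tag for element in root.iter()]
circles = root.findall(f".//{{{SVG_NS}}}circle")


def is_well_formed(text):
    """Hypothetical helper: True if text parses as a complete XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


# The opening tag of an XMPP stream, as it looks while the session is live.
open_stream = '<stream:stream xmlns:stream="http://etherx.jabber.org/streams">'
stream_ok = is_well_formed(open_stream)  # the parser rejects the unclosed element
```

The document parser is happy with the mixed XHTML and SVG tree, while a plain parser cannot accept the live stream as a document, so stream-based protocols need an incremental parser instead.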

JSON is an awesome human-inspectable serialization language. It's good for network communication and maybe for data export. It's strict, standardized, and readable, at least in its "pretty" form. However, it's a terrible configuration language due to its extremely strict syntax and lack of comments. composer.json and package.json are terrible applications of JSON but, interestingly, composer.lock and package-lock.json are not. Pipenv is an example of correct usage: it uses TOML for the config (Pipfile) and JSON for the lock file (Pipfile.lock).
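To see why the strictness hurts in a hand-edited config, here is a minimal sketch using Python's standard json module. The sample strings and the try_parse helper are invented for this illustration; they are not taken from any real package.json.

```python
import json


def try_parse(text):
    """Hypothetical helper: return the parsed value, or a short rejection note."""
    try:
        return json.loads(text)
    except json.JSONDecodeError as exc:
        return f"rejected: {exc.msg}"


# What a human naturally writes in a config: a comment and a trailing comma.
with_comment = '{\n  // pinned for CI\n  "name": "demo"\n}'
with_trailing_comma = '{"name": "demo", "version": "1.0.0",}'

# What a machine writes: nothing extra, nothing missing.
strict = '{"name": "demo", "version": "1.0.0"}'

comment_result = try_parse(with_comment)
comma_result = try_parse(with_trailing_comma)
strict_result = try_parse(strict)
```

Only the last string parses. For a lock file that is written and read by tools, this is exactly the behaviour you want; for a file edited by hand, it is a constant source of annoyance.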

I want to return to the comparison of TOML and YAML in another post, so let's go directly to the bad parts. Just try to serialize a random dictionary with TOML and look at the result. It will be a mess. Also, many of the features introduced in TOML 0.5, like dotted keys, make sense only when the config is written by a human. TOML is not yet popular enough to have any well-known bad applications, so let's get to YAML. YAML is too insecure to be used as a data exchange format for untrusted data, because many parsers will construct arbitrary objects from tags in the input unless told not to. It also wastes a lot of space on indentation. Ironically, Rails collected both YAML misuses, using it as a parsable language for the POST body (in the past) and as the default serialization language for complex fields in the database. YAML's short JSON-like syntax also exists solely for the convenience of a human author.

Of course, these categories are not isolated, and often the right choice depends on the point of view. RSS/Atom and JSON Feed both make sense, depending on whether you see a feed as an alternative representation of a website as a document, as a stream in serialized form, or as some sort of REST response.

Well, there are no solutions, only trade-offs. Choose your languages wisely.

