It's often said that XML is very verbose and therefore JSON is better. I wanted to challenge that assumption and find the smallest way to represent any JSON value using XML.
| Type | JSON | XML | Size difference |
| Constants | true | <t/> | 0 |
| | false | <f/> | - 1 |
| | null | <l/> | 0 |
| Number | | <n/> | + 1 |
| | | | + 6 |
| String | "" | <s/> | + 2 |
| | | | + 4 |
| Array | [] | <a/> | + 2 |
| | [1, "two", false] | <a><n>1</n><s>two</s><f></f></a> | + 4 - n |
| Object | {} | <o/> | + 2 |
| | {"first": 1, "second": "two", "third": false} | <o><n k="first">1</n><s k="second">two</s><f k="third"></f></o> | + 5 + n |
As you can see, the XML counterpart is a bit more verbose, and less readable because of the way the syntax highlighting is set up. However, while it is bigger, it isn't disproportionately bigger: the structure can be at most twice as big.
Implementation
In order to implement it, I decided to go with the same API as JSON:
- XSON.stringify(object, formatter, space)
- XSON.parse(string)
You can play with it on this page or check it out on GitHub.
The implementation was more straightforward than I expected thanks to the fact that there's an XML parser built into browsers. However, I had to deal with nasty encoding issues 🙁
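To make the mapping from the table concrete, here is a minimal sketch of the serializing half, written as plain string building. It is not the XSON.stringify implementation: the names toXson, element and escapeXml are hypothetical, it ignores the formatter and space arguments, it always writes numbers in full, and it escapes text with standard XML entities rather than the custom encoding described in the next section.

```javascript
// Hypothetical stand-in for text escaping: standard XML entities only.
function escapeXml(str) {
  return str
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Build <tag k="key">body</tag>, or the short form <tag k="key"/> when body is empty.
function element(tag, key, body) {
  var attr = key === undefined ? '' : ' k="' + escapeXml(key) + '"';
  return body === ''
    ? '<' + tag + attr + '/>'
    : '<' + tag + attr + '>' + body + '</' + tag + '>';
}

// Convert a JSON value to its one-letter-tag XML form (assumed mapping:
// t = true, f = false, l = null, n = number, s = string, a = array, o = object).
function toXson(value, key) {
  if (value === true) return element('t', key, '');
  if (value === false) return element('f', key, '');
  if (value === null) return element('l', key, '');
  if (typeof value === 'number' && isFinite(value)) {
    return element('n', key, String(value));
  }
  if (typeof value === 'string') return element('s', key, escapeXml(value));
  if (Array.isArray(value)) {
    // Array items carry no key attribute.
    var items = value.map(function (item) { return toXson(item); });
    return element('a', key, items.join(''));
  }
  if (typeof value === 'object') {
    // Object members carry their name in the k attribute.
    var members = Object.keys(value).map(function (name) {
      return toXson(value[name], name);
    });
    return element('o', key, members.join(''));
  }
  throw new TypeError('Not a JSON value: ' + String(value));
}
```

With this sketch, toXson({first: 1, second: "two", third: false}) returns <o><n k="first">1</n><s k="second">two</s><f k="third"/></o>. The only difference from the table's example is that empty elements are written in the short form.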
String encoding
There are some characters such as < and \0 that we want to escape because otherwise they are likely to be problematic while parsing the XML. The way to encode those characters in XML is to use the &#number; notation, where number is the character code. For example, a is represented by &#97;:
> new DOMParser().parseFromString('<a>&#97;</a>', 'text/xml')
    .getElementsByTagName('a')[0].textContent
"a"
Unfortunately, you cannot express all the characters with this notation. The XML specification introduces Restricted Characters in the following ranges: [#x1-#x8], [#xB-#xC], [#xE-#x1F], [#x7F-#x84] and [#x86-#x9F]. When you try to read those characters, the XML parser generates an error.
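The ranges listed above can be turned into a small lookup, which makes it possible to spot a problematic character before the string reaches the parser. This is only a sketch: the names RESTRICTED_RANGES, isRestrictedChar and firstRestrictedIndex are hypothetical and are not part of XSON.

```javascript
// Inclusive [low, high] code ranges, copied from the list in the text.
var RESTRICTED_RANGES = [
  [0x1, 0x8],
  [0xB, 0xC],
  [0xE, 0x1F],
  [0x7F, 0x84],
  [0x86, 0x9F]
];

// True when the first character of `ch` falls in one of the ranges.
// Note: \u0000 is outside these ranges, yet it is not a legal XML
// character either, so a caller may want to test for it separately.
function isRestrictedChar(ch) {
  var code = ch.charCodeAt(0);
  return RESTRICTED_RANGES.some(function (range) {
    return code >= range[0] && code <= range[1];
  });
}

// Index of the first restricted character in `str`, or -1 if there is none.
function firstRestrictedIndex(str) {
  for (var i = 0; i < str.length; i++) {
    if (isRestrictedChar(str.charAt(i))) {
      return i;
    }
  }
  return -1;
}
```

Tab, line feed and carriage return (#x9, #xA and #xD) sit in the gaps between the ranges, so isRestrictedChar returns false for them.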
> new DOMParser().parseFromString('<a>&#0;</a>', 'text/xml')
    .getElementsByTagName('a')[0].textContent
"error on line 1 at column 8: xmlParseCharRef: invalid xmlChar value 0"
Instead of fighting with the XML spec, I decided to use my own encoding. I replace the character by \u0000, where the number is the hexadecimal representation of the character code, padded so it has exactly four digits.
To do that, we first need to escape all the \ and can then use a regex to do it in a few lines of code 🙂
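function encode(str) {
  return str
    // Escape backslashes first so they cannot be confused with the \u escapes added below.
    .replace(/\\/g, '\\\\')
    // Replace control characters and XML-sensitive characters with \uXXXX.
    .replace(/[\u0000-\u0008\u000b-\u001f&<>"\n\t]/g, function(c) {
      var hex = c.charCodeAt(0).toString(16);
      // Left-pad with zeros to exactly four hex digits.
      while (hex.length < 4) {
        hex = '0' + hex;
      }
      return '\\u' + hex;
    });
}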
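As a side note, the escape for a single character can be written more compactly in engines that support String.prototype.padStart (ES2017). The helper name toUnicodeEscape is hypothetical; this is a sketch of the four-digit format described above, not code from XSON.

```javascript
// Format one character as a backslash, the letter u, and exactly four
// lowercase hex digits, e.g. '<' (code 0x3c) becomes \u003c.
function toUnicodeEscape(ch) {
  return '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0');
}
```

For example, toUnicodeEscape('<') yields the six characters \u003c and toUnicodeEscape('\n') yields \u000a.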
Then, in order to decode it, we do the opposite: we first decode all the unicode escapes and then unescape the backslashes. To make sure that a unicode escape was not itself escaped, I'm using a small trick: count the number of \ in front of it. If it's an even number, then it is not escaped, otherwise it is escaped!
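function decode(str) {
  return str
    // Group 1 captures the backslashes in front of the \u escape.
    .replace(/(\\*)\\u([0-9a-f]{4})/g, function(match, backslash, n) {
      // Odd count: the backslash of \u is itself escaped, so leave the text alone.
      if (backslash.length % 2 !== 0) {
        return match;
      }
      return backslash + String.fromCharCode(parseInt(n, 16));
    })
    // Finally turn each escaped backslash back into a single one.
    .replace(/\\\\/g, '\\');
}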
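The even/odd trick can also be looked at in isolation. The sketch below counts the backslashes directly in front of a given position, which is what the first capture group of the regex does. The name isEscaped is hypothetical and the helper is only meant to illustrate the parity rule.

```javascript
// True when the character at `index` is preceded by an odd number of
// consecutive backslashes, meaning the last of them escapes it.
function isEscaped(str, index) {
  var count = 0;
  for (var i = index - 1; i >= 0 && str.charAt(i) === '\\'; i--) {
    count++;
  }
  return count % 2 === 1;
}

// Each string below is shown as it looks once the JavaScript literal is read.
isEscaped('\\u0041', 0);       // \u0041   -> false: a real escape
isEscaped('\\\\u0041', 1);     // \\u0041  -> true: a literal backslash, then u0041
isEscaped('\\\\\\u0041', 2);   // \\\u0041 -> false: a literal backslash, then a real escape
```

In each call the index points at the backslash that sits right before the u.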