I'm now in a group full of compiler engineers at Facebook and learning a lot. Yesterday, I read a post by David Detlefs about how to efficiently encode strings for concatenation and since it's very clever I figured I would share it.

Problem

A lot of programs are taking a string as input and building a string as output. You can imagine the following code. Note: I'm going to use JavaScript as an example but it applies to almost all the languages out there.

var str = '\n';
for (var elem of elems) {
  str += ' * ';
  if (elem.isExpired) {
    str += '[expired] ';
  }
  str += elem.name + '\n';
}

An example output might be

'
 * Nutella
 * Eggs
 * [expired] Milk
'

What is being executed is

'\n' + ' * ' + 'Nutella' + '\n' + ' * ' + 'Eggs' + '\n' + ' * ' + '[expired] ' + 'Milk' + '\n'

If you implement this naively, the execution would look something like:

'\n' + ' * ' = '\n * '
'\n * ' + 'Nutella' = '\n * Nutella'
'\n * Nutella' + '\n' = '\n * Nutella\n'
'\n * Nutella\n' + ' * ' = '\n * Nutella\n * '
'\n * Nutella\n * ' + 'Eggs' = '\n * Nutella\n * Eggs'
'\n * Nutella\n * Eggs' + '\n' = '\n * Nutella\n * Eggs\n'
'\n * Nutella\n * Eggs\n' + ' * ' = '\n * Nutella\n * Eggs\n * '
'\n * Nutella\n * Eggs\n * ' + '[expired] ' = '\n * Nutella\n * Eggs\n * [expired] '
'\n * Nutella\n * Eggs\n * [expired] ' + 'Milk' = '\n * Nutella\n * Eggs\n * [expired] Milk'
'\n * Nutella\n * Eggs\n * [expired] Milk' + '\n' = '\n * Nutella\n * Eggs\n * [expired] Milk\n'

Because strings are immutable, we need to do a full copy of the string for every small concatenation. In practice this turn a O(n) algorithm into O(n²).

Solution 1: Change the code

If this becomes a bottleneck, the usual solution is to implement something call a StringBuilder pattern. Instead of using a string all the way through, you can use an array and push all the string pieces to it. Once you are done building the result, you can join all the pieces together into the final string. Since at this point you know all the strings the operation can count all the sizes and allocate exactly the right size.

var buffer = ['\n'];
for (var elem of elems) {
  buffer.push(' * ');
  if (elem.isExpired) {
    buffer.push('[expired] ');
  }
  buffer.push(elem.name, '\n');
}
str = buffer.join('')

This pattern works to solve the problem but requires the programmer to know about it and the performance to be bad enough that it is worth writing code in a different way. In practice, a lot of code is not written that way and it's unclear that any amount of education will change this fact.

Broken Solution 2: Mutate the original string

The solution that comes to mind is: can normal strings act as a buffer?

The idea is that you allocate a buffer of characters and whenever you do a concatenation, you keep writing at the end of the buffer. If it isn't big enough, you allocate a bigger one, do a single copy and keep going.

var str = '\n';   // size = 1, ['\n', _, _, _, _, _, _, _]
str += ' * ';     // size = 4, ['\n', ' ', '*', ' ', _, _, _, _]
 
// The next one doesn't fit so we need to alloc a new buffer and do a full copy
str += 'Nutella'; // size = 11, ['\n', ' ', '*', ' ', 'N', 'u', 't', 'e', 'l', 'l', 'a', _, _, _, _, _]
str += '\n';      // size = 12, ['\n', ' ', '*', ' ', 'N', 'u', 't', 'e', 'l', 'l', 'a', '\n', _, _, _, _]

Performance-wise, this is what we want, but there's one problem...

Aliasing

You can assign the string to a variable and assign another variable with that variable. For example:

var str = '\n';
var str2 = str; // <-- here we make an alias
<pre>
 
In this case, both <code>str</code> and <code>str2</code> are pointing to the same <code>'\n'</code> string. In the compiler literature this is called aliasing.
 
The big question is what happens if you try to update one of the variable:
 
<pre lang="javascript">
str += ' * ';

If you look at the JavaScript specification, strings are immutable meaning that you expect str2 to be unchanged but str to be:

str2 == '\n'
str  == '\n * '

Unfortunately, if you mutate the string like in the above solution, then both of them would be '\n * because they both point to the same underlying storage.

Solution 3: Linear Types

If you've not been living under a rock, you probably have heard about Rust and linear types. This is a fancy name to say that you cannot have aliasing: there's only a single variable that can point to a value at all time.

What this means in this case is that the line var str2 = str; would be illegal. If you want to do that, you need to do a full copy of the value so it's effectively a different one.

In practice, aliasing happens all the time in normal programs, for example calling a function with a string as argument is a form of aliasing. We wouldn't want to do full copies every time aliasing is happening.

Rust is getting away with it using a concept calling "borrowing" where you can create an alias if the compiler can guarantee that the previous variable cannot be accessed during the lifetime that the alias exists.

In my understanding, you need a strong type system in order to properly enforce those guarantees and in dynamic languages like JavaScript you would have to be too pessimistic and do way more copies than necessary when you are just passing the variable around, ruining the wins you get from building the string in the first place.

Solution 4: Size in the variable

Aliasing is usually a dealbreaker because you can mutate the underlying storage that another variable could observe. But in this particular case we can exploit the fact that the only mutation we care about is appending something at the end.

So, in the variable we not only keep a pointer to the buffer but also the size we care about. If someone else appends something at the end, it will not affect us because what's in the buffer for that size didn't change.

var str = '\n';
// buffer1, size = 1, ['\n', _, _, _, _, _, _, _]
// str : size = 1, buffer = buffer1
 
var str2 = str;
// str2: size = 1, buffer = buffer1
 
str += ' * ';
// buffer1, size = 4, ['\n', ' ', '*', ' ', _, _, _, _]
// str : size = 4, buffer = buffer1

At this point, str2 points to buffer1 with '\n * ' but because it has size = 1 then we know it really is '\n' as intended.

The only edge case to consider is if you are trying to also concatenate str2. If the size of the variable is not equal to the size of the underlying buffer, this means that someone else clobbered the buffer. In this case, our only option is to do a full copy.

str2 += '|';
// buffer2, size = 8, ['\n', '|', _, _, _, _, _, _]
// str2: size = 2, buffer = buffer2

Conclusion

Before joining the team, I knew about the string builder pattern but I had no idea that there was so much theory behind this particular problem like aliasing, linear types... I hope that explaining those concepts in terms of JavaScript is helpful to get some insights into what's happening inside of compilers.

If you liked this article, you might be interested in my Twitter feed as well.
 
 

Related Posts

  • October 5, 2011 Javascript Presentation (1)
    The talk is over. Check out the Slides & Video. For several months now I've been surveying my friends and teachers at EPITA and I came to the conclusion that they have absolutly no idea what Javascript really is. In order to help them discover a language that is getting a lot of […]
  • November 4, 2013 Bitwise Truthiness (0)
    In this blog post, I explore another form of truthiness in Javascript. What happens if you use a bitwise operator on a value like 0|value or ~~value. Context We recently turned on the JSHint bitwise rule by default and the following code was caught. var isValid = false; for […]
  • August 23, 2011 Javascript – Hook Technique (5)
    Let's go back 5 years ago during the World of Warcraft beta. I was working on Cosmos UI, a projects that aimed to improve the World of Warcraft interface. As interface modification was not officially supported by Blizzard, we went ahead and directly modify the game files written in […]
  • October 28, 2011 JSPP – Morph C++ Into Javascript (Paper) (0)
    6 months ago, I wrote the blog post "JSPP - Morph C++ into Javascript". My supervisor at the LRDE (R&D Lab of EPITA), Didier Verna, found it interesting and told me that it could be worth a publication. With his great help, I've written my first article. We have submitted it to two […]
  • April 5, 2013 Javascript – Private methods are not really private (3)
    In 2001, Douglas Crockford evangelized private members design pattern and since then has been widely adopted by the Javascript community. However, are they really private? Javascript is an extremely dynamic language in which it is possible to access a lot of things on way or another. […]