
The Tragedy of UCS-2


Apropos of nothing, I'd like to tell you a tale. It's not an original tale, but it's one of my favorites. It's a true tale about standardization, computers, the Unicode consortium, and a few companies.

I'm far from the first person to write about it. However, it's been on my mind lately, and I keep restating it, so I'm formalizing it here in my blog.

A long time ago, at least in computer time, in the far-flung era of 1989, the Unicode working group was really starting to get going. They were hawking their brand new universal character set, one that would end the code page mess once and for all, making characters show up the same on every computer, regardless of its language or operating system.

This time is important because this is when Sun Microsystems, Microsoft, and NeXT (whose NeXTSTEP was later used as the basis for Mac OS X) got on the Unicode bandwagon.

Unicode was a wonderful vision of the future where every computer works great together. They wrote on holy parchment of a character set, encoded in 16 bits, with plenty of room to encode every script in modern use, and then some.

And so, Sun, Microsoft, and NeXT built this new universal character set into their then up-and-coming software APIs. Sun with Java, Microsoft with Win32, and NeXT, with, well, NeXTSTEP. Strings were no longer collections of 8-bit, or heaven forbid, 7-bit characters. No, they were now 16-bit, so they could encode every script and bring this vision of the future to fruition.

You see, this future vision was so great because 16 bits is only double the size of 8 bits, so it was an easy sell to become compatible with every language in the world by simply doubling your character size. It's not that much of a loss of space; it's worth it!

It was not meant to be

If you're reading this in modern times, it's likely you know this is not what happened.

Around 1996, the Unicode consortium decided that no, 16 bits was not enough. They had decided that Unicode should encode not just every script in modern use, but every script humans have ever created. However, all these products had already codified 16-bit characters. They couldn't simply unship their products, go back to 8-bit characters, and use Plan 9's UTF-8.

So... they made a compromise. That compromise is UTF-16.

You see, these programs were made with the expectation 16 bits could cover every character in Unicode, 1:1, without translation. 65536 characters, after all, is a ton! This "encoding", or lack thereof, is called UCS-2. The Universal Character Set, represented using 2 bytes, without transformation.

UTF-8 and UTF-16 are Unicode Transformation Formats, in 8 and 16 bits, respectively. They represent a way of transforming the UCS to fit in various character sizes.

UCS-2 can only represent the first 65536 characters of Unicode, also known as the Basic Multilingual Plane.

A tangent: Planes and blocks

Unicode is split up into planes, and those planes are further split up into blocks.

A block is 256 codepoints. A plane is 256 blocks. Therefore, a plane is 65536 codepoints.

Planes beyond the Basic Multilingual Plane, or BMP, are referred to as supplementary planes, the first of which is, fittingly, called the Supplementary Multilingual Plane.

Supplementary planes, especially those containing high-codepoint Private Use Area characters or emojis, are sometimes referred to as Astral planes, a joke about their often-poor support in programs: programs that were, almost invariably, originally designed around UCS-2.

Right then. Back to what we were talking about

So, these software developers were in a pickle. UCS-2 wasn't enough for all of Unicode anymore, but they couldn't just stop using it after all those third-party developers had started building on their APIs.

As mentioned, the compromise solution was UTF-16. UTF-8, as you may know, encodes arbitrary Unicode codepoints in 1 to 4 bytes, using specially assigned byte values to represent pieces of larger values. A decoder puts these pieces back together to reconstruct a Unicode (or UCS) codepoint.
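To make "pieces of larger values" concrete, here is a minimal sketch of the UTF-8 byte layout in Java. This is purely illustrative (the method name is mine, not any library's); real code should lean on the platform's charset machinery, e.g. String.getBytes(java.nio.charset.StandardCharsets.UTF_8).

```java
// Illustrative UTF-8 encoder: the lead byte signals the length, and each
// continuation byte carries 6 payload bits. Validation (surrogate range,
// codepoints above U+10FFFF) is omitted to keep the shape of the encoding visible.
static byte[] encodeUtf8(int cp) {
    if (cp < 0x80) {                     // 7 bits  -> 1 byte (plain ASCII)
        return new byte[]{ (byte) cp };
    } else if (cp < 0x800) {             // 11 bits -> 2 bytes
        return new byte[]{ (byte) (0xC0 | (cp >> 6)),
                           (byte) (0x80 | (cp & 0x3F)) };
    } else if (cp < 0x10000) {           // 16 bits -> 3 bytes (the whole BMP)
        return new byte[]{ (byte) (0xE0 | (cp >> 12)),
                           (byte) (0x80 | ((cp >> 6) & 0x3F)),
                           (byte) (0x80 | (cp & 0x3F)) };
    } else {                             // up to 21 bits -> 4 bytes (supplementary planes)
        return new byte[]{ (byte) (0xF0 | (cp >> 18)),
                           (byte) (0x80 | ((cp >> 12) & 0x3F)),
                           (byte) (0x80 | ((cp >> 6) & 0x3F)),
                           (byte) (0x80 | (cp & 0x3F)) };
    }
}
```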

To put this plan into action, a large portion of the BMP was dedicated to so-called surrogates: fake non-characters, each representing half of a supplementary-plane codepoint. Two UTF-16 code units, a "surrogate pair" consisting of a high surrogate and a low surrogate in the correct order, represent a character in a supplementary plane, all the way up to U+10FFFF.
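The arithmetic behind a surrogate pair is small enough to show in full. Here's a short fragment in Java (drop it into any main method); the JDK's own Character.toChars performs the same computation.

```java
// Encode a supplementary-plane codepoint (U+1F600, 😀) as a UTF-16 surrogate pair.
// Only codepoints in U+10000..U+10FFFF take this path; BMP codepoints stay as one unit.
int cp = 0x1F600;
int v = cp - 0x10000;                       // at most 20 bits remain after the offset
char high = (char) (0xD800 + (v >> 10));    // top 10 bits    -> high (lead) surrogate
char low  = (char) (0xDC00 + (v & 0x3FF));  // bottom 10 bits -> low (trail) surrogate
System.out.printf("U+%X -> %04X %04X%n", cp, (int) high, (int) low); // U+1F600 -> D83D DE00
```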

The reason they encoded these non-characters into Unicode itself, instead of making them a quirk of the transformation format as UTF-8 does, is that these programs were built for UCS-2; they expected each char to be one codepoint. By giving the surrogates real codepoints, that expectation wasn't technically wrong, and programs could continue to assume UCS-2 without breaking too spectacularly.

However, this ruined the point of UCS-2. We were back where we started: one character was no longer one character, as UCS-2 had promised. So we're stuck with double the character size, and we still need to carefully manipulate our strings and decode/encode them.

This is the worst of both worlds.
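You can see the broken promise directly in Java, whose String API still counts 16-bit code units. A small fragment to illustrate:

```java
String s = "😀";                                      // U+1F600, a supplementary-plane character
System.out.println(s.length());                      // 2 -- UTF-16 code units, not characters
System.out.println(s.codePointCount(0, s.length())); // 1 -- actual codepoints
System.out.println(s.charAt(0) == '\uD83D');         // true -- indexing hands you half a surrogate pair
```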

What about the dream of a fixed character size?

Fixed character sizes live on in the usually-hilariously-impractical UCS-4, a 32-bit (4-byte) mapping of Unicode. (I've used it a few times when I'm too lazy to fumble with UTF-16 or UTF-8.)

It wastes immense amounts of memory, weighing in at, you guessed it, 4 bytes per character. However, it can encode every single Unicode codepoint, from U+000000 to U+10FFFF. (You may notice that's only 21 bits; the remaining 11 bits in UCS-4 are always zero, but few machines have a 24-bit integer type, so that precludes a UCS-3 from existing.)
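In Java terms, the closest everyday equivalent to UCS-4 is an int[] of codepoints: four bytes per character, no matter how small the character is. A quick fragment to illustrate:

```java
String s = "a😀";                             // one BMP character plus one supplementary character
int[] ucs4 = s.codePoints().toArray();        // effectively UCS-4/UTF-32: one int per codepoint
System.out.println(ucs4.length);              // 2 codepoints...
System.out.println(4 * ucs4.length);          // ...occupying 8 bytes, even though "a" needs only one
System.out.printf("U+%06X U+%06X%n", ucs4[0], ucs4[1]); // U+000061 U+01F600
```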

The aftermath

Win32

Windows, being such a popular platform, and therefore having a lot of... not very good developers making software for it, has become somewhat well-known for filenames and other such things containing malformed UTF-16, such as a surrogate pair in the wrong order, or only one half of a surrogate pair, produced by naive string manipulation code that expects UCS-2, not UTF-16. This has resulted in the creation of WTF-8, which stands for Wobbly Transformation Format, not what you thought it stood for. Sheesh.

WTF-8 is UTF-8, but extended to encode malformed (unpaired) UTF-16 surrogates, so these broken strings can be faithfully reproduced on systems that use UTF-8.
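Java strings share that UCS-2 heritage, so it's easy to construct exactly the kind of broken string WTF-8 exists for. A hedged fragment (the exact replacement bytes depend on the encoder's error handling):

```java
// A lone high surrogate: a perfectly constructible 16-bit string,
// but not valid UTF-16 and not representable in standard UTF-8.
String lone = "\uD83D";
System.out.println(lone.length());                              // 1 code unit
System.out.println(Character.isHighSurrogate(lone.charAt(0)));  // true
byte[] bytes = lone.getBytes(java.nio.charset.StandardCharsets.UTF_8);
System.out.println(java.util.Arrays.toString(bytes));           // a replacement, not the surrogate: data lost
// WTF-8's whole job is to extend UTF-8 so a string like this can round-trip byte-for-byte.
```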

Java

Java uses lots of strings. I mean, what doesn't use lots of strings? They're pretty essential.

UTF-16's space wasting has become such a problem that the JVM is gaining the ability to transparently represent strings in memory as Latin-1, a legacy codepage, when possible (the "compact strings" work, JEP 254).

JavaScript inherits Java's UTF-16 conundrum, since it faithfully steal- er, borrows, many things from Java.

Wait, what about Unix/Linux?

Unix and Linux conveniently sidestepped this whole issue by just shrugging their shoulders and going "eh, a char has no defined encoding, it's just a number."

Everyone just kinda went with it, and UTF-8 was slowly adopted. However, this means your program might emit mojibake if it's fed a legacy codepage and you didn't check an LC_* environment variable to be sure, as sketched below.
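For anyone who has never had to write that check, here is roughly what it looks like. This is a hedged sketch: the LC_ALL, then LC_CTYPE, then LANG precedence follows the usual POSIX convention, and the substring test is deliberately crude.

```java
// Roughly what locale-aware Unix tools do before assuming UTF-8:
// consult LC_ALL, then LC_CTYPE, then LANG, and look at the charset suffix.
String locale = System.getenv("LC_ALL");
if (locale == null || locale.isEmpty()) locale = System.getenv("LC_CTYPE");
if (locale == null || locale.isEmpty()) locale = System.getenv("LANG");
boolean utf8 = locale != null
        && (locale.toUpperCase().contains("UTF-8") || locale.toUpperCase().contains("UTF8"));
System.out.println("locale=" + locale + ", assume UTF-8? " + utf8);
```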

And so...

That's a somewhat dramatic and fact-checked retelling of what I've apparently decided to call The Tragedy of UCS-2. I hope it was... interesting?

