Write out Unicode in Octal

This is just a brief note about something I realised recently. I've been working a lot with UTF-8 byte streams and Unicode characters, and I've come to the conclusion that writing them out in hexadecimal is completely wrong. It's a form of obfuscation.

I'll walk through an example, which should explain very quickly why I feel we should be using octal. Take the Unicode codepoint U+2B776, which has this UTF-8 encoding in hexadecimal:

0x2B776 F0 AB 9D B6

Obviously, right? This is the same thing in octal:

0533566 360 253 235 266
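
For anyone who wants to reproduce the two lines above, here is a minimal Python sketch. The helper name `octal_dump` is made up for illustration; it relies only on Python's built-in UTF-8 encoder and octal formatting.

```python
def octal_dump(codepoint: int) -> str:
    """Return the codepoint and its UTF-8 bytes, all written in octal."""
    encoded = chr(codepoint).encode("utf-8")
    # Three octal digits per byte, zero-padded, so every byte lines up.
    byte_text = " ".join(f"{b:03o}" for b in encoded)
    # A leading 0 marks the codepoint itself as octal, as in the text.
    return f"0{codepoint:o} {byte_text}"


print(octal_dump(0x2B776))  # prints: 0533566 360 253 235 266
```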

The first octal digit of a leading byte is 3, of a continuation byte is 2, and of an ASCII byte is 1 or 0. We only really need to worry about the other digits. Now, compare the other digits of each continuation byte with the octal of the codepoint.
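
The first-digit rule can be sketched in a few lines of Python. The function name `byte_kind` is hypothetical, and the sketch only classifies a byte; it does not check that the byte is legal in UTF-8 (for instance, bytes of the form 37x never appear in valid UTF-8).

```python
def byte_kind(byte: int) -> str:
    """Classify a byte by the first of its three octal digits."""
    first_digit = byte >> 6  # the top two bits of the byte
    if first_digit == 3:
        return "leading"
    if first_digit == 2:
        return "continuation"
    return "ascii"  # first digit is 0 or 1


for b in (0o360, 0o253, 0o235, 0o266, 0o044):
    print(f"{b:03o}", byte_kind(b))
```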

0533566 360 2 53 2 35 2 66

Well, that's pretty straightforward. This is because the lower six bits of a UTF-8 byte contain the whole of the value contributed to the codepoint, and each octal digit represents three bits, so two octal digits hold this value exactly.
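
As a rough illustration of that point, the Python sketch below masks off the lower six bits of each continuation byte and prints them as two octal digits. The name `continuation_digits` is an assumption made for this example.

```python
def continuation_digits(byte: int) -> str:
    """Return the two octal digits a continuation byte contributes."""
    if byte >> 6 != 2:
        raise ValueError(f"{byte:03o} is not a continuation byte")
    # 0o77 is binary 111111: the lower six bits, i.e. two octal digits.
    return f"{byte & 0o77:02o}"


tail = bytes([0o253, 0o235, 0o266])  # the continuation bytes of U+2B776
print("".join(continuation_digits(b) for b in tail))  # prints: 533566
```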

What about the leading byte? Okay, that's slightly trickier, although still easier than it would be in hexadecimal. As well as contributing a few bits to the codepoint, a leading byte also indicates how many bytes are expected in the encoding of this codepoint (the rest of which will be continuation bytes).

So here are the rules, given a leading byte 3xx (in octal):

If the middle digit is < 4 (i.e. 0, 1, 2, 3) then we count it, as we would in a continuation byte.
Otherwise (i.e. 4, 5, 6) we count only the last digit, but if the middle digit is 5 (i.e. is odd) then a 1 comes first.
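
The two rules above can be written down directly on the octal digits. This is only a sketch: `leading_digits` is a made-up name, and it deliberately works on the printed digits rather than on bits, to mirror how one would read the byte by eye.

```python
def leading_digits(byte: int) -> str:
    """Return the octal digits a leading byte 3xx contributes to the codepoint."""
    text = f"{byte:03o}"
    if text[0] != "3" or text[1] == "7":
        raise ValueError(f"{text} is not a UTF-8 leading byte")
    middle, last = text[1], text[2]
    if middle in "0123":
        return middle + last  # count both digits, as in a continuation byte
    if middle == "5":
        return "1" + last     # odd middle digit: a 1 comes first
    return last               # middle digit 4 or 6: only the last digit counts


for b in (0o302, 0o342, 0o352, 0o360):
    print(f"{b:03o}", "->", leading_digits(b))
```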

You can often get away without knowing what the length of the sequence should be, but if you have to know then just look at the middle digit: 0-3 is 2-byte, 4-5 is 3-byte, and 6 is 4-byte.
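
The length rule is just as short in code. In this sketch the name `sequence_length` is hypothetical, and the middle digit is pulled out by shifting away the last octal digit (three bits) and masking.

```python
def sequence_length(leading: int) -> int:
    """Return the total byte count announced by a leading byte 3xx."""
    if leading >> 6 != 3:
        raise ValueError(f"{leading:03o} is not a leading byte")
    middle = (leading >> 3) & 0o7  # the middle octal digit
    if middle <= 3:
        return 2
    if middle <= 5:
        return 3
    if middle == 6:
        return 4
    raise ValueError(f"{leading:03o} does not occur in UTF-8")


print([sequence_length(b) for b in (0o302, 0o342, 0o352, 0o360)])  # prints: [2, 3, 3, 4]
```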

With a little practice these are easy to learn. Much easier than learning the hexadecimal ones. Here are some examples, taken from Wikipedia (where they're written out in hexadecimal, making them a lot harder to read).

044      044
0242     3 02  2 42
020254   34 2  2 02  2 54
0201510  36 0  2 20  2 15  2 10

Edit: By request, here is an extra example of 35x.

0120254 35 2 2 02 2 54
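
To check the worked examples, here is a small Python sketch that reads a sequence written as octal bytes using only the digit rules from this post, then compares the result with Python's own UTF-8 decoder. The name `read_octal_utf8` is an assumption, and the function does no validation: it trusts that its input is one well-formed sequence.

```python
def read_octal_utf8(octal_bytes: str) -> int:
    """Decode one UTF-8 sequence given as space-separated 3-digit octal bytes."""
    first, *rest = octal_bytes.split()
    if first[0] in "01":
        digits = first                 # ASCII: the byte is the codepoint
    else:
        middle, last = first[1], first[2]
        if middle in "0123":
            digits = middle + last     # 2-byte leader: both digits count
        elif middle == "5":
            digits = "1" + last        # 35x: a 1 comes first
        else:
            digits = last              # 34x or 36x: last digit only
    for byte in rest:
        digits += byte[1:]             # drop the leading 2 of each continuation byte
    return int(digits, 8)


for text in ("044", "302 242", "342 202 254", "352 202 254", "360 220 215 210"):
    raw = bytes(int(b, 8) for b in text.split())
    assert read_octal_utf8(text) == ord(raw.decode("utf-8"))
    print(text, "->", f"0{read_octal_utf8(text):o}")
```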

I can now quite happily read UTF-8 sequences as Unicode codepoints in my head (something I had to do a lot of while working on a UTF-8 library), and with a little thought can write out codepoints in UTF-8 too. But only in octal! If I had to do it in hexadecimal then I wouldn't have a clue.
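
Going the other way, from codepoint to UTF-8, can also be done on the octal digits alone. The sketch below is one possible way to spell that out; `write_octal_utf8` is a hypothetical name, the range boundaries are the usual UTF-8 ones written in octal, and surrogates and values above U+10FFFF are not rejected.

```python
def write_octal_utf8(codepoint: int) -> str:
    """Encode a codepoint as UTF-8, written as space-separated octal bytes."""
    if codepoint < 0o200:                   # ASCII: one byte, first digit 0 or 1
        return f"{codepoint:03o}"
    if codepoint < 0o4000:                  # 2 bytes: 3xx 2xx
        d = f"{codepoint:04o}"
        lead, tail = "3" + d[:2], d[2:]
    elif codepoint < 0o200000:              # 3 bytes: 34x or 35x, then two 2xx
        d = f"{codepoint:06o}"
        lead = "3" + ("5" if d[0] == "1" else "4") + d[1]
        tail = d[2:]
    else:                                   # 4 bytes: 36x, then three 2xx
        d = f"{codepoint:07o}"
        lead, tail = "36" + d[0], d[1:]
    # Each pair of remaining digits becomes one continuation byte 2xx.
    pairs = [tail[i:i + 2] for i in range(0, len(tail), 2)]
    return " ".join([lead] + ["2" + p for p in pairs])


print(write_octal_utf8(0x2B776))  # prints: 360 253 235 266
```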

So please, to make things easier on us humans, write out Unicode in octal.

Write out Unicode in Octal – Lubutu
