Unicode

Tags: haskell technical

A little background, or, How did we reach here

In chapter 5 of Real World Haskell, the implementation is to create a JSON library. Everything goes fine until the authors start with something called “Astral” representation of characters. The general consensus of the readers is that the authors have gone from 0 to 120 in like 2-3 code blocks (“The author shows his muscles in this chapter. Not funny at all.” someone concluded in this funny statement), and truth be told I was one of the people who had this idea.

Shielded away from the guts of how the encodings really work, because… libraries, I realised I had to take a detour in order to understand what exactly was going on in those code blocks in order to implement a library myself.

So the problem arose because JSON (and JavaScript till ES5) does not uses the now ubiquitous UTF-8 encoding to… encode characters, but something called UCS-2, which is a 2byte (16bits) fixed length encoding, which basically means that the characters are encoded in a uniform fashion and therefore the encoding can only show \(2^{16}\) or 65536 characters. If you include the characters we are using these days (Emojis ๐Ÿ˜‰ and symbols โš”), then 65536 is too low a number.

Enter Unicode

Unicode is basically a mapping between a number called “code point” and the character, e.g. U+20AC is the code point for the Euro Symbol . The Unicode standard provides 1,112,064 valid code points, and values from U+0000 to U+10FFFF, which encode all the symbols that we have around from languages, music, maths, sciences, etc. while still leaving enough space for more symbols that we might make up. It is because of this huge number that we had hit a wall while using the fixed length encoding UCS-2.

The standard divides the characters in 17 planes, each with 65534 characters. The first plane, Plane 0, is called Basic Multilingual Plane (BMP) which “contains characters for almost all modern languages, and a large number of symbols.”1. The others above it are (jokingly) called Astral Planes, and like our universe, most of these are pretty empty.

UCS-2 can show all the characters of BMP. But not the symbols from the Astrals.

The standard provides 3 ways of encoding all the characters: UTF-8, UTF-16, and UTF-32. Out of these, the first two are “variable length” and the last is fixed length (like UCS-2).

UTF-32

We can quickly end the discussion of UTF-32 by stating that each character in this encoding takes 32bits. Therefore, we have the option of displaying \(2^{32}\) characters, which is a lot more than the number of code points we really have.

The remaining two are more interesting encodings. Let’s go over them:

UTF-8

This is a variable length encoding, and 8 in the name signifies that we use 1, 2, 3, or 4 8bit bytes to encode the characters. Therefore the size of each character may range from 1byte to 4bytes. Each character has a leading bit, which is the first bit of the encoding.

0 Leading Bit

0 as the leading bit means that the encoding is storing an ASCII character: between U+0000 to U+007F. This takes a single byte of the form: 0xxxxxxx. So we have 7bits for the data, enough to show \(2^7\) or \(127\) characters in ASCII, which makes UTF-8 backwards compatible, and is one of the reasons of its rise.

1 Leading Bit

1 as the leading bit means we are using variable number of bytes, and the number of =1=s denote the number of bytes being used for encoding the current character.

Since we are already using 1byte for ASCII characters, we can only use 2, 3, or 4 bytes for the next set of characters.

2 Bytes

This encodes the characters between U+0080 to U+07FF and is of the the form: 110xxxxx 10xxxxxx. Therefore, we have 11bits for encoding the characters for the given range. Notice that the form has two leading 1 and total of 2 bytes.

3 Bytes

This encodes the characters between U+0800 to U+FFFF and is of the the form: 1110xxxx 10xxxxxx 10xxxxxx. Therefore, we have 16bits for encoding the characters for the given range. Again, the form has three leading 1 and total of 3 bytes.

4 Bytes

This encodes the characters between U+10000 to U+1FFFFF and is of the the form: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Therefore, we have 21bits for encoding the characters for the given range. Four leading 1 and total of 4 bytes.

As an example, this is how the wink character ๐Ÿ˜‰ (U+1F609) is encoded:

  • The code point is 1F609, which states that this will use 4 bytes.
  • Convert this to binary, we get: 11111011000001001 which has a length of 17.
  • Since we have 21bits to show data, we prepend 4 zeros in the front to make it 21bits: 000011111011000001001
  • We put the bits in the template for 4byte encoding: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx where each x is replaced by the corresponding bit of binary 000011111011000001001
  • End result (with spaces added for clarity) is: 11110 000:10 011111:10 011000:10 001001

UTF-16

This is the second variable length encoding option of the Unicode standard, and as the name states, it uses 16bit words (either 1 or 2) to encode the character.

For the first 65536 characters (U+0000 to U+D7FF and U+E000 to U+FFFF), which are enough to show the characters from BMP we use the first byte. In this regard, this is exactly the same as UCS-2. Had the story ended here, we would not have that astral code block in the book. But it turns out that now days we are interested in a lot more symbols โš” and Emojis ๐Ÿ˜‰. Therefore we need to show the characters which have the values between U+10000 to U+10FFFF.

If you notice above, we have a gap between U+D800 to U+DFFF: a gap of 2 bytes. This gap is used to create surrogate pairs, where the first byte (values: U+D800 to U+DBFF) and second byte (values: U+DCOO to U+DFFF) are called High and Low surrogate respectively. The unique combination of two bytes allows us to have 1024 x 1024 = 1,048,567 more values. Good enough.

The thing about surrogate pairs is that the standard forbids the use of these values for anything other than encoding the values between U+10000 to U+10FFFF so we can be assured that there will not be any clashes.

The algorithm for converting these values is:

  • 0x010000 is subtracted from the code point, leaving a 20-bit number in the range 0..0x0FFFFF.
  • The top ten bits (a number in the range 0..0x03FF) are added to 0xD800 to give the first 16-bit code unit or high surrogate, which will be in the range 0xD800..0xDBFF.
  • The low ten bits (also in the range 0..0x03FF) are added to 0xDC00 to give the second 16-bit code unit or low surrogate, which will be in the range 0xDC00..0xDFFF.

Once this theory is clear, I came to understand the following astral function in the book:

hexEscape :: Char -> Doc
hexEscape c | d < 0x10000 = smallHex d
            | otherwise   = astral (d - 0x10000)
  where d = ord c

astral :: Int -> Doc
astral n = smallHex (a + 0xd800) <> smallHex (b + 0xdc00)
    where a = (n `shiftR` 10) .&. 0x3ff
          b = n .&. 0x3ff

Phew! A trip down the Rabbit Hole.