Unicode in Java: introduction (part 1)

The bad old days

A long time ago, things were much easier for programmers. The only computers anyone cared about were in the US, and these computers only needed to render "normal" letters like "a" and "Q". Then the internet came along, and we realized that there were all of these other people in the world who had other languages with crazy letters like ð and ß and བོ, and even symbols that represent entire words like 中 and 말.

Back then, most programmers only needed to worry about [0-9a-zA-Z], most commonly represented as ASCII. Each character was encoded in 7 bits and padded with one extra bit to make an 8-bit sequence, so only 128 characters could be represented.

Unfortunately, 8 bits can't represent the thousands of basic units of the languages used throughout the world. We use the word grapheme to describe these basic units because they vary widely between languages. For example, in English this could be a letter like "A" and in Chinese it could be an ideograph like 中. Before Unicode, there were dozens of other schemes in common use that each covered a different subset of the problem, but none of them provided a unified approach. For example, ISO 8859-1 and ISO 8859-2 were commonly used for Western and Central European languages that use diacritics (commonly called "accented" characters); ISO 8859-7 for Greek; KOI8, ISO 8859-5, and CP1251 for Cyrillic alphabets (e.g., Russian and Ukrainian); EUC and Shift-JIS for Japanese; BIG5 for traditional Chinese characters (Taiwan); GB for simplified Chinese characters (China).

If you wanted to mix these together in the same text string, good luck.
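
To see why mixing these schemes was so painful, here is a small sketch that decodes the very same byte under three of the legacy encodings mentioned above. The class and method names are made up for illustration, and it assumes your JVM ships the ISO-8859-5 and ISO-8859-7 charsets (common JDK builds do, but only ISO-8859-1 is guaranteed).

```java
import java.nio.charset.Charset;

class LegacyCharsets {
    // Decode a single byte under the named charset (hypothetical helper).
    static String decode(int b, String charsetName) {
        return new String(new byte[] {(byte) b}, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        int b = 0xE4; // one byte, three different meanings
        String[] names = {"ISO-8859-1", "ISO-8859-7", "ISO-8859-5"};
        for (String name : names) {
            String s = decode(b, name);
            // Print the resulting code point rather than the glyph,
            // so the output does not depend on the console encoding.
            System.out.println(name + ": " + String.format("U+%04X", s.codePointAt(0)));
        }
        // ISO-8859-1: U+00E4 (a with diaeresis)
        // ISO-8859-7: U+03B4 (Greek small delta)
        // ISO-8859-5: U+0444 (Cyrillic small ef)
    }
}
```

The byte alone tells you nothing; you also need to know which scheme the author had in mind, and a single string can only be in one of them.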

Unicode to the rescue

To solve this issue, Unicode and a series of encodings were created. Unicode itself is only a consistent way of naming the graphemes; it does not describe how they should be encoded into a bit pattern.

Each Unicode character is referred to by a hexadecimal number of four to six digits prefixed by "U+", so "A" is represented by U+0041 and described as "LATIN CAPITAL LETTER A", and U+2603 is "SNOWMAN" (not kidding: ☃). ASCII had so few characters that the description of which character is which and the bit encoding of the characters aren't separated. In Unicode they are, so you don't have to describe the Icelandic character ð as "that d with the slash in it", and can instead refer to it by a standardized code, U+00F0. It gets even messier when referring to some Asian languages that share what is essentially the same grapheme, but written in different ways (see Han unification). There are also a significant number of symbol-like things in Unicode, so the casual observer would not be able to tell ☸ (wheel of dharma, U+2638) from ⎈ (helm symbol, U+2388). Unicode makes it very explicit which grapheme is which.

To reiterate, Unicode doesn't describe how the character should be represented in bits (encoded) nor does it describe what the character should actually look like when displayed. It's only providing a mapping between numbers (called code points) like U+0041 and U+2603 and abstract things, like English letters, Chinese ideographs, and snowpersons.
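
As a quick taste of the number-to-name mapping, the sketch below asks the JDK for the Unicode name of a few code points using `Character.getName` (available since Java 7). The `CodePoints` class and its `describe` helper are just illustrative names, not anything from a library.

```java
class CodePoints {
    // Format a code point in the usual U+XXXX notation, followed by its Unicode name.
    static String describe(int codePoint) {
        return String.format("U+%04X %s", codePoint, Character.getName(codePoint));
    }

    public static void main(String[] args) {
        // codePointAt gives the number behind the first character of a string.
        System.out.println(describe("A".codePointAt(0))); // U+0041 LATIN CAPITAL LETTER A
        System.out.println(describe(0x2603));             // U+2603 SNOWMAN
        System.out.println(describe(0x00F0));             // U+00F0 LATIN SMALL LETTER ETH
    }
}
```

Note that nothing here says anything about bytes or pixels: a code point is just an integer with an agreed-upon meaning.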

Character encoding

The next issue is, how do we physically store these Unicode code points as bits? This is referred to as a character encoding, and describes a mapping between the code points and a sequence of bits (although it probably should be referred to as grapheme encoding). In ASCII, each character is stored in 8 bits, but 8 bits limit the number of characters that can be represented to 256. To represent the thousands of Unicode code points, we need to have an encoding that uses more than 8 bits. However, we already have millions of files that are encoded in 8 bits with ASCII. Ideally we'd like our new encoding to be backwards compatible, so we don't have our legacy ASCII files garbled if they were read as if they were in our new encoding. This is where UTF-8 comes in.

UTF-8 is an encoding for Unicode code points, hence the name: UTF stands for Unicode Transformation Format. UTF-8 is known as a variable-length encoding because some code points are represented by 8 bits and others by 16, 24, or 32 bits. The cool thing is that all of the characters which can be represented in ASCII have the same bit encodings in ASCII and UTF-8, so trying to read an ASCII-encoded file as UTF-8 will just work. Trying to read a UTF-8 encoded file as if it were in a legacy 8-bit encoding (as many Unicode-ignorant programs do) results in each byte of a multi-byte character being read as if it were a separate 8-bit character, so instead of a single Chinese character, you get a run of unrelated accented letters and symbols.
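
Here is a minimal sketch of both behaviours: the variable length of UTF-8, and what happens when UTF-8 bytes are misread as a one-byte-per-character encoding. The class and helper names are invented for this example, and ISO-8859-1 stands in for "some Unicode-ignorant program". Non-ASCII characters are written as `\u` escapes so the source file's own encoding doesn't matter.

```java
import java.nio.charset.StandardCharsets;

class Utf8Demo {
    static byte[] utf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Render bytes as space-separated hex, e.g. "E4 B8 AD".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Decode UTF-8 bytes as if every byte were one ISO-8859-1 character.
    static String misread(String s) {
        return new String(utf8(s), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println(hex(utf8("A")));      // 41        (same as ASCII)
        System.out.println(hex(utf8("\u00F0"))); // C3 B0     (eth, 2 bytes)
        System.out.println(hex(utf8("\u4E2D"))); // E4 B8 AD  (U+4E2D, 3 bytes)

        // ASCII bytes decode unchanged as UTF-8.
        byte[] ascii = "aQ".getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.UTF_8)); // aQ

        // One Chinese character turns into three garbage characters.
        System.out.println(misread("\u4E2D").length()); // 3
    }
}
```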

UTF-16 is similar to UTF-8, but instead of encoding characters as multiples of 8 bits, all characters are encoded as multiples of 16 bits. The drawback here is that if the text primarily consists of characters in the ASCII range, it takes up twice the amount of storage space. Also, files which contain mostly ASCII can't be read at all in editors which don't understand UTF-16, whereas the same text in UTF-8 would just have its characters outside of the ASCII range displayed incorrectly.
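
The storage trade-off is easy to check for yourself. The sketch below (names are illustrative) compares encoded sizes for a few strings. It uses `UTF_16BE` rather than `UTF_16` on purpose, since Java's plain `UTF_16` encoder prepends a two-byte byte-order mark that would skew the counts.

```java
import java.nio.charset.StandardCharsets;

class Utf16Size {
    static int utf8Size(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Big-endian UTF-16 without a byte-order mark.
    static int utf16Size(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        String ascii = "hello";
        String chinese = "\u4E2D";                              // U+4E2D
        String beyondBmp = new String(Character.toChars(0x1F600)); // needs two 16-bit units

        System.out.println(utf8Size(ascii) + " vs " + utf16Size(ascii));         // 5 vs 10
        System.out.println(utf8Size(chinese) + " vs " + utf16Size(chinese));     // 3 vs 2
        System.out.println(utf8Size(beyondBmp) + " vs " + utf16Size(beyondBmp)); // 4 vs 4
    }
}
```

So UTF-16 doubles the size of ASCII-heavy text, but can come out ahead for text dominated by characters that need three bytes in UTF-8.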

Fonts

The final piece of this is fonts. A font describes how a character (code point) should be displayed on the screen. Useful fonts contain glyphs people recognize. Before Unicode was prevalent and we could use U+2620 to represent a skull and crossbones (☠), there were fonts like Wingdings that displayed a symbol in place of a letter. For example, "N" in Wingdings is a skull and crossbones, but it's still (technically) an N; it's just that no one would recognize it as such. It's very important to recognize the difference between the code point, the character encoding, and the font describing the visual display.

In the next part, we'll discuss how Unicode and character encodings are used in Java.

Additional Resources

Joel Spolsky's great intro to Unicode in general, which sounds a lot like this post
Jukka K. Korpela's tutorial on character code issues
