In the last part, I discussed how Unicode is a consistent naming scheme for graphemes, how character encodings such as UTF-8 map Unicode code points to bits, and how fonts describe how code points should be visually displayed. In this part, I discuss the specific things you need to know about using Unicode in Java code.
Java primitives and Unicode
The two most commonly used character encodings for Unicode are UTF-8 and UTF-16. Java uses UTF-16 for char values, and as a result for Strings, since these are just an object wrapper for a char array. UTF-8 is most commonly used when writing files, particularly XML. UTF-16 stores nearly all characters in 16 bits, even the ones that could be stored in only 8 bits (e.g., characters in the ASCII range). UTF-8 uses a variable-length encoding scheme that stores ASCII-range characters in 1 byte and other characters in 2 to 4 bytes, depending on the character. (The original UTF-8 design allowed sequences of up to 6 bytes, but the current standard caps them at 4.) For example, the letter "a" (Latin small letter a, U+0061) is represented with 8 bits; "á" (Latin small letter a with acute, U+00E1) is represented with 16 bits; and our beloved snowman (☃, U+2603) is represented with 24 bits. As I mentioned before, files encoded using ASCII can be read as if they were encoded using UTF-8, and files written using UTF-8 that only contain characters in the ASCII range can be read by Unicode-ignorant programs as if they were ASCII (usually). UTF-16 is also a variable-width encoding, but it works in increments of 16 bits instead of 8: most characters take one 16-bit unit, and the rest take two.
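To make the sizes above concrete, here is a small sketch that encodes the three example characters and counts the resulting bytes. The class and method names are my own, and it assumes Java 7 or later for `StandardCharsets`. UTF-16BE is used for the UTF-16 count so that no byte order mark is included in the total.

```java
import java.nio.charset.StandardCharsets;

class EncodedLengths {
    // Number of bytes the string occupies when encoded as UTF-8.
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes the string occupies as UTF-16 (big endian, no byte order mark).
    static int utf16Length(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    public static void main(String[] args) {
        // "a", "a with acute", and the snowman, written as escapes so the
        // source file's own encoding does not matter.
        String[] samples = {"a", "\u00E1", "\u2603"};
        for (String s : samples) {
            System.out.printf("U+%04X: %d byte(s) in UTF-8, %d byte(s) in UTF-16%n",
                    (int) s.charAt(0), utf8Length(s), utf16Length(s));
        }
    }
}
```

All three characters take 2 bytes in UTF-16, while UTF-8 needs 1, 2, and 3 bytes respectively.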
From bytes to Strings
The character encoding describes how to map a byte array (byte[]) to a char array (char[]), and vice versa. Strings are just wrappers around char arrays, so this applies to Strings also. The important part of the mapping is how it handles cases where more than one byte in the array maps to a single char value. This allows a char to represent any Unicode code point from U+0000 to U+FFFF. This range is known as the Basic Multilingual Plane and covers nearly every language that a general-purpose Java application can be expected to support. Code points above U+FFFF do not fit in a single char and are stored as a pair of chars instead. If your app needs to support Cuneiform or Phoenician, you probably need to read something other than a blog post.
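As a rough illustration of that mapping, the sketch below decodes the three UTF-8 bytes of the snowman into a single char and encodes it back again. The class and helper names are hypothetical, and it assumes Java 7 or later for `StandardCharsets`. It also decodes the same bytes as ISO-8859-1 to show what happens when the wrong encoding is named.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class BytesToString {
    // Decode bytes to a String, naming the encoding explicitly.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Encode a String back to bytes, again naming the encoding explicitly.
    static byte[] encodeUtf8(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Decode the same bytes as Latin-1, where every byte becomes one char.
    static String decodeLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // The UTF-8 encoding of U+2603 (snowman): three bytes.
        byte[] snowman = {(byte) 0xE2, (byte) 0x98, (byte) 0x83};

        String text = decodeUtf8(snowman);
        System.out.println("chars after UTF-8 decode: " + text.length());
        System.out.println("round trip matches: " + Arrays.equals(snowman, encodeUtf8(text)));

        // Wrong charset: the three bytes turn into three unrelated chars.
        System.out.println("chars after Latin-1 decode: " + decodeLatin1(snowman).length());
    }
}
```

Passing the charset explicitly avoids depending on the platform default, which can differ from machine to machine.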
Every Java implementation must support US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16 (with byte order mark). US-ASCII and UTF-8 you should recognize. ISO-8859-1 is commonly referred to as Latin-1 and is usually used when only "Western European" languages need to be supported. It's related to the Windows-1252 encoding used by default on older Windows OSes. UTF-16BE and UTF-16LE encode as big endian or little endian respectively, which can give a speedup on certain platforms. The default UTF-16 scheme writes the code point U+FEFF as the first two bytes of a document (called the byte order mark), and the order of those two bytes determines whether the rest of the document is big endian or little endian.
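The difference between the three UTF-16 variants is easiest to see in the bytes themselves. This is a minimal sketch with names of my own choosing, assuming Java 7 or later for `StandardCharsets`. It encodes the letter "a" three ways and prints the bytes in hex, then decodes a little-endian document that starts with a byte order mark.

```java
import java.nio.charset.StandardCharsets;

class Utf16ByteOrder {
    // Render bytes as space-separated two-digit hex, e.g. "FE FF 00 61".
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < bytes.length; i++) {
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", bytes[i] & 0xFF));
        }
        return sb.toString();
    }

    // Decode with the plain UTF-16 charset, which reads a leading byte order mark.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        System.out.println("UTF-16:   " + hex("a".getBytes(StandardCharsets.UTF_16)));
        System.out.println("UTF-16BE: " + hex("a".getBytes(StandardCharsets.UTF_16BE)));
        System.out.println("UTF-16LE: " + hex("a".getBytes(StandardCharsets.UTF_16LE)));

        // FF FE is the byte order mark in little-endian order, followed by "a".
        byte[] littleEndianDoc = {(byte) 0xFF, (byte) 0xFE, 0x61, 0x00};
        System.out.println("decoded: " + decodeUtf16(littleEndianDoc));
    }
}
```

When encoding, Java's plain UTF-16 charset writes a big-endian byte order mark (FE FF) before the data, while the BE and LE variants write no mark at all.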
However, most Java implementations support a lot more. For instance, MacOS X Java 6 supports: Big5, Big5-HKSCS, EUC-JP, EUC-KR, GB18030, GB2312, GBK, IBM-Thai, IBM00858, IBM01140, IBM01141, IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, IBM01148, IBM01149, IBM037, IBM1026, IBM1047, IBM273, IBM277, IBM278, IBM280, IBM284, IBM285, IBM297, IBM420, IBM424, IBM437, IBM500, IBM775, IBM850, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM918, ISO-2022-CN, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-KR, ISO-8859-1, ISO-8859-13, ISO-8859-15, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, JIS_X0201, JIS_X0212-1990, KOI8-R, KOI8-U, MacRoman, Shift_JIS, TIS-620, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-8, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, windows-31j, x-Big5-Solaris, x-euc-jp-linux, x-EUC-TW, x-eucJP-Open, x-IBM1006, x-IBM1025, x-IBM1046, x-IBM1097, x-IBM1098, x-IBM1112, x-IBM1122, x-IBM1123, x-IBM1124, x-IBM1381, x-IBM1383, x-IBM33722, x-IBM737, x-IBM834, x-IBM856, x-IBM874, x-IBM875, x-IBM921, x-IBM922, x-IBM930, x-IBM933, x-IBM935, x-IBM937, x-IBM939, x-IBM942, x-IBM942C, x-IBM943, x-IBM943C, x-IBM948, x-IBM949, x-IBM949C, x-IBM950, x-IBM964, x-IBM970, x-ISCII91, x-ISO-2022-CN-CNS, x-ISO-2022-CN-GB, x-iso-8859-11, x-JIS0208, x-JISAutoDetect, x-Johab, x-MacArabic, x-MacCentralEurope, x-MacCroatian, x-MacCyrillic, x-MacDingbat, x-MacGreek, x-MacHebrew, x-MacIceland, x-MacRomania, x-MacSymbol, x-MacThai, x-MacTurkish, x-MacUkraine, x-MS932_0213, x-MS950-HKSCS, x-mswin-936, x-PCK, x-SJIS_0213, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM, x-windows-50220, x-windows-50221, x-windows-874, x-windows-949, x-windows-950, x-windows-iso2022jp.
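If you want to see what your own JVM supports, the standard `Charset` class can tell you. The sketch below is only one way to do it and the class and method names are hypothetical. It lists every available charset by canonical name and checks for the six that are required everywhere. The list you get will vary by vendor, version, and platform.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;

class ListCharsets {
    // The six encodings every Java implementation is required to support.
    static final String[] REQUIRED = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    // Canonical names of every charset this JVM can use, in sorted order.
    static List<String> supportedCharsetNames() {
        return new ArrayList<String>(Charset.availableCharsets().keySet());
    }

    // True if all six required charsets are available.
    static boolean hasAllRequired() {
        for (String name : REQUIRED) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> names = supportedCharsetNames();
        System.out.println(names.size() + " charsets available:");
        for (String name : names) {
            System.out.println("  " + name);
        }
        System.out.println("all required charsets present: " + hasAllRequired());
    }
}
```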
In the next part, I'll discuss using Readers and Writers with Unicode.