Multi-extends in generified types

In Effective Java, I came across a language construct I'd never seen before:

import java.util.Comparator;
import java.util.List;

public class Foo<T extends List & Comparator> {
    <U extends List & Comparator> void foo(U x) { }
}

This declares that T must extend or implement both List and Comparator. I've never had occasion to use this, but I can imagine it would be useful. The example Bloch gives in the book is when T is derived from one class and implements an interface.
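As a sketch of that class-plus-interface case (Base and Printable are made-up names; note that if a class appears in the bound, it must come first):

class Base { }

interface Printable { void print(); }

class Container<T extends Base & Printable> {
    void show(T item) {
        item.print(); // the compiler knows T is both a Base and a Printable
    }
}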

Unicode in Java: some Groovy pieces (part 7)

One of the common tasks Java developers use Groovy for is testing. A common idiom of mine is to create a list of strings and use the "each" method to assert that an output file contains each of them. When testing Unicode, this means both the output files and the Groovy source files contain Unicode characters. For example, the code may contain:

def contents = new File(outputFile).getText("UTF-8")

[ "D'fhuascail Íosa Úrmhac na hÓighe Beannaithe pór Éava agus Ádhaimh",
  'イロハニホヘト チリヌルヲ ワカヨタレソ ツネナラム',
  'เป็นมนุษย์สุดประเสริฐเลิศคุณค่า'
].each { assertTrue(contents.contains(it), "${it} not in ${outputFile}") }

The first point is that we can no longer use the File#text method; we need to use the getText method that takes a character encoding argument.

The second point is that when Java or Groovy source files contain Unicode characters, we must specify what the encoding for those files is. In this case, we've saved our source files in UTF-8. As with the JVM, javac and groovyc will default to the platform default encoding if none is specified, which would give us odd errors when the garbage characters that result from incorrectly decoding the UTF-8 are fed to the compiler.

When I call groovyc from Ant, this is the code I use:

<groovyc srcdir="." includes="com/example/**/*.groovy" destdir="${twork}" encoding="UTF-8">
    <classpath refid="example.common.class.path"/>
</groovyc>
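For the Java sources in the same build, Ant's javac task takes the same encoding attribute; here's a sketch mirroring the call above:

<javac srcdir="." includes="com/example/**/*.java" destdir="${twork}" encoding="UTF-8">
    <classpath refid="example.common.class.path"/>
</javac>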

For more on Groovy and Unicode, Guillaume has an excellent post, "Heads-up on File and Stream groovy methods".

Unicode in Java: bytes and charsets (part 6)

In this part, I'll discuss some of the lower-level APIs for converting byte arrays to characters and a bit more about the Charset and CharsetDecoder classes.

The String class has two constructors that will decode a byte[] using a specified charset: String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). Likewise, it has two instance methods for the opposite direction: byte[] getBytes(String charsetName) and byte[] getBytes(Charset charset). It is almost always wrong to use the String(byte[]) constructor or the byte[] getBytes() method, since these use the default platform encoding. It is nearly always better to choose a consistent encoding to use within your application, typically UTF-8, unless you have a good reason to do otherwise.
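As a quick illustration (DecodeDemo is a made-up name; the byte array is the UTF-8 encoding of the snowman):

import java.io.UnsupportedEncodingException;

public class DecodeDemo {
    public static void main(String[] args) throws UnsupportedEncodingException {
        byte[] snowman = { (byte) 0xE2, (byte) 0x98, (byte) 0x83 }; // U+2603 in UTF-8
        String wrong = new String(snowman);          // platform default; may garble
        String right = new String(snowman, "UTF-8"); // always decodes the snowman
        System.out.println(wrong + " vs. " + right);
    }
}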

In the previous part, we used the Charset class to retrieve the default character encoding. We can also use it to retrieve the Charset instance for a given string name with the static method Charset.forName(String charsetName), e.g., Charset.forName("UTF-8"). In addition to String having methods that take either a string name of the encoding or the Charset instance, most of the Reader classes do too. In my previous examples I showed the version where "UTF-8" is specified, but the better way would be to have a final static attribute that contains the value of Charset.forName("UTF-8") and use this. It eliminates the need to repeatedly look up the Charset, and it prevents a typo in the charset name from creating a hard-to-find bug.
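A minimal sketch of that pattern (TextUtil is a made-up name):

import java.nio.charset.Charset;

public class TextUtil {
    // Look the Charset up once; a typo in the name would fail fast, in one place.
    private static final Charset UTF8 = Charset.forName("UTF-8");

    public static String decode(byte[] bytes) {
        return new String(bytes, UTF8);
    }

    public static byte[] encode(String s) {
        return s.getBytes(UTF8);
    }
}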

The CharsetDecoder class is provided for when you need more control over the decoding process than the String methods provide. This definitely falls into the "advanced" category, so I'm not going to cover it here. Aaron Elkiss has a good writeup, as does the javadoc.

Unicode in Java: sample data (part 5)

When testing Unicode with your application, you need some examples. Most people don't have Thai or Katakana files sitting around, so finding test data is hard.

I've been playing around with JavaScript and jQuery recently, so I thought I'd build a small app that renders Unicode characters from a variety of languages in a variety of scripts. You can cut-and-paste the examples into your own test files, or, since the HTML file contains the characters themselves (instead of HTML escape codes), you could even use the file itself as test data. It even has Klingon :)

unicode_app

Markus Kuhn has a lot of good examples, including "quick brown fox" examples in many languages (unfortunately Chinese is not among them).

Unicode in Java: Default Charset (part 4)

In this part, I will discuss the default Charset and how to change it.

The default character set (technically a character encoding) is set when the JVM starts. Every platform has its own default, but the default can also be configured explicitly. For example, 32-bit Windows XP (English) defaults to "windows-1252", the CP1252 encoding that covers most Western European languages.

The default charset can be printed by calling:

System.out.println(java.nio.charset.Charset.defaultCharset());

When the JVM is started, the default charset can be set with the property "file.encoding", e.g., "-Dfile.encoding=utf-8". Some IDEs will do this automatically, for example, NetBeans uses this property to explicitly set the charset to UTF-8. The drawback to this is that code that uses a class like FileReader that relies on the default encoding may work correctly when handling Unicode in the development environment, but then break when used in an environment that has a different default encoding. The developer should not rely on the user to set the encoding for the code to work correctly.

Also, you might think you could just alter the system property "file.encoding" programmatically. However, it cannot be changed after the JVM starts; by that time, all of the system classes that rely on the value have already cached it.
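A small sketch demonstrating this (FileEncodingDemo is a made-up name):

import java.nio.charset.Charset;

public class FileEncodingDemo {
    public static void main(String[] args) {
        System.out.println(Charset.defaultCharset()); // e.g., UTF-8
        System.setProperty("file.encoding", "ISO-8859-1");
        // No effect: the default charset was cached when the JVM started.
        System.out.println(Charset.defaultCharset()); // same value as before
    }
}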

On Linux/Unix, you can also set the LC_ALL environment variable to affect the default encoding. For example, on one Linux box I have, the default is US-ASCII. When I set "export LC_ALL=en_US.UTF-8", the default encoding becomes UTF-8.

The environment variables LANG and LC_CTYPE have a similar effect (more here).

In summary, the default charset is used by many classes when a character set is not explicitly specified, but this charset should not be relied upon to work correctly when your application is supposed to handle Unicode.

Unicode in Java: Readers and Writers (part 3)

In the previous parts, I've discussed Unicode, encodings, and which encodings Java uses internally. In this part, I'll discuss using Readers and Writers in a Unicode-compliant way. In short: never use FileReader or FileWriter. This is particularly important to understand because none of the Java books I have state it explicitly enough; I only understood it once I encountered the problem in the field.

The various Reader and Writer classes in Java almost never do the correct thing by default. Not because they're badly designed, but because it's largely up to the user to specify what "the correct thing" is. For example, FileReader and FileWriter always use the default character encoding. This varies widely between platforms: Windows XP 32-bit defaults to CP1252 (a variant of ISO-8859-1), many Linuxes default to US-ASCII, and MacOS X defaults to MacRoman. If you expect your users to input Unicode characters, a mismatched default will garble them. It is possible to change the default character encoding (which we'll discuss later), but you shouldn't rely on your users to set their environments up in a certain way, particularly when your users are non-technical.

If your application has control over a set of files, it needs to explicitly specify the character encoding and always use that encoding. Instead of FileReader and FileWriter, you must use InputStreamReader and OutputStreamWriter with the constructors that take a stream and a charset name string, e.g., "UTF-8". This is a bit confusing, since the argument is referred to as a "charset" even though it's technically a character encoding. Here is what the code should look like:

// e.g., a FileInputStream; the encoding is applied by the Reader, not the stream
InputStream istream = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(istream, "UTF-8"));

// e.g., a FileOutputStream
OutputStream ostream = ...;
Writer writer = new OutputStreamWriter(ostream, "UTF-8");

If you're reading and writing files, you can use FileInputStream and FileOutputStream as the InputStream and OutputStream implementations. The *Stream classes only read and write bytes; it's the Reader or Writer that actually applies an encoding to map the bytes to chars or vice versa. You can pretty much just grep your code for FileReader and FileWriter to find the places where Unicode support will break.
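Putting it together, here's a minimal sketch that copies a file line by line with an explicit UTF-8 encoding on both ends (the file names are placeholders):

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8FileCopy {
    public static void main(String[] args) throws IOException {
        // The streams move bytes; the Reader/Writer apply the UTF-8 encoding.
        BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("in.txt"), "UTF-8"));
        Writer writer = new OutputStreamWriter(new FileOutputStream("out.txt"), "UTF-8");
        try {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(line);
                writer.write('\n');
            }
        } finally {
            reader.close();
            writer.close();
        }
    }
}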

The javadoc for these classes isn't much help unless you're already aware of the issues. The FileOutputStream javadoc says "FileOutputStream is meant for writing streams of raw bytes such as image data. For writing streams of characters, consider using FileWriter." This is misleading: if you're naive to the issues with Unicode support, you might think that FileWriter will "just work" if your code expects to handle Unicode. The FileWriter javadoc says "The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable." If you know what that means, you're okay. But a more useful warning would be "This will almost never write anything other than American English correctly, so don't use it!" I say American English because, for example, the British pound symbol £ isn't included in ASCII.

Now, go and find all of the places in your code where this is broken and fix it.

In the next part, I'll discuss more about the default character set.

Unicode in Java: primitives and encodings (part 2)

In the last part, I discussed how Unicode is a consistent naming scheme for graphemes, how character encodings such as UTF-8 map Unicode code points to bits, and how fonts describe how code points should be visually displayed. In this part, I discuss the specific things you need to know about using Unicode in Java code.

Java primitives and Unicode

The two most commonly used character encodings for Unicode are UTF-8 and UTF-16. Java uses UTF-16 for char values, and as a result for Strings, since these are just an object wrapper around a char array. UTF-8 is most commonly used when writing files, particularly XML. UTF-16 stores nearly all characters as a sequence of 16 bits, even those that could be stored in only 8 bits (e.g., characters in the ASCII range). UTF-8 uses a variable-length encoding scheme that stores ASCII-range characters in 8 bits and other characters in 2 to 4 bytes, depending on the character. For example, the letter "a" (Latin small letter a, U+0061) is represented with 8 bits; "á" (Latin small letter A with acute, U+00E1) is represented with 16 bits; and our beloved snowman (☃) is represented with 24 bits. As I mentioned before, files encoded using ASCII can be read as if they were encoded using UTF-8, and files written using UTF-8 that only contain characters in the ASCII range can (usually) be read by Unicode-ignorant programs as if they were ASCII. UTF-16 uses a variable-width encoding similar to UTF-8's, but in increments of 16 bits instead of 8.
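To make the sizes concrete, here's a small sketch (EncodingSizes is a made-up name) that prints the byte counts for those three characters:

import java.io.UnsupportedEncodingException;

public class EncodingSizes {
    public static void main(String[] args) throws UnsupportedEncodingException {
        String[] samples = { "a", "\u00E1", "\u2603" }; // a, á, and the snowman
        for (String s : samples) {
            // Prints 1, 2, and 3 bytes for UTF-8; 2 bytes each for UTF-16BE.
            System.out.println(s + ": UTF-8 uses " + s.getBytes("UTF-8").length
                    + " byte(s), UTF-16BE uses " + s.getBytes("UTF-16BE").length);
        }
    }
}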

From bytes to Strings

The character encoding describes how to map a byte array (byte[]) to a char array (char[]) and vice versa. Strings are just wrappers around char[]s, so this applies to Strings as well. The important part of the mapping is how it handles the cases where more than one byte in the array maps to a single char value. This allows a char to represent any Unicode code point from U+0000 to U+FFFF. This range is known as the Basic Multilingual Plane and includes every language that a general-purpose Java application can be expected to support. If your app needs to support Cuneiform or Phoenician, you probably need to read something other than a blog post.

Encoding support

Every Java implementation must support US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16 (with byte order mark). US-ASCII and UTF-8 you should recognize. ISO-8859-1 is commonly referred to as Latin-1 and is usually used when only "Western European" languages need to be supported; it's related to the windows-1252 encoding used by default on older Windows OSes. UTF-16BE and UTF-16LE encode as big endian or little endian respectively, which can give a speedup on certain platforms. The default UTF-16 scheme includes the code point U+FEFF as the first two bytes of a document (the byte order mark); the order of those bytes determines whether the rest of the document is big endian or little endian.

However, most Java implementations support a lot more. For instance, MacOS X Java 6 supports: Big5, Big5-HKSCS, EUC-JP, EUC-KR, GB18030, GB2312, GBK, IBM-Thai, IBM00858, IBM01140, IBM01141, IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, IBM01148, IBM01149, IBM037, IBM1026, IBM1047, IBM273, IBM277, IBM278, IBM280, IBM284, IBM285, IBM297, IBM420, IBM424, IBM437, IBM500, IBM775, IBM850, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM918, ISO-2022-CN, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-KR, ISO-8859-1, ISO-8859-13, ISO-8859-15, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, JIS_X0201, JIS_X0212-1990, KOI8-R, KOI8-U, MacRoman, Shift_JIS, TIS-620, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-8, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, windows-31j, x-Big5-Solaris, x-euc-jp-linux, x-EUC-TW, x-eucJP-Open, x-IBM1006, x-IBM1025, x-IBM1046, x-IBM1097, x-IBM1098, x-IBM1112, x-IBM1122, x-IBM1123, x-IBM1124, x-IBM1381, x-IBM1383, x-IBM33722, x-IBM737, x-IBM834, x-IBM856, x-IBM874, x-IBM875, x-IBM921, x-IBM922, x-IBM930, x-IBM933, x-IBM935, x-IBM937, x-IBM939, x-IBM942, x-IBM942C, x-IBM943, x-IBM943C, x-IBM948, x-IBM949, x-IBM949C, x-IBM950, x-IBM964, x-IBM970, x-ISCII91, x-ISO-2022-CN-CNS, x-ISO-2022-CN-GB, x-iso-8859-11, x-JIS0208, x-JISAutoDetect, x-Johab, x-MacArabic, x-MacCentralEurope, x-MacCroatian, x-MacCyrillic, x-MacDingbat, x-MacGreek, x-MacHebrew, x-MacIceland, x-MacRomania, x-MacSymbol, x-MacThai, x-MacTurkish, x-MacUkraine, x-MS932_0213, x-MS950-HKSCS, x-mswin-936, x-PCK, x-SJIS_0213, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM, x-windows-50220, x-windows-50221, x-windows-874, x-windows-949, x-windows-950, x-windows-iso2022jp.
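To see exactly what your own JVM supports, you can list the available charsets (a minimal sketch; ListCharsets is a made-up name):

import java.nio.charset.Charset;

public class ListCharsets {
    public static void main(String[] args) {
        // Prints the canonical name of every charset this JVM supports.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
    }
}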

In the next part, I'll discuss using Readers and Writers with Unicode.

Unicode in Java: introduction (part 1)

The bad old days

A long time ago, things were much easier for programmers. The only computers anyone cared about were in the US, and these computers only needed to render "normal" letters like "a" and "Q". Then the internet came along, and we realized that there were all of these other people in the world that had other languages with crazy letters like ð and ß and བོ, and even symbols that represent entire words like 中 and 말.

Back then, most programmers only needed to worry about [0-9a-zA-Z], most commonly represented as ASCII. Each character was encoded in 7 bits (padded with one extra bit to make an 8-bit sequence), so only 128 characters in total could be represented.

Unfortunately, 8 bits can't represent the thousands of basic units of language used throughout the world. We use the word grapheme to describe these basic units because they vary widely between languages. For example, in English this could be a letter like "A" and in Chinese it could be an ideograph like 中. Before Unicode, there were dozens of other schemes in common use that covered different subsets of the problem, but none provided a unified approach. For example, ISO 8859-1 and ISO 8859-2 were commonly used for Western European languages that use diacritics (commonly called "accented" characters); ISO 8859-7 for Greek; KOI-8, ISO 8859-5, and CP1251 for Cyrillic alphabets (e.g., Russian and Ukrainian); EUC and Shift-JIS for Japanese; BIG5 for traditional Chinese characters (Taiwan); GB for simplified Chinese characters (China).

If you wanted to mix these together in the same text string, good luck.

Unicode to the rescue

To solve this issue, Unicode and a series of encodings were created. Unicode is only a consistent way of naming the graphemes and does not describe how they should be encoded into a bit pattern.

Each Unicode character is referred to by a hexadecimal number (four or more digits) prefixed with "U+", so "A" is represented by U+0041 and described as "LATIN CAPITAL LETTER A", and U+2603 is "SNOWMAN" (not kidding: ☃). ASCII had so few characters that the description of which character is which and the bit encoding of the characters weren't separated. In Unicode they are, so you don't have to describe the Icelandic character ð as "that d with the slash in it", and can instead refer to it by a standardized code, U+00F0. It gets even messier with some Asian languages that share what is essentially the same grapheme but write it in different ways (see Han unification). There are also a significant number of symbol-like things in Unicode, so the casual observer would not be able to tell ☸ (wheel of dharma, U+2638) from ⎈ (helm symbol, U+2388). Unicode makes it very explicit which grapheme is which.

To reiterate, Unicode doesn't describe how a character should be represented in bits (encoded), nor does it describe what the character should actually look like when displayed. It only provides a mapping between numbers (called code points) like U+0041 and U+2603 and abstract things, like English letters, Chinese ideographs, and snowpersons.

Character encoding

The next issue is: how do we physically store these Unicode code points as bits? This is referred to as a character encoding, and it describes a mapping between the code points and sequences of bits (although it probably should be called a grapheme encoding). In ASCII, each character is stored in 8 bits, but 8 bits limits the number of characters that can be represented to 256. To represent the thousands of Unicode code points, we need an encoding that uses more than 8 bits. However, we already have millions of files that are encoded in 8 bits with ASCII. Ideally, we'd like our new encoding to be backwards compatible, so our legacy ASCII files aren't garbled when read as if they were in the new encoding. This is where UTF-8 comes in.

UTF-8 is an encoding for Unicode code points, hence its acronym: Unicode Transformation Format. UTF-8 is known as a variable-length encoding because some code points are represented by 8 bits and others by 16 bits (or more). The cool thing is that all of the characters which can be represented in ASCII have the same bit encodings in ASCII and UTF-8, so reading an ASCII-encoded file as UTF-8 will just work. Reading a UTF-8 encoded file as if it were ASCII (as many Unicode-ignorant programs do) results in a character encoded in 16 bits being read as if it were two 8-bit characters, so instead of a Chinese character, you get a capital Q and an ASCII beep.

UTF-16 is similar to UTF-8, but instead of encoding characters as multiples of 8 bits, all characters are encoded as multiples of 16 bits. The drawbacks are that if the text consists primarily of characters in the ASCII range, it takes up twice the storage space. Also, files that contain mostly ASCII can't be read at all in editors that don't understand UTF-16, rather than just displaying the characters outside the ASCII range incorrectly.

Fonts

The final piece of this is fonts. A font describes how a character (code point) should be displayed on the screen. Useful fonts render glyphs people recognize. Before Unicode was prevalent and we could use U+2620 to represent a skull and crossbones (☠), there were fonts like Wingdings that displayed a symbol in place of a letter. For example, "N" in Wingdings is a skull and crossbones, but it's still (technically) an N; it's just that no one would recognize it as such. It's very important to recognize the difference between the code point, the character encoding, and the font describing the visual display.

In the next part, we'll discuss how Unicode and character encodings are used in Java.

Additional Resources

Joel Spolsky's great intro to Unicode in general, which sounds a lot like this post
Jukka K. Korpela's tutorial on character code issues