Unicode in Java: Readers and Writers (part 3)

In the previous parts, I discussed Unicode, encodings, and which encodings Java uses internally. In this part, I'll discuss using Readers and Writers in a Unicode-compliant way. In short, never use FileReader or FileWriter. This is particularly important to understand because none of the Java books I have read stated it explicitly enough for me to grasp it until I ran into the problem in the field.

The various Reader and Writer classes in Java almost never do the correct thing by default. Not because they're poorly designed, but because it's largely up to the user to specify what "the correct thing" is. For example, FileReader and FileWriter will always use the default character encoding. This varies widely between platforms: for example, Windows XP 32-bit defaults to CP1252 (a variant of ISO-8859-1), many Linuxes default to US-ASCII, and Mac OS X defaults to MacRoman. If you expect your users to input Unicode characters, any character the default encoding can't represent will be garbled. It is possible to change the default character encoding (which we'll discuss later), but you shouldn't rely on your users to set their environments up in a certain way, particularly when your users are non-technical.
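To see which encoding FileReader and FileWriter would pick up on a given machine, you can ask the JVM directly. This is only a small diagnostic sketch; the class name DefaultCharsetCheck is made up for illustration, and the output depends entirely on your platform and JDK version (newer JDKs may simply report UTF-8).

```java
import java.nio.charset.Charset;

// Hypothetical diagnostic class: prints the JVM's default charset.
public class DefaultCharsetCheck {

    /** Returns the name of the JVM's default charset on this machine. */
    public static String defaultCharsetName() {
        return Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        // The value differs between platforms, locales, and JDK versions.
        System.out.println("Default charset: " + defaultCharsetName());
        // The system property that traditionally influences the default.
        System.out.println("file.encoding:   " + System.getProperty("file.encoding"));
    }
}
```

Running this on a few different machines is a quick way to convince yourself that the default is not something your code should depend on.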

If your application has control over a set of files, it needs to explicitly specify the character encoding and always use that encoding. Instead of using FileReader and FileWriter, you must use InputStreamReader and OutputStreamWriter with the constructors that take a stream and a charset name string, e.g. "UTF-8". The naming is a bit confusing, since the parameter is referred to as a "charset" even though it's technically a character encoding. Here is what the code should look like:

InputStream istream = ...;
BufferedReader reader = new BufferedReader(new InputStreamReader(istream, "UTF-8"));
 
OutputStream ostream = ...;
Writer writer = new OutputStreamWriter(ostream, "UTF-8");

If you're reading and writing files, you can use FileInputStream and FileOutputStream as the InputStream and OutputStream instances. The *Stream classes only read and write bytes, so it's the Reader or Writer that actually applies an encoding to map the bytes to chars or vice versa. You can pretty much just grep your code for FileReader and FileWriter to find places where support for Unicode will break.
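Putting the pieces together for files, a minimal sketch might look like the following. The class and method names (Utf8FileRoundTrip, writeUtf8, readFirstLineUtf8) are invented for this example, and it assumes Java 7 or later for try-with-resources. It writes a string containing non-ASCII characters to a temporary file and reads it back, naming the encoding on both sides.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

// Hypothetical example class: explicit UTF-8 on both the write and read side.
public class Utf8FileRoundTrip {

    /** Writes text to the file as UTF-8, regardless of the platform default. */
    public static void writeUtf8(File file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8")) {
            writer.write(text);
        }
    }

    /** Reads the first line of the file, decoding the bytes as UTF-8. */
    public static String readFirstLineUtf8(File file) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("utf8-demo", ".txt");
        file.deleteOnExit();

        // Pound sign, e-acute, and three CJK characters, written as escapes
        // so the source file itself stays pure ASCII.
        String original = "Price: \u00a35, caf\u00e9, \u65e5\u672c\u8a9e";
        writeUtf8(file, original);

        String readBack = readFirstLineUtf8(file);
        System.out.println("Round trip intact: " + original.equals(readBack));
    }
}
```

The important part is that the same encoding name appears on both the reading and the writing side; the temporary file is just a convenience for the demonstration.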

The javadoc for these classes isn't much help unless you're already aware of the issues. The FileOutputStream javadoc says "FileOutputStream is meant for writing streams of raw bytes such as image data. For writing streams of characters, consider using FileWriter." This is misleading: if you're unaware of the issues with Unicode support, you might think that FileWriter will "just work" when your code expects to handle Unicode. The FileWriter javadoc says "The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable." If you know what that means, you're okay. But a more useful warning would be "This will almost never write anything other than American English correctly, so don't use it!". I say American English because, for example, the British pound symbol £ isn't included in ASCII.
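The pound sign problem is easy to reproduce without touching the file system. The sketch below (PoundSignDemo and roundTrip are names made up for this example, and it assumes Java 7+ for StandardCharsets) encodes a string to bytes and decodes it again with the same charset, which is roughly what happens when you write a file and read it back.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical example class: shows what an ASCII-only encoding does to a pound sign.
public class PoundSignDemo {

    /** Encodes text to bytes with the given charset, then decodes it again. */
    public static String roundTrip(String text, Charset charset) {
        // getBytes substitutes the charset's replacement byte for any
        // character it cannot represent; it does not throw.
        byte[] bytes = text.getBytes(charset);
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        String price = "\u00a35"; // pound sign followed by the digit 5

        // US-ASCII has no pound sign, so it is silently replaced.
        System.out.println("US-ASCII: " + roundTrip(price, StandardCharsets.US_ASCII));
        // UTF-8 can represent every Unicode character, so the text survives.
        System.out.println("UTF-8 intact: " + price.equals(roundTrip(price, StandardCharsets.UTF_8)));
    }
}
```

Note that nothing fails loudly here: with US-ASCII the pound sign comes back as a question mark, which is exactly why this kind of bug tends to be discovered by users rather than by tests.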

Now, go and find all of the places in your code where this is broken and fix it.

In the next part, I'll discuss more about the default character set.
