Unicode in Java: bytes and charsets (part 6)

In this part, I'll discuss some of the lower-level APIs for converting byte arrays to characters and a bit more about the Charset and CharsetDecoder classes.

The String class has two constructors that will decode a byte[] using a specified charset: String(byte[] bytes, String charsetName) and String(byte[] bytes, Charset charset). Likewise, it has two instance methods for doing the opposite: byte[] getBytes(String charsetName) and byte[] getBytes(Charset charset). It is almost always wrong to use the String(byte[]) constructor or the byte[] getBytes() method, since these use the platform's default encoding, which can vary from one machine to the next. It is nearly always better to choose a consistent encoding to use within your application, typically UTF-8, unless you have a good reason to do otherwise.
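To make the round trip concrete, here is a minimal sketch that encodes and decodes with an explicit Charset. The class name StringBytesExample and the helper names encode and decode are made up for illustration; the String constructor and getBytes overload they wrap are the ones described above.

```java
import java.nio.charset.Charset;

public class StringBytesExample {
    // Encode text to bytes using an explicitly chosen charset
    // (never the platform default).
    static byte[] encode(String text, Charset charset) {
        return text.getBytes(charset);
    }

    // Decode bytes back into a String using an explicitly chosen charset.
    static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        String original = "caf\u00e9"; // "café"

        // The accented letter needs two bytes in UTF-8,
        // so four characters become five bytes.
        byte[] encoded = encode(original, utf8);
        System.out.println(encoded.length);

        // Decoding with the same charset restores the original text.
        System.out.println(decode(encoded, utf8).equals(original));

        // Decoding with a different charset silently produces different text:
        // ISO-8859-1 maps each of the five bytes to its own character.
        String mismatched = decode(encoded, Charset.forName("ISO-8859-1"));
        System.out.println(mismatched.length());
    }
}
```

The last call shows why a mismatched or implicit encoding is so easy to miss: nothing fails, the text is just quietly wrong.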

In the previous part, we used the Charset class to retrieve the default character encoding. We can also use it to retrieve the Charset instance for a given name with the static method Charset.forName(String charsetName), e.g., Charset.forName("UTF-8"). Just as String has methods that take either the name of the encoding or a Charset instance, so do most of the Reader classes. In my previous examples I used the version where "UTF-8" is specified as a string, but the better way is to have a static final field that holds the value of Charset.forName("UTF-8") and use that. It eliminates the need to repeatedly look up the Charset, and it prevents a typo in the charset name from creating a hard-to-find bug.
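Here is a rough sketch of the constant approach, passed to an InputStreamReader. The class name CharsetConstantExample, the field name UTF_8, and the helper readFirstLine are my own choices for illustration, and the input comes from an in-memory byte array to keep the example self-contained.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

public class CharsetConstantExample {
    // Looked up once when the class loads; a misspelled name fails right here
    // instead of at some distant call site.
    static final Charset UTF_8 = Charset.forName("UTF-8");

    // Reads the first line of the given bytes, decoding them with the shared constant.
    static String readFirstLine(byte[] data) {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(data), UTF_8))) {
            return reader.readLine();
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked to keep the sketch short.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "h\u00e9llo\nworld".getBytes(UTF_8);
        System.out.println(readFirstLine(data));
    }
}
```

On Java 7 and later, java.nio.charset.StandardCharsets.UTF_8 provides the same constant out of the box, so you may not need to define your own.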

The CharsetDecoder class is provided for when you need more control over the decoding process than the String methods provide. This definitely falls into the "advanced" category, so I'm not going to cover it in depth here. Aaron Elkiss has a good writeup, as does the Javadoc.
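Just to give a flavor of the extra control involved, here is a minimal sketch of one common use: rejecting invalid input instead of silently accepting it. The class name StrictDecodeExample and the helper decodeStrict are hypothetical, and returning null on failure is just one way to signal the error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecodeExample {
    // Returns the decoded text, or null if the bytes are not valid in the given charset.
    static String decodeStrict(byte[] bytes, Charset charset) {
        CharsetDecoder decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");

        // 0xC3 starts a two-byte sequence, but 0x28 is not a valid continuation byte.
        byte[] invalid = { (byte) 0xC3, (byte) 0x28 };

        // The String constructor substitutes the replacement character U+FFFD.
        System.out.println(new String(invalid, utf8).contains("\uFFFD"));

        // The strict decoder reports the problem instead.
        System.out.println(decodeStrict(invalid, utf8) == null);
    }
}
```

The String constructors always substitute a replacement character for malformed input, so a decoder configured with CodingErrorAction.REPORT is the way to find out that the data was bad in the first place.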
