Unicode in Java: Default Charset (part 4)

In this part, I will discuss the default Charset and how to change it.

The default character set (technically, a character encoding) is fixed when the JVM starts. Every platform supplies its own default, but the default can also be configured explicitly. For example, Windows XP 32-bit (English) defaults to "windows-1252", the CP1252 encoding, which covers most Western European languages.

The default charset can be printed by calling:

System.out.println(java.nio.charset.Charset.defaultCharset());

When the JVM is started, the default charset can be set with the system property "file.encoding", e.g., "-Dfile.encoding=utf-8". Some IDEs do this automatically; NetBeans, for example, uses this property to explicitly set the charset to UTF-8. The drawback is that code using a class like FileReader, which relies on the default encoding, may handle Unicode correctly in the development environment but then break in an environment with a different default encoding. The developer should not rely on the user to set the encoding for the code to work correctly.
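One way to avoid depending on the default is to name the charset wherever bytes and characters meet. The following is a minimal sketch, not a definitive implementation: the class name ExplicitCharsetExample and the method roundTripUtf8 are made up for illustration, and UTF-8 is assumed to be the encoding you want on disk.

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; the name and methods are illustrative only.
final class ExplicitCharsetExample {

    // Writes the text to a temp file and reads the first line back, naming
    // UTF-8 on both sides so the JVM default charset never takes part.
    static String roundTripUtf8(String text) {
        File file = null;
        try {
            file = File.createTempFile("charset-demo", ".txt");
            try (Writer writer = new OutputStreamWriter(
                    new FileOutputStream(file), StandardCharsets.UTF_8)) {
                writer.write(text);
            }
            // Used instead of new FileReader(file), which falls back to the default charset.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(new FileInputStream(file), StandardCharsets.UTF_8))) {
                return reader.readLine();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            if (file != null) {
                file.delete();
            }
        }
    }

    public static void main(String[] args) {
        // Single-line sample text with non-ASCII characters, written as escapes
        // so the source file's own encoding does not matter.
        String original = "Gr\u00fc\u00dfe, \u4e16\u754c";
        System.out.println("Round trip intact: " + original.equals(roundTripUtf8(original)));
    }
}
```

Because both the writer and the reader name UTF-8, the result should be the same whether or not the JVM was started with -Dfile.encoding.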

Also, one might think the system property "file.encoding" could simply be altered programmatically. However, changing it after the JVM starts has no effect on the default charset, because by that time the system classes that rely on this value have already cached it.
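A small check makes this caching visible. This is only a sketch: the class name FileEncodingPropertyDemo and its method are hypothetical, and the check reads the default charset once before touching the property so that the cached value is certain to exist.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the name and method are illustrative only.
final class FileEncodingPropertyDemo {

    // Returns true if Charset.defaultCharset() is unchanged after the
    // "file.encoding" system property is overwritten at runtime.
    static boolean defaultSurvivesPropertyChange() {
        // The first call makes sure the default has been resolved and cached.
        Charset before = Charset.defaultCharset();
        String originalProperty = System.getProperty("file.encoding");

        // Pick a value that differs from the current default.
        String other = "UTF-8".equalsIgnoreCase(before.name()) ? "ISO-8859-1" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            return Charset.defaultCharset().equals(before);
        } finally {
            // Put the property back so the rest of the program sees the original value.
            if (originalProperty != null) {
                System.setProperty("file.encoding", originalProperty);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Unchanged after System.setProperty: "
                + defaultSurvivesPropertyChange());
    }
}
```

The property itself does change, so System.getProperty("file.encoding") reports the new value even though the charset actually in use stays the same.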

On Linux/Unix, you can also set the LC_ALL environment variable to affect the default encoding. For example, on one Linux box I have, the default is US-ASCII. When I run "export LC_ALL=en_US.UTF-8", the default encoding becomes UTF-8.

The environment variables LANG and LC_CTYPE have a similar effect.
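To see which of these variables the JVM was started with, and which default charset resulted, something like the following sketch can help. The class name LocaleEnvReport is hypothetical. Note one assumption about versions: on JDK 18 and later the default charset is UTF-8 unless the JVM is started with -Dfile.encoding=COMPAT, so the locale variables are only expected to change the result on older JDKs or in that compatibility mode.

```java
import java.nio.charset.Charset;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical demo class; the name and method are illustrative only.
final class LocaleEnvReport {

    // Collects the locale variables in POSIX precedence order:
    // LC_ALL overrides LC_CTYPE, which overrides LANG.
    static Map<String, String> localeEnvironment() {
        Map<String, String> result = new LinkedHashMap<>();
        for (String name : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            String value = System.getenv(name);
            result.put(name, value == null ? "(unset)" : value);
        }
        return result;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, String> entry : localeEnvironment().entrySet()) {
            System.out.println(entry.getKey() + "=" + entry.getValue());
        }
        System.out.println("Default charset: " + Charset.defaultCharset());
    }
}
```

Running it once as-is and once with a prefix such as LC_ALL=en_US.UTF-8 on the command line shows whether the variable reaches the JVM and whether the default charset follows it.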

In summary, the default charset is used by many classes when a character set is not explicitly specified, but this charset should not be relied upon to work correctly when your application is supposed to handle Unicode.

3 thoughts on “Unicode in Java: Default Charset (part 4)”

  • December 2, 2009 at 10:58 am

    Excellent Analyzation

  • August 17, 2014 at 9:21 am

    LC_ALL doesn't work for setting the default file encoding…

    [briemers@briemersw tmp]$ uname -a
    Linux briemersw.docbill.info 3.15.7-200.fc20.x86_64 #1 SMP Mon Jul 28 18:50:26 UTC 2014 x86_64 x86_64 x86_64 GNU/Linux
    [briemers@briemersw tmp]$ cat CharSetTest.java
    import java.util.*;
    import java.io.*;
    import java.nio.charset.*;

    public class CharSetTest {

        public static void main(String[] args) {
            System.out.println("Default Charset=" + Charset.defaultCharset());
            System.setProperty("file.encoding", "Latin-1");
            System.out.println("file.encoding=" + System.getProperty("file.encoding"));
            System.out.println("Default Charset=" + Charset.defaultCharset());
            System.out.println("Default Charset in Use=" + getDefaultCharSet());
        }

        private static String getDefaultCharSet() {
            OutputStreamWriter writer = new OutputStreamWriter(new ByteArrayOutputStream());
            String enc = writer.getEncoding();
            return enc;
        }
    }
    [briemers@briemersw tmp]$ echo $LC_ALL
    en_CA.utf8
    [briemers@briemersw tmp]$ LC_ALL=en_CA.utf8 java CharSetTest
    Default Charset=UTF-8
    file.encoding=Latin-1
    Default Charset=UTF-8
    Default Charset in Use=UTF8
    [briemers@briemersw tmp]$ LC_ALL=en_CA.utf-8 java CharSetTest
    Default Charset=UTF-8
    file.encoding=Latin-1
    Default Charset=UTF-8
    Default Charset in Use=UTF8

  • August 17, 2014 at 9:23 am

    Oh stupid me, I didn't even look at the code I cut & pasted from different source to see they explicitly set file.encoding before checking it. Duhhh…
