<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>(\_ -&#62; Phil Varner)</title>
	<atom:link href="http://www.philvarner.com/blog/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.philvarner.com/blog</link>
	<description>mostly technical stuff</description>
	<lastBuildDate>Sat, 06 Mar 2010 18:01:52 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.1</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>ClamAVj : a Java library for accessing the ClamAV clamd daemon</title>
		<link>http://www.philvarner.com/blog/2010/03/06/clamavj-a-java-library-for-accessing-the-clamav-clamd-daemon/</link>
		<comments>http://www.philvarner.com/blog/2010/03/06/clamavj-a-java-library-for-accessing-the-clamav-clamd-daemon/#comments</comments>
		<pubDate>Sat, 06 Mar 2010 18:01:52 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=346</guid>
		<description><![CDATA[I wrote some code last week to scan files against the ClamAV antivirus scanner using the clamd daemon.  It&#039;s up now on Google Code under the Apache 2.0 license.
]]></description>
			<content:encoded><![CDATA[<p>I wrote some code last week to scan files against the ClamAV antivirus scanner using the clamd daemon.  It&#039;s up now on <a href="http://code.google.com/p/clamavj/">Google Code</a> under the Apache 2.0 license.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2010/03/06/clamavj-a-java-library-for-accessing-the-clamav-clamd-daemon/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>C3P0</title>
		<link>http://www.philvarner.com/blog/2009/11/02/c3p0/</link>
		<comments>http://www.philvarner.com/blog/2009/11/02/c3p0/#comments</comments>
		<pubDate>Mon, 02 Nov 2009 08:01:09 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=336</guid>
		<description><![CDATA[It&#039;s a little scary that this is valid Java:

public class Main &#123;
&#160;
    static class OX &#123;
        public static double C3P0;
    &#125;
&#160;
    public static void main&#40;String&#91;&#93; args&#41; &#123;
        OX.C3P0 = 0X.C3P0;
   [...]]]></description>
			<content:encoded><![CDATA[<p>It&#039;s a little scary that this is valid Java:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> Main <span style="color: #009900;">&#123;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000000; font-weight: bold;">class</span> OX <span style="color: #009900;">&#123;</span>
        <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">double</span> C3P0<span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
    <span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">static</span> <span style="color: #000066; font-weight: bold;">void</span> main<span style="color: #009900;">&#40;</span><span style="color: #003399;">String</span><span style="color: #009900;">&#91;</span><span style="color: #009900;">&#93;</span> args<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span>
        OX.<span style="color: #006633;">C3P0</span> <span style="color: #339933;">=</span> 0X.<span style="color: #006633;">C3P0</span><span style="color: #339933;">;</span>
    <span style="color: #009900;">&#125;</span>
&nbsp;
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>Hint: see the <a href="http://java.sun.com/docs/books/jls/third_edition/html/lexical.html">lexical structure</a> of hex literals.</p>
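<p>For anyone who wants to check the answer, here is a minimal sketch of what the compiler sees. The outer class name HexFloatLiteral and the helper assign() are made up for illustration; the nested class and the assignment are the ones from the snippet above.</p>

```java
public class HexFloatLiteral {

    // Same shape as the snippet: a nested class OX (capital letter O)
    // with a static double field named C3P0.
    static class OX {
        public static double C3P0;
    }

    static double assign() {
        // Left side: the field C3P0 of class OX.
        // Right side: 0X.C3P0 (digit zero) is a hexadecimal floating-point
        // literal with significand 0x.C3 = 195/256 and binary exponent P0 = 2^0.
        OX.C3P0 = 0X.C3P0;
        return OX.C3P0;
    }

    public static void main(String[] args) {
        System.out.println(assign()); // prints 0.76171875
    }
}
```

<p>The two sides look identical in most fonts, but one starts with the letter O and the other with the digit zero.</p>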
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/11/02/c3p0/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Multi-extends in generified types</title>
		<link>http://www.philvarner.com/blog/2009/10/27/multi-extends-in-generified-types/</link>
		<comments>http://www.philvarner.com/blog/2009/10/27/multi-extends-in-generified-types/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 05:24:03 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=112</guid>
		<description><![CDATA[In Effective Java, I came across a language construct I&#039;d never seen before:

public class Foo&#60;T extends List &#38; Comparator&#62; &#123; 
    &#60;U extends List &#38; Comparator&#62; void foo&#40;U x&#41; &#123; &#125;
&#125;

This declares that T must extend or implement both List and Comparator.  I&#039;ve never had occasion to use this, but I [...]]]></description>
			<content:encoded><![CDATA[<p>In <em>Effective Java</em>, I came across a language construct I&#039;d never seen before:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #000000; font-weight: bold;">public</span> <span style="color: #000000; font-weight: bold;">class</span> Foo<span style="color: #339933;">&lt;</span>T <span style="color: #000000; font-weight: bold;">extends</span> <span style="color: #003399;">List</span> <span style="color: #339933;">&amp;</span> Comparator<span style="color: #339933;">&gt;</span> <span style="color: #009900;">&#123;</span> 
    <span style="color: #339933;">&lt;</span>U <span style="color: #000000; font-weight: bold;">extends</span> <span style="color: #003399;">List</span> <span style="color: #339933;">&amp;</span> Comparator<span style="color: #339933;">&gt;</span> <span style="color: #000066; font-weight: bold;">void</span> foo<span style="color: #009900;">&#40;</span>U x<span style="color: #009900;">&#41;</span> <span style="color: #009900;">&#123;</span> <span style="color: #009900;">&#125;</span>
<span style="color: #009900;">&#125;</span></pre></div></div>

<p>This declares that T must extend or implement both List and Comparator.  I&#039;ve never had occasion to use this, but I can imagine it would be useful.  The example Bloch gives in the book is when T is derived from one class and implements an interface. </p>
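<p>As a rough sketch of where this could be useful, the method below needs its argument to be both a Number and Comparable. The class name MultiBound and the method max are invented for this example; they are not from the book.</p>

```java
public class MultiBound {

    // T must be a Number AND be comparable to itself. A class bound, if
    // present, has to come first; any further bounds must be interfaces.
    static <T extends Number & Comparable<T>> T max(T a, T b) {
        return a.compareTo(b) >= 0 ? a : b;
    }

    public static void main(String[] args) {
        System.out.println(max(3, 7));     // Integer satisfies both bounds
        System.out.println(max(2.5, 1.5)); // so does Double
    }
}
```

<p>Inside max, both the Number methods and compareTo are available on T without any casts.</p>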
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/27/multi-extends-in-generified-types/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode in Java: some Groovy pieces (part 7)</title>
		<link>http://www.philvarner.com/blog/2009/10/27/unicode-in-java-some-groovy-pieces-part-7/</link>
		<comments>http://www.philvarner.com/blog/2009/10/27/unicode-in-java-some-groovy-pieces-part-7/#comments</comments>
		<pubDate>Wed, 28 Oct 2009 05:00:40 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=277</guid>
		<description><![CDATA[One of the common tasks Java developers use Groovy for is testing.  One of the common idioms I use is to create a list of strings and use the &#034;each&#034; method to assert that an output file contains them.  When testing Unicode, this means both the output files and the Groovy source files [...]]]></description>
			<content:encoded><![CDATA[<p>One of the common tasks Java developers use Groovy for is testing.  One of the common idioms I use is to create a list of strings and use the &#034;each&#034; method to assert that an output file contains them.  When testing Unicode, this means both the output files and the Groovy source files contain Unicode characters.  For example, the code may contain:</p>

<div class="wp_syntax"><div class="code"><pre class="groovy" style="font-family:monospace;">        <span style="color: #000000; font-weight: bold;">def</span> contents <span style="color: #66cc66;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #aaaadd; font-weight: bold;">File</span><span style="color: #66cc66;">&#40;</span>outputFile<span style="color: #66cc66;">&#41;</span>.<span style="color: #FFCC33;">getText</span><span style="color: #66cc66;">&#40;</span><span style="color: #ff0000;">&quot;UTF-8&quot;</span><span style="color: #66cc66;">&#41;</span>
&nbsp;
       <span style="color: #66cc66;">&#91;</span> <span style="color: #ff0000;">&quot;D'fhuascail Íosa Úrmhac na hÓighe Beannaithe pór Éava agus Ádhaimh&quot;</span>,
         <span style="color: #ff0000;">'イロハニホヘト チリヌルヲ ワカヨタレソ ツネナラム'</span>,
         <span style="color: #ff0000;">'เป็นมนุษย์สุดประเสริฐเลิศคุณค่า'</span>
        <span style="color: #66cc66;">&#93;</span>.<span style="color: #663399;">each</span><span style="color: #66cc66;">&#123;</span> assertTrue<span style="color: #66cc66;">&#40;</span>contents.<span style="color: #CC0099;">contains</span><span style="color: #66cc66;">&#40;</span>it<span style="color: #66cc66;">&#41;</span>, <span style="color: #ff0000;">&quot;${it} not in ${outputFile}&quot;</span><span style="color: #66cc66;">&#41;</span> <span style="color: #66cc66;">&#125;</span></pre></div></div>

<p>The first point is that we can no longer use the File#text method; instead, we need to use the getText method that takes a character encoding scheme argument.  </p>
<p>The second point is that when Java or Groovy source files contain Unicode characters, you must specify what the encoding of those files is.  In this case, we&#039;ve saved our source files in UTF-8 encoding. As with the JVM, javac and groovyc will default to using the platform default encoding if none is specified, which would give us odd errors when the non-printable characters that resulted from incorrectly decoding the UTF-8 were fed to the compiler. </p>
<p>When I call groovyc from Ant, this is the code I use:</p>

<div class="wp_syntax"><div class="code"><pre class="xml" style="font-family:monospace;">         <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;groovyc</span> <span style="color: #000066;">srcdir</span>=<span style="color: #ff0000;">&quot;.&quot;</span> <span style="color: #000066;">includes</span>=<span style="color: #ff0000;">&quot;com/example/**/*.groovy&quot;</span> <span style="color: #000066;">destdir</span>=<span style="color: #ff0000;">&quot;${twork}&quot;</span> <span style="color: #000066;">encoding</span>=<span style="color: #ff0000;">&quot;UTF-8&quot;</span><span style="color: #000000; font-weight: bold;">&gt;</span></span>
            <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;classpath</span> <span style="color: #000066;">refid</span>=<span style="color: #ff0000;">&quot;example.common.class.path&quot;</span><span style="color: #000000; font-weight: bold;">/&gt;</span></span>
         <span style="color: #009900;"><span style="color: #000000; font-weight: bold;">&lt;/groovyc<span style="color: #000000; font-weight: bold;">&gt;</span></span></span></pre></div></div>

<p>For more on Groovy and Unicode, Guillaume has an excellent post: <a href="http://glaforge.free.fr/weblog/index.php?itemid=74">Heads-up on File and Stream groovy methods</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/27/unicode-in-java-some-groovy-pieces-part-7/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode in Java: bytes and charsets (part 6)</title>
		<link>http://www.philvarner.com/blog/2009/10/26/unicode-in-java-bytes-and-chars-part/</link>
		<comments>http://www.philvarner.com/blog/2009/10/26/unicode-in-java-bytes-and-chars-part/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 03:32:07 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=252</guid>
		<description><![CDATA[In this part, I&#039;ll discuss some of the lower-level APIs for converting byte arrays to characters and a bit more about the Charset and CharsetDecoder classes.
The String class has two constructors that will decode a byte[] using a specified charset: String(byte[] bytes, String charsetName) and
String(byte[] bytes, Charset charset).  Likewise, it has two instance methods [...]]]></description>
			<content:encoded><![CDATA[<p>In this part, I&#039;ll discuss some of the lower-level APIs for converting byte arrays to characters and a bit more about the Charset and CharsetDecoder classes.</p>
<p>The String class has two constructors that will decode a byte[] using a specified charset: String(byte[] bytes, String charsetName) and<br />
String(byte[] bytes, Charset charset).  Likewise, it has two instance methods for doing the opposite: byte[] getBytes(String charsetName) and byte[] getBytes(Charset charset).  It is almost always wrong to use the String(byte[]) constructor or the byte[] getBytes() method, since these will use the default platform encoding.  It is nearly always better to choose a consistent encoding to use within your application, typically UTF-8, unless you have a good reason to do otherwise.</p>
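<p>Here is a small sketch of the Charset-taking versions in action. The class name ExplicitCharsetRoundTrip and the helper roundTrip are made up for the example; the sample text is written with a Unicode escape so the source file itself stays plain ASCII.</p>

```java
import java.nio.charset.Charset;

public class ExplicitCharsetRoundTrip {

    // Encode with an explicit charset, then decode with the same one.
    static String roundTrip(String text, Charset charset) {
        byte[] encoded = text.getBytes(charset);
        return new String(encoded, charset);
    }

    public static void main(String[] args) {
        String text = "caf\u00e9"; // ends in U+00E9, e with acute

        // UTF-8 can represent every code point, so the text survives.
        System.out.println(text.equals(roundTrip(text, Charset.forName("UTF-8"))));    // true

        // US-ASCII has no mapping for U+00E9; getBytes(Charset) substitutes '?'.
        System.out.println(text.equals(roundTrip(text, Charset.forName("US-ASCII")))); // false
    }
}
```

<p>The second call is what happens silently when the platform default turns out to be a narrower encoding than the one you tested with.</p>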
<p>In the previous part, we used the Charset class to retrieve the default character encoding.  We can also use this to retrieve the Charset instance for a given string name with the static method Charset.forName(String charsetName), e.g., Charset.forName(&#034;UTF-8&#034;).  In addition to String having methods that take either a string name of the encoding or the Charset instance, most of the Reader classes do too.  In my previous examples I showed using the version where &#034;UTF-8&#034; is specified, but the better way would be to have a final static attribute that contains the value of Charset.forName(&#034;UTF-8&#034;) and use this.  It eliminates the need to repeatedly look up the Charset and it prevents a typo in the charset name from creating a hard-to-find bug.</p>
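<p>A minimal sketch of that constant, assuming a hypothetical holder class Utf8Constant and a helper firstLine that exists only for this example:</p>

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

public class Utf8Constant {

    // Looked up once. A misspelled name fails here, when the class is
    // initialized, instead of somewhere deep inside the I/O code.
    static final Charset UTF_8 = Charset.forName("UTF-8");

    // Reads the first line of a stream, always decoding it as UTF-8.
    static String firstLine(InputStream in) throws IOException {
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(in, UTF_8))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = "na\u00efve\nsecond line".getBytes(UTF_8);
        System.out.println(firstLine(new ByteArrayInputStream(bytes)).length()); // prints 5
    }
}
```

<p>On Java 7 and later, java.nio.charset.StandardCharsets.UTF_8 provides the same constant ready-made.</p>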
<p>The CharsetDecoder class is provided for when you need more control over the decoding process than the String methods provide.  This definitely falls into the &#034;advanced&#034; category, so I&#039;m not going to cover it here.  Aaron Elkiss has a <a href="http://www.umiacs.umd.edu/~aelkiss/xml/java/encode4.html">good writeup</a>, as does the <a href="http://java.sun.com/javase/6/docs/api/java/nio/charset/CharsetEncoder.html">javadoc</a>.</p>
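<p>To give just a flavor of that extra control, here is a rough sketch of one common use: rejecting malformed input instead of silently replacing it, which is what the String constructors do. The class name StrictDecoding and the method decodeStrict are invented for this example.</p>

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictDecoding {

    // Decodes bytes as UTF-8 and throws on bad input rather than
    // substituting the replacement character U+FFFD.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] valid = { (byte) 0xC3, (byte) 0xA9 };   // UTF-8 for U+00E9
        byte[] invalid = { (byte) 0xC3, (byte) 0x28 }; // 0x28 is not a continuation byte
        try {
            System.out.println(decodeStrict(valid).length()); // one char
            decodeStrict(invalid);                            // throws
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```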
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/26/unicode-in-java-bytes-and-chars-part/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode in Java: sample data (part 5)</title>
		<link>http://www.philvarner.com/blog/2009/10/26/unicode-in-java-sample-dat-part-5/</link>
		<comments>http://www.philvarner.com/blog/2009/10/26/unicode-in-java-sample-dat-part-5/#comments</comments>
		<pubDate>Tue, 27 Oct 2009 02:59:59 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=294</guid>
		<description><![CDATA[When testing Unicode with your application, you need some examples.  Most people don&#039;t have Thai or Katakana files sitting around, so finding test data is hard.  
I&#039;ve been playing around with JavaScript and JQuery recently, so I thought I&#039;d build a small app that would render Unicode characters from a variety of languages [...]]]></description>
			<content:encoded><![CDATA[<p>When testing Unicode with your application, you need some examples.  Most people don&#039;t have Thai or Katakana files sitting around, so finding test data is hard.  </p>
<p>I&#039;ve been playing around with JavaScript and JQuery recently, so I thought I&#039;d build a small <a href="http://www.philvarner.com/unicode/">app</a> that would render Unicode characters from a variety of languages in a variety of scripts.  You can cut-and-paste the examples into your own test files, or since the HTML file contains the characters themselves (instead of the HTML escape codes), you could even use the file as test data.  It even has Klingon :)</p>
<p><img src="http://www.philvarner.com/blog/wp-content/uploads/2009/10/unicode_app.png" alt="unicode_app" title="unicode_app" width="500" height="351" class="aligncenter size-full wp-image-295" /></p>
<p>Markus Kuhn has a lot of good <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/">examples</a> including &#034;quick brown fox&#034; examples in <a href="http://www.cl.cam.ac.uk/~mgk25/ucs/examples/quickbrown.txt">many languages</a> (unfortunately Chinese is not among them).</p>
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/26/unicode-in-java-sample-dat-part-5/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode in Java: Default Charset (part 4)</title>
		<link>http://www.philvarner.com/blog/2009/10/24/unicode-in-java-default-charset-part-4/</link>
		<comments>http://www.philvarner.com/blog/2009/10/24/unicode-in-java-default-charset-part-4/#comments</comments>
		<pubDate>Sat, 24 Oct 2009 23:36:26 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=250</guid>
		<description><![CDATA[In this part, I will discuss the default Charset and how to change it.
The default character set (technically a character encoding) is set when the JVM starts.  Every platform has a default default, but the default can also be configured explicitly. For example, Windows XP 32 bit (English) defaults to &#034;windows-1252&#034;, which is the [...]]]></description>
			<content:encoded><![CDATA[<p>In this part, I will discuss the default Charset and how to change it.</p>
<p>The default character set (technically a character encoding) is set when the JVM starts.  Every platform has a default default, but the default can also be configured explicitly. For example, Windows XP 32 bit (English) defaults to &#034;windows-1252&#034;, which is the CP1252 encoding that provides for encoding most Western European languages.</p>
<p>The default charset can be printed by calling:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #003399;">System</span>.<span style="color: #006633;">out</span>.<span style="color: #006633;">println</span><span style="color: #009900;">&#40;</span>java.<span style="color: #006633;">nio</span>.<span style="color: #006633;">charset</span>.<span style="color: #006633;">Charset</span>.<span style="color: #006633;">defaultCharset</span><span style="color: #009900;">&#40;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>When the JVM is started, the default charset can be set with the property &#034;file.encoding&#034;, e.g., &#034;-Dfile.encoding=utf-8&#034;. Some IDEs will do this automatically, for example, NetBeans uses this property to explicitly set the charset to UTF-8.  The drawback to this is that code that uses a class like FileReader that relies on the default encoding may work correctly when handling Unicode in the development environment, but then break when used in an environment that has a different default encoding.  The developer should not rely on the user to set the encoding for the code to work correctly. </p>
<p>Also, one might think they could just alter the system property &#034;file.encoding&#034; programmatically. However, this cannot be set after the JVM starts, as by that time all of the system classes which rely on this value have already cached it.</p>
<p>In Linux/Unix, you can also set the LC_ALL environment variable to affect the default encoding.  For example, on one Linux box I have, the default is US-ASCII.  When I set &#034;export LC_ALL=en_US.UTF-8&#034;, the default encoding is UTF-8.</p>
<p>The environment variables LANG and LC_CTYPE will also have a similar effect (more <a href="http://opengroup.org/onlinepubs/007908799/xbd/envvar.html">here</a>).</p>
<p>In summary, the default charset is used by many classes when a character set is not explicitly specified, but this charset should not be relied upon to work correctly when your application is supposed to handle Unicode.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/24/unicode-in-java-default-charset-part-4/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Unicode in Java: Readers and Writers (part 3)</title>
		<link>http://www.philvarner.com/blog/2009/10/21/unicode-in-java-readers-and-writers-part-3/</link>
		<comments>http://www.philvarner.com/blog/2009/10/21/unicode-in-java-readers-and-writers-part-3/#comments</comments>
		<pubDate>Thu, 22 Oct 2009 05:42:42 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=269</guid>
		<description><![CDATA[In the previous parts, I&#039;ve discussed Unicode, encodings, and which encodings are used for Java internally.  In this part, I&#039;ll discuss using Readers and Writers in a Unicode-compliant way.  In short, never use FileReader or FileWriter.  This is a particularly important thing to understand because I don&#039;t feel any of the Java [...]]]></description>
			<content:encoded><![CDATA[<p>In the previous parts, I&#039;ve discussed Unicode, encodings, and which encodings Java uses internally.  In this part, I&#039;ll discuss using Readers and Writers in a Unicode-compliant way.  In short, never use FileReader or FileWriter.  This is particularly important to understand because none of the Java books I have stated it explicitly enough for me to grasp it before I encountered it in the field.</p>
<p>The various Reader and Writer classes in Java almost never do the correct thing by default. Not because they&#039;re not well-designed, but because it&#039;s largely up to the user to specify what &#034;the correct thing&#034; is.  For example, FileReader and FileWriter will <strong>always</strong> use the default character encoding.  This varies widely between platforms: for example, Windows XP 32-bit defaults to CP1252 (a variant of <a href="http://en.wikipedia.org/wiki/ISO/IEC_8859-1">ISO-8859-1</a>), many Linuxes default to US-ASCII, and MacOS X defaults to MacRoman.  If you expect your users to input Unicode characters, any character the default encoding cannot represent will be garbled.  It is possible to change the default character encoding (which we&#039;ll discuss later), but you shouldn&#039;t rely on your users to set their environments up in a certain way, particularly when your users are non-technical.</p>
<p>If your application has control over a set of files, it needs to explicitly specify the character encoding and always use that encoding.  Instead of using FileReader and FileWriter, you must use InputStreamReader and OutputStreamWriter with the constructors that take a stream and a charset name string, e.g. &#034;UTF-8&#034;. This is a bit confusing, since it is referred to as a &#034;charset&#034;, even though it&#039;s technically a character encoding.  Here is what the code should look like:</p>

<div class="wp_syntax"><div class="code"><pre class="java" style="font-family:monospace;"><span style="color: #003399;">InputStream</span> istream <span style="color: #339933;">=</span> ...<span style="color: #339933;">;</span>
<span style="color: #003399;">BufferedReader</span> reader <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">BufferedReader</span><span style="color: #009900;">&#40;</span><span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">InputStreamReader</span><span style="color: #009900;">&#40;</span>istream, <span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span>
&nbsp;
<span style="color: #003399;">OutputStream</span> ostream <span style="color: #339933;">=</span> ...<span style="color: #339933;">;</span>
<span style="color: #003399;">Writer</span> writer <span style="color: #339933;">=</span> <span style="color: #000000; font-weight: bold;">new</span> <span style="color: #003399;">OutputStreamWriter</span><span style="color: #009900;">&#40;</span>ostream, <span style="color: #0000ff;">&quot;UTF-8&quot;</span><span style="color: #009900;">&#41;</span><span style="color: #339933;">;</span></pre></div></div>

<p>If you&#039;re reading and writing files, you can use the FileInputStream and FileOutputStream implementations for the InputStream and OutputStream instances.  The *Stream classes only read and write bytes, so it&#039;s the Reader or Writer that actually applies an encoding to map the bytes to chars or vice versa.  You can pretty much just grep your code for FileReader and FileWriter to find places where support for Unicode will break.  </p>
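<p>Put together for files, a minimal sketch might look like the following. The class name Utf8FileRoundTrip, the helpers write and readFirstLine, and the temp file name are all made up for the example.</p>

```java
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Writer;

public class Utf8FileRoundTrip {

    // The FileOutputStream only moves bytes; the OutputStreamWriter
    // is what turns chars into UTF-8 bytes.
    static void write(File file, String text) throws IOException {
        try (Writer writer = new OutputStreamWriter(new FileOutputStream(file), "UTF-8")) {
            writer.write(text);
        }
    }

    // The mirror image: bytes from the FileInputStream are decoded
    // as UTF-8 by the InputStreamReader.
    static String readFirstLine(File file) throws IOException {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file), "UTF-8"))) {
            return reader.readLine();
        }
    }

    public static void main(String[] args) throws IOException {
        File file = File.createTempFile("unicode-demo", ".txt"); // scratch file for the demo
        file.deleteOnExit();

        String text = "\u00a3 and \u2603"; // pound sign and snowman, as escapes
        write(file, text);
        System.out.println(text.equals(readFirstLine(file))); // true whatever the platform default is
    }
}
```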
<p>The javadoc for these classes isn&#039;t much help unless you&#039;re already aware of the issues.  The FileOutputStream javadoc says &#034;FileOutputStream is meant for writing streams of raw bytes such as image data. For writing streams of characters, consider using FileWriter. &#034;  This is misleading, since if you&#039;re naive to the issues with Unicode support, you might think that FileWriter will &#034;just work&#034; if your code expects to handle Unicode.  The FileWriter javadoc says &#034;The constructors of this class assume that the default character encoding and the default byte-buffer size are acceptable.&#034;  If you know what that means, you&#039;re okay.  But a more useful warning would be &#034;This will almost never write anything other than American English correctly, so don&#039;t use it!&#034;.  I say American English because, for example, the British pound symbol £ isn&#039;t included in ASCII.</p>
<p>Now, go and find all of the places in your code where this is broken and fix it.</p>
<p>In the next part, I&#039;ll discuss more about the default character set.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/21/unicode-in-java-readers-and-writers-part-3/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode in Java: primitives and encodings (part 2)</title>
		<link>http://www.philvarner.com/blog/2009/10/20/unicode-in-java-primitives-and-encodings-part-2/</link>
		<comments>http://www.philvarner.com/blog/2009/10/20/unicode-in-java-primitives-and-encodings-part-2/#comments</comments>
		<pubDate>Wed, 21 Oct 2009 06:06:20 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=231</guid>
		<description><![CDATA[In the last part, I discussed how Unicode is a consistent naming scheme for graphemes, how character encodings such as UTF-8 map Unicode code points to bits, and how fonts describe how code points should be visually displayed.  In this part, I discuss the specific things you need to know about using Unicode in [...]]]></description>
			<content:encoded><![CDATA[<p>In the last <a href="http://www.philvarner.com/blog/2009/10/19/unicode-in-java-part-1/">part</a>, I discussed how Unicode is a consistent naming scheme for graphemes, how character encodings such as UTF-8 map Unicode code points to bits, and how fonts describe how code points should be visually displayed.  In this part, I discuss the specific things you need to know about using Unicode in Java code.</p>
<h3>Java primitives and Unicode</h3>
<p>The two most commonly used character encodings for Unicode are UTF-8 and UTF-16.  Java uses UTF-16 for char values, and as a result for Strings, since these are just an object wrapper for a char array.  UTF-8 is most commonly used when writing files, particularly XML.  UTF-16 stores nearly all characters as a sequence of 16 bits, even the ones that could be stored in only 8 bits (e.g., characters in the ASCII range).  UTF-8 uses a variable-length encoding scheme that stores ASCII-range characters in 8 bits and other characters in 2 to 4 bytes, depending on the character.  For example, the letter &#034;a&#034; (Latin small letter a, U+0061) is represented with 8 bits; &#034;á&#034; (Latin small letter A with acute, U+00E1) is represented with 16 bits, and our beloved snowman (☃) is represented with 24 bits. As I mentioned before, files encoded using ASCII can be read as if they were encoded using UTF-8, and files written using UTF-8 that only contain characters in the ASCII range can be read by Unicode-ignorant programs as if they were ASCII (usually).  UTF-16 uses a similar variable-width encoding as UTF-8, but uses increments of 16 bits instead of 8. </p>
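<p>These sizes are easy to check. A small sketch, with a made-up class name EncodedLengths and helper byteCount, and with the sample characters written as escapes so the source file stays ASCII:</p>

```java
import java.nio.charset.Charset;

public class EncodedLengths {

    // Number of bytes the text occupies in the named encoding.
    static int byteCount(String text, String charsetName) {
        return text.getBytes(Charset.forName(charsetName)).length;
    }

    public static void main(String[] args) {
        String[] samples = { "a", "\u00e1", "\u2603" }; // a, a with acute, snowman
        for (String s : samples) {
            System.out.println(String.format("U+%04X: UTF-8 = %d bytes, UTF-16BE = %d bytes",
                    s.codePointAt(0), byteCount(s, "UTF-8"), byteCount(s, "UTF-16BE")));
        }
    }
}
```

<p>The UTF-8 counts come out as 1, 2, and 3 bytes, while UTF-16BE uses 2 bytes for each of the three.</p>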
<h3>From bytes to Strings</h3>
<p>The character encoding describes how to map a byte array (byte[]) to a char array (char[]), and vice versa.  Strings are just wrappers around char[]s, so this applies to Strings also.  The important thing with the mapping is how it describes instances when more than one byte in the array maps to a single char value.  A char is 16 bits wide, which allows a single char to represent any Unicode code point from U+0000 to U+FFFF.  This range is known as the <a href="http://en.wikipedia.org/wiki/Basic_multilingual_plane">Basic Multilingual Plane</a> and includes every language that a general-purpose Java application can be expected to support.  Code points outside this range are stored as a pair of chars (a surrogate pair).  If your app needs to support Cuneiform or Phoenician, you probably need to read something other than a blog post.</p>
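<p>For the curious, here is a brief sketch of what happens just past the edge of the Basic Multilingual Plane, using the Cuneiform code point U+12000. The class name BeyondTheBmp and the helper fromCodePoint are invented for the example.</p>

```java
public class BeyondTheBmp {

    // Builds a String holding exactly one code point.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        String snowman = fromCodePoint(0x2603);    // inside the BMP
        String cuneiform = fromCodePoint(0x12000); // CUNEIFORM SIGN A, outside the BMP

        System.out.println(snowman.length());   // 1 char
        System.out.println(cuneiform.length()); // 2 chars: a surrogate pair
        System.out.println(cuneiform.codePointCount(0, cuneiform.length())); // still 1 code point
        System.out.println(Character.isHighSurrogate(cuneiform.charAt(0)));  // true
    }
}
```

<p>Code that indexes Strings by char will see such a character as two separate values, which is why the code point methods exist.</p>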
<h3>Encoding support</h3>
<p>Every Java implementation must support US-ASCII, ISO-8859-1, UTF-8, UTF-16BE, UTF-16LE, and UTF-16 (with byte order mark).  US-ASCII and UTF-8 you should recognize.  ISO-8859-1 is commonly referred to as Latin-1 and is usually used when only &#034;Western European&#034; languages need to be supported. It&#039;s related to the Windows-1252 encoding used by default on older Windows OSes. UTF-16BE and UTF-16LE encode either as big endian or little endian, which will give a speedup for certain platforms.  The default UTF-16 scheme includes the code point U+FEFF as the first two bytes of a document (called the <a href="http://en.wikipedia.org/wiki/Byte_Order_Mark">byte order mark</a>), the order of which determines if the rest of the document is big endian or little endian.</p>
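<p>The difference between the three UTF-16 flavors shows up as soon as you dump the bytes. A small sketch, with a hypothetical class ByteOrderMarkDemo and helper hex:</p>

```java
import java.nio.charset.Charset;

public class ByteOrderMarkDemo {

    // Encodes the text and renders the bytes as space-separated hex pairs.
    static String hex(String text, String charsetName) {
        StringBuilder sb = new StringBuilder();
        for (byte b : text.getBytes(Charset.forName(charsetName))) {
            sb.append(String.format("%02X ", b));
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(hex("a", "UTF-16BE")); // 00 61
        System.out.println(hex("a", "UTF-16LE")); // 61 00
        System.out.println(hex("a", "UTF-16"));   // FE FF 00 61 (BOM first, then big endian)
    }
}
```

<p>When encoding with plain UTF-16, Java writes a big-endian byte order mark followed by big-endian data; when decoding, it uses the mark to pick the byte order.</p>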
<p>However, most Java implementations support a lot more. For instance, MacOS X Java 6 supports: Big5, Big5-HKSCS, EUC-JP, EUC-KR, GB18030, GB2312, GBK, IBM-Thai, IBM00858, IBM01140, IBM01141, IBM01142, IBM01143, IBM01144, IBM01145, IBM01146, IBM01147, IBM01148, IBM01149, IBM037, IBM1026, IBM1047, IBM273, IBM277, IBM278, IBM280, IBM284, IBM285, IBM297, IBM420, IBM424, IBM437, IBM500, IBM775, IBM850, IBM852, IBM855, IBM857, IBM860, IBM861, IBM862, IBM863, IBM864, IBM865, IBM866, IBM868, IBM869, IBM870, IBM871, IBM918, ISO-2022-CN, ISO-2022-JP, ISO-2022-JP-2, ISO-2022-KR, ISO-8859-1, ISO-8859-13, ISO-8859-15, ISO-8859-2, ISO-8859-3, ISO-8859-4, ISO-8859-5, ISO-8859-6, ISO-8859-7, ISO-8859-8, ISO-8859-9, JIS_X0201, JIS_X0212-1990, KOI8-R, KOI8-U, MacRoman, Shift_JIS, TIS-620, US-ASCII, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, UTF-8, windows-1250, windows-1251, windows-1252, windows-1253, windows-1254, windows-1255, windows-1256, windows-1257, windows-1258, windows-31j, x-Big5-Solaris, x-euc-jp-linux, x-EUC-TW, x-eucJP-Open, x-IBM1006, x-IBM1025, x-IBM1046, x-IBM1097, x-IBM1098, x-IBM1112, x-IBM1122, x-IBM1123, x-IBM1124, x-IBM1381, x-IBM1383, x-IBM33722, x-IBM737, x-IBM834, x-IBM856, x-IBM874, x-IBM875, x-IBM921, x-IBM922, x-IBM930, x-IBM933, x-IBM935, x-IBM937, x-IBM939, x-IBM942, x-IBM942C, x-IBM943, x-IBM943C, x-IBM948, x-IBM949, x-IBM949C, x-IBM950, x-IBM964, x-IBM970, x-ISCII91, x-ISO-2022-CN-CNS, x-ISO-2022-CN-GB, x-iso-8859-11, x-JIS0208, x-JISAutoDetect, x-Johab, x-MacArabic, x-MacCentralEurope, x-MacCroatian, x-MacCyrillic, x-MacDingbat, x-MacGreek, x-MacHebrew, x-MacIceland, x-MacRomania, x-MacSymbol, x-MacThai, x-MacTurkish, x-MacUkraine, x-MS932_0213, x-MS950-HKSCS, x-mswin-936, x-PCK, x-SJIS_0213, x-UTF-16LE-BOM, X-UTF-32BE-BOM, X-UTF-32LE-BOM, x-windows-50220, x-windows-50221, x-windows-874, x-windows-949, x-windows-950, x-windows-iso2022jp.</p>
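<p>To see the list for your own JVM, something like the sketch below should do. The class name ListCharsets and the helper hasRequiredCharsets are made up; the list printed will differ between platforms and Java versions.</p>

```java
import java.nio.charset.Charset;

public class ListCharsets {

    // The six encodings every Java implementation is required to support.
    static final String[] REQUIRED = {
        "US-ASCII", "ISO-8859-1", "UTF-8", "UTF-16BE", "UTF-16LE", "UTF-16"
    };

    static boolean hasRequiredCharsets() {
        for (String name : REQUIRED) {
            if (!Charset.isSupported(name)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // availableCharsets() returns a sorted map keyed by canonical name.
        for (String name : Charset.availableCharsets().keySet()) {
            System.out.println(name);
        }
        System.out.println("required charsets present: " + hasRequiredCharsets());
    }
}
```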
<p>In the next part, I&#039;ll discuss using Readers and Writers with Unicode.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/20/unicode-in-java-primitives-and-encodings-part-2/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Unicode in Java: introduction (part 1)</title>
		<link>http://www.philvarner.com/blog/2009/10/19/unicode-in-java-part-1/</link>
		<comments>http://www.philvarner.com/blog/2009/10/19/unicode-in-java-part-1/#comments</comments>
		<pubDate>Tue, 20 Oct 2009 04:37:05 +0000</pubDate>
		<dc:creator>Phil</dc:creator>
				<category><![CDATA[Uncategorized]]></category>

		<guid isPermaLink="false">http://www.philvarner.com/blog/?p=220</guid>
		<description><![CDATA[The bad old days
A long time ago, things were much easier for programmers.  The only computers anyone cared about were in the US, and these computers only needed to render &#034;normal&#034; letters like &#034;a&#034; and &#034;Q&#034;.  Then the internet came along, and we realized that there were all of these other people in [...]]]></description>
			<content:encoded><![CDATA[<h3>The bad old days</h3>
<p>A long time ago, things were much easier for programmers.  The only computers anyone cared about were in the US, and these computers only needed to render &#034;normal&#034; letters like &#034;a&#034; and &#034;Q&#034;.  Then the internet came along, and we realized that there were all of these other people in the world that had other languages with crazy letters like ð and ß and བོ , and even symbols that represent entire words like 中 and 말.</p>
<p>Back then, most programmers only needed to worry about [0-9a-zA-Z], and these were most commonly represented as <a href="http://en.wikipedia.org/wiki/Ascii">ASCII</a>.  Each character was encoded in 7 bits and padded with one extra bit to make an 8 bit sequence, so only 128 characters in total could be represented.  </p>
<p>Unfortunately, 8 bits can&#039;t represent the thousands of basic units of the languages used throughout the world. We use the word grapheme to describe these basic units because they vary widely between languages. For example, in English this could be a letter like &#034;A&#034; and in Chinese it could be an ideograph like 中.  Before Unicode, there were dozens of other schemes in common use that covered different subsets of the problem, but none of them provided a unified approach.  For example, ISO 8859-1 was commonly used for Western European languages and ISO 8859-2 for Central European languages that use diacritics (commonly called &#034;accented&#034; characters); ISO 8859-7 for Greek; KOI8, ISO 8859-5, and CP1251 for Cyrillic alphabets (e.g., Russian and Ukrainian); EUC and Shift-JIS for Japanese; BIG5 for traditional Chinese characters (Taiwan); GB for simplified Chinese characters (China).  </p>
<p>If you wanted to mix these together in the same text string, good luck. </p>
<h3>Unicode to the rescue</h3>
<p>To solve this issue, Unicode and a series of encodings were created. Unicode itself is only a consistent way of naming the graphemes and does not describe how they should be encoded into a bit pattern.  </p>
<p>Each Unicode character is referred to by a hexadecimal number of four to six digits prefixed by &#034;U+&#034;, so &#034;A&#034; is represented by U+0041 and described as &#034;LATIN CAPITAL LETTER A&#034;, and U+2603 is &#034;SNOWMAN&#034; (not kidding: ☃).  ASCII had so few characters that the description of which character is which and the bit encoding of the characters aren&#039;t separated.  In Unicode they are, so you don&#039;t have to describe the Icelandic character ð as &#034;that d with the slash in it&#034;, and can instead refer to it by a standardized code, U+00F0.  It gets even messier with some Asian languages that share what is essentially the same grapheme, but written in different ways (see <a href="http://en.wikipedia.org/wiki/Han_unification">Han unification</a>).  There are also a significant number of symbol-like things in Unicode, so the casual observer would not be able to tell ☸  (wheel of dharma, U+2638) from ⎈ (helm symbol, U+2388).  Unicode makes it very explicit which grapheme is which.</p>
<p>To reiterate, Unicode doesn&#039;t describe how the character should be represented in bits (encoded) nor does it describe what the character should actually look like when displayed.  It&#039;s only providing a mapping between numbers (called code points) like U+0041 and U+2603 and abstract things, like English letters, Chinese ideographs, and snowpersons.</p>
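<p>As a rough illustration of the code point idea, the sketch below looks up the number and the official name for a few of the characters mentioned above. The class name <code>CodePointDemo</code> and the <code>notation</code> helper are invented for this example, and it assumes Java 7 or later, where <code>Character.getName</code> is available.</p>

```java
// Illustrative sketch only: maps characters to their code points and names.
final class CodePointDemo {

    // Format a code point in the conventional U+XXXX notation.
    static String notation(int codePoint) {
        return String.format("U+%04X", codePoint);
    }

    public static void main(String[] args) {
        // Letter A, Icelandic eth, and the snowman, written as escapes so the
        // source file's own encoding does not matter.
        String[] samples = { "A", "\u00F0", "\u2603" };
        for (String s : samples) {
            int codePoint = s.codePointAt(0);
            System.out.println(notation(codePoint) + " " + Character.getName(codePoint));
        }
    }
}
```

<p>Note that nothing here says how the characters are stored or drawn; the program only deals with the numbers and their names.</p>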
<h3>Character encoding</h3>
<p>The next issue is, how do we physically store these Unicode code points as bits?  This is referred to as a character encoding, and describes a mapping between the code points and a sequence of bits (although it probably should be referred to as grapheme encoding).  In ASCII, each character is stored in 8 bits, but 8 bits limit the number of characters that can be represented to 256.  To represent the thousands of Unicode code points, we need to have an encoding that uses more than 8 bits.  However, we already have millions of files that are encoded in 8 bits with ASCII. Ideally we&#039;d like our new encoding to be backwards compatible, so our legacy ASCII files aren&#039;t garbled when they are read as if they were in our new encoding.  This is where UTF-8 comes in.</p>
<p>UTF-8 is an encoding for Unicode code points; the &#034;UTF&#034; stands for Unicode Transformation Format.  UTF-8 is known as a variable-length encoding because some code points are represented by 8 bits and others by 16, 24, or 32 bits.  The cool thing is that all of the characters which can be represented in ASCII have the same bit encodings in ASCII and UTF-8, so trying to read an ASCII-encoded file as UTF-8 will just work. Trying to read a UTF-8 encoded file as if it were an 8 bit encoding (as many Unicode-ignorant programs do) results in each multi-byte character being read as if it were several separate 8 bit characters. None of those bytes is a valid ASCII character, so instead of a Chinese character, you get three characters of accented-letter and punctuation garbage from whatever 8 bit code page the program assumes.</p>
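<p>The variable-length behavior is easy to check from Java. The following sketch counts the UTF-8 bytes needed for a few characters and then deliberately decodes UTF-8 bytes with the wrong charset. The class and method names are hypothetical, <code>StandardCharsets</code> assumes Java 7 or later, and ISO-8859-1 stands in for &#034;a program that treats every byte as one character&#034;, since strict ASCII has no characters at all for bytes above 0x7F.</p>

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only: shows UTF-8 byte counts and a mis-decoding.
final class Utf8LengthDemo {

    // Number of bytes UTF-8 needs for the given text.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Decode UTF-8 bytes with the wrong charset, one character per byte.
    static String misreadAsLatin1(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // A (U+0041), eth (U+00F0), and the Chinese character U+4E2D.
        System.out.println("A:      " + utf8Length("A") + " byte(s)");
        System.out.println("U+00F0: " + utf8Length("\u00F0") + " byte(s)");
        System.out.println("U+4E2D: " + utf8Length("\u4E2D") + " byte(s)");

        // ASCII text is byte-for-byte identical in ASCII and UTF-8.
        byte[] ascii = "hello".getBytes(StandardCharsets.US_ASCII);
        byte[] utf8 = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("Same bytes for ASCII text: " + Arrays.equals(ascii, utf8));

        // One Chinese character turns into several unrelated characters.
        String garbled = misreadAsLatin1("\u4E2D");
        System.out.println("Misread length: " + garbled.length() + " characters");
    }
}
```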
<p>UTF-16 is similar to UTF-8, but instead of encoding characters as multiples of 8 bits, all characters are encoded as multiples of 16 bits.  The drawback here is that if the text primarily consists of characters in the ASCII range, it takes up twice the amount of storage space.  Also, files which contain mostly ASCII can&#039;t be read at all in editors which don&#039;t understand UTF-16, whereas the same text in UTF-8 would only display the characters outside of the ASCII range incorrectly.</p>
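<p>The storage trade-off can be measured directly. This is a minimal sketch, with the made-up names <code>Utf16SizeDemo</code> and <code>byteLength</code> and assuming Java 7 or later for <code>StandardCharsets</code>, that compares encoded sizes for ASCII text, a Chinese character, and a code point above U+FFFF that needs two 16 bit units in UTF-16.</p>

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustrative sketch only: compares encoded sizes in UTF-8 and UTF-16.
final class Utf16SizeDemo {

    // Number of bytes the given charset needs for the text.
    static int byteLength(String text, Charset charset) {
        return text.getBytes(charset).length;
    }

    static void report(String label, String text) {
        System.out.println(label
                + ": UTF-8 = " + byteLength(text, StandardCharsets.UTF_8)
                + " bytes, UTF-16BE = " + byteLength(text, StandardCharsets.UTF_16BE)
                + " bytes");
    }

    public static void main(String[] args) {
        // UTF-16BE is used so that no byte order mark inflates the counts.
        report("ASCII text", "hello");
        report("U+4E2D", "\u4E2D");

        // A code point above U+FFFF takes two 16 bit units in UTF-16.
        report("U+1F600", new String(Character.toChars(0x1F600)));
    }
}
```

<p>For the ASCII text UTF-16 should come out at twice the size of UTF-8, while for the Chinese character it is actually smaller, so which encoding is more compact depends on the text.</p>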
<h3>Fonts</h3>
<p>The final piece of this is fonts.  A font describes how a character (code point) should be displayed on the screen.  Useful fonts contain glyphs that people recognize.  Before Unicode was prevalent and we could use U+2620 to represent a skull and crossbones (☠), there were fonts like Wingdings that displayed a symbol in place of a letter.  For example, &#034;N&#034; in Wingdings is a skull and crossbones, but it&#039;s still (technically) an N; it&#039;s just that no one would recognize it as such.  It&#039;s very important to recognize the difference between the code point, the character encoding, and the font describing the visual display.</p>
<p>In the next part, we&#039;ll discuss how Unicode and character encodings are used in Java.</p>
<h3>Additional Resources</h3>
<p>Joel Spolsky&#039;s <a href="http://www.joelonsoftware.com/articles/Unicode.html">great intro to Unicode in general</a>, which sounds a lot like this post<br />
Jukka K. Korpela&#039;s <a href="http://www.cs.tut.fi/~jkorpela/chars.html">tutorial on character code issues</a></p>
]]></content:encoded>
			<wfw:commentRss>http://www.philvarner.com/blog/2009/10/19/unicode-in-java-part-1/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
	</channel>
</rss>
