Tuesday, December 2, 2014

A Charset Encoding Problem Related To The 'file.encoding' And 'sun.jnu.encoding' Parameters In Java When Executing A MapReduce Job On A DataNode

When executing a MapReduce job, we found that some Chinese data in the logs was displayed as '??????', which eventually corrupted our final results.

We logged in to the suspicious DataNode and ran the following Java program to check the runtime encoding-related parameters.
// --FileEncodingTest.java--
import java.net.URLDecoder;
import java.util.Properties;

public class FileEncodingTest {
    public static void main(String[] args) throws Exception {
        // Dump all JVM system properties, including the encoding-related ones.
        Properties properties = System.getProperties();
        for (Object key : properties.keySet()) {
            System.out.println(String.format("propertyName: %s, propertyValue: %s",
                    key, properties.getProperty(key.toString())));
        }
        // Decode a percent-encoded UTF-8 string ("乐视视频") to see whether
        // Chinese characters survive the JVM's default encoding.
        String originContent = "%E4%B9%90%E8%A7%86%E8%A7%86%E9%A2%91";
        System.out.println(URLDecoder.decode(originContent, "utf-8"));
    }
}

// --Shell Command--
$ javac FileEncodingTest.java
$ java FileEncodingTest | grep encod --color
propertyName: file.encoding.pkg, propertyValue: sun.io
propertyName: sun.jnu.encoding, propertyValue: ANSI_X3.4-1968
propertyName: file.encoding, propertyValue: ANSI_X3.4-1968
propertyName: sun.io.unicode.encoding, propertyValue: UnicodeLittle

We can see that 'sun.jnu.encoding' and 'file.encoding' both default to "ANSI_X3.4-1968" (i.e., US-ASCII), which is not what we expect.
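
To see why this matters, here is a minimal illustrative sketch (the class name DefaultCharsetTest is ours, not part of the original job), showing that 'file.encoding' determines Charset.defaultCharset() and that encoding Chinese text through a US-ASCII default charset replaces every character with '?':

// --DefaultCharsetTest.java--
import java.net.URLDecoder;
import java.nio.charset.Charset;

public class DefaultCharsetTest {
    public static void main(String[] args) throws Exception {
        // Charset.defaultCharset() is derived from 'file.encoding' at JVM
        // startup; under ANSI_X3.4-1968 it reports US-ASCII.
        System.out.println(Charset.defaultCharset());

        // Build "乐视视频" from percent-escapes so that the source-file
        // encoding cannot interfere with the experiment.
        String s = URLDecoder.decode("%E4%B9%90%E8%A7%86%E8%A7%86%E9%A2%91", "UTF-8");

        // Encoding with the default charset replaces unmappable characters
        // with '?'; under US-ASCII this reproduces the '??????' in our logs.
        byte[] bytes = s.getBytes(Charset.defaultCharset());
        System.out.println(new String(bytes, Charset.defaultCharset()));
    }
}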

Curiously enough, the default values of these two parameters are "UTF-8" when my colleague SSHes to the same DataNode from his own machine. So the problem is related to the configuration of the SSH client machine!

We found in this post that all the LC_* environment variables on the local machine are carried over to the remote node, provided they are not explicitly set on the remote node; the SSH mechanism behind this is shown below. After checking `locale` on both machines, the trouble spot was pinpointed.
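
The forwarding itself is done by SSH: the client's ssh_config typically contains a SendEnv directive, and the server's sshd_config a matching AcceptEnv directive for the locale variables. The excerpts below show what these defaults commonly look like; paths and exact contents vary by OS release, so treat them as assumptions to verify locally.

# --on the Mac OS X client--
$ grep SendEnv /etc/ssh/ssh_config
    SendEnv LANG LC_*

# --on the CentOS DataNode--
$ grep AcceptEnv /etc/ssh/sshd_config
AcceptEnv LANG LC_CTYPE LC_NUMERIC LC_TIME LC_COLLATE LC_MONETARY LC_MESSAGES
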
# --command invoked in local machine--
$ locale
LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

# --command invoked in remote DataNode--
$ locale
locale: Cannot set LC_CTYPE to default locale: No such file or directory
locale: Cannot set LC_ALL to default locale: ?????????
LANG=zh_CN.UTF-8
LC_CTYPE=UTF-8
LC_NUMERIC="zh_CN.UTF-8"
LC_TIME="zh_CN.UTF-8"
LC_COLLATE="zh_CN.UTF-8"
LC_MONETARY="zh_CN.UTF-8"
LC_MESSAGES="zh_CN.UTF-8"
LC_PAPER="zh_CN.UTF-8"
LC_NAME="zh_CN.UTF-8"
LC_ADDRESS="zh_CN.UTF-8"
LC_TELEPHONE="zh_CN.UTF-8"
LC_MEASUREMENT="zh_CN.UTF-8"
LC_IDENTIFICATION="zh_CN.UTF-8"
LC_ALL=

As we can see from the output above, 'LC_CTYPE' on the local machine is set to "UTF-8", which is a valid locale name on Mac OS X. When it is carried over to the remote DataNode, however, it is not recognized by CentOS, whose locale names take the 'language_TERRITORY.codeset' form (e.g. 'zh_CN.UTF-8'). Consequently, "LC_CTYPE=UTF-8" has no effect there, and the fallback 'ANSI_X3.4-1968' is applied when running Java programs.
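
A quick way to confirm which locale names the DataNode actually recognizes is `locale -a`. The output below is roughly what a CentOS box is expected to print (the exact list is an assumption and varies per installation); note that a bare "UTF-8" entry is absent, which is why the forwarded LC_CTYPE falls through:

# --command invoked in remote DataNode--
$ locale -a | grep -i 'en_US\|zh_CN'
en_US
en_US.utf8
zh_CN
zh_CN.utf8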

There are two ways to solve this problem; the first is less preferable than the second:

#1. Set the above two Java parameters explicitly every time we run a Java program, as presented in this post:
$ java -Dsun.jnu.encoding=UTF-8 -Dfile.encoding=UTF-8 FileEncodingTest | grep encod --color
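
Since the JVMs that actually write our logs are spawned as MapReduce child tasks, these flags would presumably need to reach the task JVMs as well. A hedged sketch using the classic (pre-YARN) property 'mapred.child.java.opts'; the jar and class names are placeholders, and this assumes the job parses generic options via ToolRunner:

$ hadoop jar our-job.jar com.example.OurJob \
    -D mapred.child.java.opts="-Dfile.encoding=UTF-8 -Dsun.jnu.encoding=UTF-8"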

#2. Set the locale parameters explicitly on the DataNode so that they no longer depend on the SSH client environment. The only thing we need to do is append the following content to "/etc/profile":
$ su root
$ vim /etc/profile
export LANG=en_US.UTF-8
export LC_CTYPE=en_US.UTF-8
export LC_NUMERIC=en_US.UTF-8
export LC_TIME=en_US.UTF-8
export LC_COLLATE=en_US.UTF-8
export LC_MONETARY=en_US.UTF-8
export LC_MESSAGES=en_US.UTF-8
export LC_PAPER=en_US.UTF-8
export LC_NAME=en_US.UTF-8
export LC_ADDRESS=en_US.UTF-8
export LC_TELEPHONE=en_US.UTF-8
export LC_MEASUREMENT=en_US.UTF-8
export LC_IDENTIFICATION=en_US.UTF-8
export LC_ALL=en_US.UTF-8
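
After logging in again (or running `source /etc/profile` in the current session), re-running the test program should report UTF-8 for both parameters; the output below is what we expect to see, not a capture:

$ source /etc/profile
$ java FileEncodingTest | grep encod --color
propertyName: file.encoding.pkg, propertyValue: sun.io
propertyName: sun.jnu.encoding, propertyValue: UTF-8
propertyName: file.encoding, propertyValue: UTF-8
propertyName: sun.io.unicode.encoding, propertyValue: UnicodeLittle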

This time, the above two Java parameters will always be set to "UTF-8" when running Java programs, no matter which client machine we SSH in from.



© 2014-2017 jason4zhu.blogspot.com All Rights Reserved
If reposting, please credit the original source: Jason4Zhu
