Tuesday, January 27, 2015

Deploy Tez Based On Hadoop-2.2.0

Tez is a computing engine parallel to MapReduce, whose target is to build an application framework which allows for a complex directed-acyclic-graph (DAG) of tasks for processing data. It is currently built atop Apache Hadoop YARN.

The most significant advantage of Tez against MapReduce is that Disk IO will be saved when there's multiple MR tasks which are to be executed in series in Hive. This in-memory computing mechanism is somewhat like Spark.

Now, the procedure of deploying Tez on Hadoop-2.2.0 is shown as below.

--CAVEAT--
1. The Official Deploy Instruction For Tez is absolutely suitable for all release versions of Tez, except the incubating version. Thus, the following deploy instruction is not exactly the same as the official one. (Supplementary may sound more appropriate)
2. In the official document, it says that we have to change hadoop.version to our currently-using version, which is not true after verifying. For instance, there will be ERRORs when execute `mvn clean package ...` provided we change hadoop.version from 2.6.0 to 2.2.0 forcibly in Tez-0.6.0. Consequently, we have to use tez-0.4.1-incubating whose default setting of hadoop.version is 2.2.0.

Ok, now let's get back on track!

Firstly, we have to install JDK6 or later, Maven 3 or later and Protocol Buffers (protoc compiler) 2.5 or later as prerequisite, whose procedure is omitted.

Retrieve tez-0.4.1-incubating from official website and decompress it:
wget http://archive.apache.org/dist/incubator/tez/tez-0.4.1-incubating/tez-0.4.1-incubating-src.tar.gz
tar xzf tez-0.4.1-incubating-src.tar.gz

Check hadoop.version, protobuf.version and 'hardcode' protoc.path as is shown below:
<properties>
    <maven.test.redirectTestOutputToFile>true</maven.test.redirectTestOutputToFile>
    <clover.license>${user.home}/clover.license</clover.license>
    <hadoop.version>2.2.0</hadoop.version>
    <jetty.version>7.6.10.v20130312</jetty.version>
    <distMgmtSnapshotsId>apache.snapshots.https</distMgmtSnapshotsId>
    <distMgmtSnapshotsName>Apache Development Snapshot Repository</distMgmtSnapshotsName>
    <distMgmtSnapshotsUrl>https://repository.apache.org/content/repositories/snapshots</distMgmtSnapshotsUrl>
    <distMgmtStagingId>apache.staging.https</distMgmtStagingId>
    <distMgmtStagingName>Apache Release Distribution Repository</distMgmtStagingName>
    <distMgmtStagingUrl>https://repository.apache.org/service/local/staging/deploy/maven2</distMgmtStagingUrl>
    <failIfNoTests>false</failIfNoTests>
    <protobuf.version>2.5.0</protobuf.version>
    <protoc.path>/usr/local/bin/protoc</protoc.path>
    <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
    <scm.url>scm:git:https://git-wip-us.apache.org/repos/asf/incubator-tez.git</scm.url>
  </properties>

Execute maven package command.
mvn clean package -DskipTests=true -Dmaven.javadoc.skip=true

After building, we could find all the compiled jar files in '$TEZ_HOME/tez-dist/target/tez-0.4.1-incubating-full/tez-0.4.1-incubating-full/', assuming which as environment variable $TEZ_JARS.

Find a HDFS path, in which $TEZ_JARS will be uploaded. In my case, '/user/supertool/zhudi/tez-dist' is applied.
hadoop fs -copyFromLocal $TEZ_JARS /user/supertool/zhudi/tez-dist

Create a tez-site.xml in '$HADOOP_HOME/etc/hadoop', add the following content which refers to the HDFS path. Be sure that the HDFS path is in full-path format, that is to say, with 'hdfs://ns1' header.
 <configuration>
     <property>
         <name>tez.lib.uris</name>
        <value>hdfs://ns1/user/supertool/zhudi/tez-dist/tez-0.4.1-incubating-full,hdfs://ns1/user/supertool/zhudi/tez-dist/tez-0.4.1-incubating-full/lib</value>
     </property>
 </configuration>

Eventually, add the following content to ~/.bashrc and `source ~/.bashrc`.
 export TEZ_CONF_DIR=/home/workspace/tez-0.4.1-incubating-src
 export TEZ_JARS=/home/workspace/tez-0.4.1-incubating-src/tez-dist/target/tez-0.4.1-incubating-full/tez-0.4.1-incubating-full
 export HADOOP_CLASSPATH=${TEZ_CONF_DIR}:${TEZ_JARS}/*:${TEZ_JARS}/lib/*

We could run the tez-examples.jar, which is a MapReduce task, for testing:
hadoop jar /home/workspace/tez-0.4.1-incubating-src/tez-mapreduce-examples/target/tez-mapreduce-examples-0.4.1-incubating.jar orderedwordcount /user/supertool/zhudi/mrTest/input /user/supertool/zhudi/mrTest/output

For hive, simply add the following command before executing HQL.
set hive.execution.engine=tez;

If 'hive.input.format' need to be specified when applying MapReduce Computing Engine, which is default, remember to append the following command when switching to Tez:
set hive.input.format=com.XXX.RuntimeCombineHiveInputFormat;
set hive.tez.input.format=com.XXX.RuntimeCombineHiveInputFormat;

Likewise, if 'mapred.job.queue.name' need to be specified, replace it with 'tez.queue.name'.


One last thing: Only the gateway node, which is going to submit tasks using Tez, in Hadoop cluster needs to be deployed.


Possible ERROR #1:
When using custom UDF in hive/tez, there are times that the exactly same task failed whereas in other times, it succeeded. After looking through the detailed log retrieved by `yarn logs -applicationId <app_id>`, the following ERROR could be found:
java.lang.NoSuchMethodError: org.apache.commons.collections.CollectionUtils.isEmpty(Ljava/util/Collection;)Z
at com.XXX.inputformat.hive.SplitInfo.mergeSplitFiles(SplitInfo.java:86)
at com.XXX.inputformat.hive.RuntimeCombineHiveInputFormat.getSplits(RuntimeCombineHiveInputFormat.java:105)
at org.apache.tez.mapreduce.hadoop.MRHelpers.generateOldSplits(MRHelpers.java:263)
at org.apache.tez.mapreduce.hadoop.MRHelpers.generateInputSplitsToMem(MRHelpers.java:379)
at org.apache.tez.mapreduce.common.MRInputAMSplitGenerator.initialize(MRInputAMSplitGenerator.java:161)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:154)
at org.apache.tez.dag.app.dag.RootInputInitializerRunner$InputInitializerCallable$1.run(RootInputInitializerRunner.java:146)

Then I looked into $HADOOP_HOME/share/hadoop/common/lib/ and $HIVE_HOME/lib, finding that the version of commons-collections.jar is 3.2.1 and 3.1 respectively. Then I found out that there is no 'org.apache.commons.collections.CollectionUtils.isEmpty' method in version 3.1. It is obvious that the culprit is maven dependency confliction. Thus, I replaced the 3.1 with 3.2.1 and all things just worked out fine.


References:
1. Official Deploy Instruction For Tez
2. Deploy Tez on Hadoop 2.2.0 - CSDN


© 2014-2017 jason4zhu.blogspot.com All Rights Reserved 
If transfering, please annotate the origin: Jason4Zhu

No comments:

Post a Comment