Wednesday, November 12, 2014

Dig Into JobHistory Server Of MapReduce In Hadoop2

The JobHistory Server is a standalone module in Hadoop 2, started and stopped separately from start-all.sh and stop-all.sh. It serves as the job history logger, recording everything about a MapReduce job, from submission to completion, in the configured filesystem.

JobHistory logs can be browsed from the JobHistory Server's web UI.

Configuration & Command

There are two properties related to the startup address and the monitoring web page of the JobHistory Server:
<property>
    <name>mapreduce.jobhistory.address</name>
    <value>host:10020</value>
</property>
<property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>host:19888</value>
</property>
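Assuming these properties live in mapred-site.xml, a quick way to sanity-check what is actually configured, without a running cluster, is to pull the values straight out of the file. The `/tmp` file path and the `get_prop` helper below are illustrative only, not part of Hadoop:

```shell
# Write a scratch copy of the two properties above for demonstration.
cat > /tmp/mapred-site-demo.xml <<'EOF'
<configuration>
  <property>
    <name>mapreduce.jobhistory.address</name>
    <value>host:10020</value>
  </property>
  <property>
    <name>mapreduce.jobhistory.webapp.address</name>
    <value>host:19888</value>
  </property>
</configuration>
EOF

# Hypothetical helper: print the <value> that follows a given <name>.
get_prop() {
  grep -A1 "<name>$1</name>" /tmp/mapred-site-demo.xml \
    | sed -n 's|.*<value>\(.*\)</value>.*|\1|p'
}

get_prop mapreduce.jobhistory.webapp.address   # prints host:19888
```

The same one-liner works against the real `${HADOOP_CONF_DIR}/mapred-site.xml` for a quick check before restarting the daemon.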

And another three properties control where job history files are stored:

  yarn.app.mapreduce.am.staging-dir
      value: /tmp/hadoop-yarn/staging
      The staging directory used while submitting jobs.

  mapreduce.jobhistory.intermediate-done-dir
      value: ${yarn.app.mapreduce.am.staging-dir}/history/done_intermediate
      The directory into which completed jobs' history files are first copied.

  mapreduce.jobhistory.done-dir
      value: ${yarn.app.mapreduce.am.staging-dir}/history/done
      The directory in which the JobHistory Server permanently archives history files.

We'd better `mkdir` and `chmod` the above three directories ourselves:
hadoop  fs  -mkdir  -p  /tmp/hadoop-yarn/staging/history/done_intermediate
hadoop  fs  -chmod  -R  777  /tmp/hadoop-yarn/staging/history/done_intermediate
hadoop  fs  -mkdir  -p  /tmp/hadoop-yarn/staging/history/done
hadoop  fs  -chmod  -R  777  /tmp/hadoop-yarn/staging/history/done

The commands to start and stop the JobHistory Server are straightforward:
${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh  start historyserver
${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh  stop  historyserver

Logging Procedure of the History Server

When a MapReduce application starts, the MapReduce Application Master writes history logs under ${yarn.app.mapreduce.am.staging-dir}/${current_user}/.staging/job_XXXXX_XXX. That directory holds three files, .jhist, .summary and .xml, representing the job history, the job summary and the configuration file, respectively.

When the application finishes, is killed, or fails, the log info is copied to ${mapreduce.jobhistory.intermediate-done-dir}/${current_user}. This step is implemented in org.apache.hadoop.mapreduce.jobhistory.JobHistoryEventHandler.

Once in ${mapreduce.jobhistory.intermediate-done-dir}/${current_user}, the job history file is eventually moved to ${mapreduce.jobhistory.done-dir} by org.apache.hadoop.mapreduce.v2.hs.HistoryFileManager.

The container logs for this procedure are recorded under ${HADOOP_HOME}/logs/userlogs (configured by 'yarn.nodemanager.log-dirs') on each NodeManager node, provided YARN log aggregation is not enabled.
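The three-stage movement of a job's history files described above can be sketched on a local filesystem, with no HDFS needed. The directory layout mirrors the defaults; the user name, job id, and file names are made up for illustration:

```shell
# Stage directories, mimicking the HDFS layout from the properties above.
STAGING=/tmp/jh-demo/staging/alice/.staging/job_1405000000000_0001
INTERMEDIATE=/tmp/jh-demo/history/done_intermediate/alice
DONE=/tmp/jh-demo/history/done

rm -rf /tmp/jh-demo
mkdir -p "$STAGING" "$INTERMEDIATE" "$DONE"

# 1. While the job runs, the MRAppMaster writes into the staging dir.
touch "$STAGING/job_1405000000000_0001.jhist" \
      "$STAGING/job_1405000000000_0001.summary" \
      "$STAGING/job_1405000000000_0001_conf.xml"

# 2. On completion, JobHistoryEventHandler copies them to the
#    intermediate done dir for the submitting user.
cp "$STAGING"/* "$INTERMEDIATE"/

# 3. HistoryFileManager later moves the files into the done dir.
mv "$INTERMEDIATE"/* "$DONE"/

ls "$DONE"   # lists the three history files
```

On a real cluster the same movement happens via `hadoop fs` operations, and the done dir is further subdivided by date, but the copy-then-move flow is the same.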

NullPointerException With History Server

We ran into a problem where some of our MapReduce jobs, especially long-running ones, threw a NullPointerException when the job completed. The stack trace is as follows:
14/07/22 06:37:11 INFO mapreduce.Job:  map 100% reduce 98%
14/07/22 06:37:44 INFO mapreduce.Job:  map 100% reduce 99%
14/07/22 06:38:30 INFO mapreduce.Job:  map 100% reduce 100%
14/07/22 06:39:02 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history
server
14/07/22 06:39:02 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history
server
14/07/22 06:39:02 INFO mapred.ClientServiceDelegate: Application state is
completed. FinalApplicationStatus=SUCCEEDED. Redirecting to job history
server
14/07/22 06:39:02 ERROR security.UserGroupInformation:
PriviledgedActionException as: rohitsarewar (auth:SIMPLE)
cause:java.io.IOException:
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
java.lang.NullPointerException
        at
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getTaskAttemptCompletionEvents(HistoryClientService.java:269)
        at
org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBServiceImpl.java:173)
        at
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:283)
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
Exception in thread "main" java.io.IOException:
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException):
java.lang.NullPointerException
        at
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolHandler.getTaskAttemptCompletionEvents(HistoryClientService.java:269)
        at
org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBServiceImpl.getTaskAttemptCompletionEvents(MRClientProtocolPBServiceImpl.java:173)
        at
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.callBlockingMethod(MRClientProtocol.java:283)
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:585)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
        at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDelegate.java:330)
        at
org.apache.hadoop.mapred.ClientServiceDelegate.getTaskCompletionEvents(ClientServiceDelegate.java:382)
        at
org.apache.hadoop.mapred.YARNRunner.getTaskCompletionEvents(YARNRunner.java:529)
        at org.apache.hadoop.mapreduce.Job$5.run(Job.java:668)
        at org.apache.hadoop.mapreduce.Job$5.run(Job.java:665)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at
org.apache.hadoop.mapreduce.Job.getTaskCompletionEvents(Job.java:665)
        at org.apache.hadoop.mapreduce.Job.monitorAndPrintJob(Job.java:1349)
        at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1289)
        at com.bigdata.mapreduce.esc.escDriver.main(escDriver.java:23)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:606)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:212)
Caused by:
org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): j
        at
org.apache.hadoop.mapreduce.v2.hs.HistoryClientService$HSClientProtocolH
        at
org.apache.hadoop.mapreduce.v2.api.impl.pb.service.MRClientProtocolPBSer
        at
org.apache.hadoop.yarn.proto.MRClientProtocol$MRClientProtocolService$2.
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(P
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:928)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2053)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2049)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformatio
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2047)
        at org.apache.hadoop.ipc.Client.call(Client.java:1347)
        at org.apache.hadoop.ipc.Client.call(Client.java:1300)
        at
org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine
        at com.sun.proxy.$Proxy12.getTaskAttemptCompletionEvents(Unknown
Source)
        at
org.apache.hadoop.mapreduce.v2.api.impl.pb.client.MRClientProtocolPBClie
        at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
        at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessor
        at java.lang.reflect.Method.invoke(Method.java:606)
        at
org.apache.hadoop.mapred.ClientServiceDelegate.invoke(ClientServiceDeleg
        ... 16 more

It appears that the remote history server object cannot be found when, after the MapReduce job is done, the client tries to invoke a method on it via IPC. We eventually solved the problem by setting 'dfs.client.socket-timeout' to '3600000' (one hour) for the JobHistory service. Because of high pressure on our HDFS cluster, requests to HDFS may be delayed or hang, so this property has to be raised separately for the JobHistory service.

Notice that 'dfs.client.socket-timeout' in the hdfs-site.xml used by start/stop-dfs.sh should stay considerably lower than '3600000', say '60000' or '180000', because a failed map task must wait out exactly this timeout before it retries.

Thus the procedure is:
  1. Set 'dfs.client.socket-timeout' in hdfs-site.xml to 3600000.
  2. Start the JobHistory Server.
  3. Set 'dfs.client.socket-timeout' in hdfs-site.xml back to 180000.
  4. Run start-dfs.sh.
If juggling a shared configuration file like this is too error-prone, we can point the JobHistory Server at its own configuration directory at startup:
./mr-jobhistory-daemon.sh --config /home/supertool/hadoop-2.2.0/etc/job_history_server_config/  start historyserver
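The four-step timeout swap above can be sketched against a scratch copy of the file. The `/tmp` path and the `set_timeout` helper are hypothetical (on a real cluster you would edit ${HADOOP_CONF_DIR}/hdfs-site.xml and run the daemon scripts in between); GNU sed is assumed for `-i`:

```shell
# Scratch fragment standing in for the real hdfs-site.xml.
CONF=/tmp/hdfs-site-demo.xml
cat > "$CONF" <<'EOF'
<property>
    <name>dfs.client.socket-timeout</name>
    <value>180000</value>
</property>
EOF

# Hypothetical helper: rewrite the <value> element in place.
set_timeout() {
  sed -i "s|<value>[0-9]*</value>|<value>$1</value>|" "$CONF"
}

set_timeout 3600000    # step 1: raise the timeout for the history server
# ${HADOOP_HOME}/sbin/mr-jobhistory-daemon.sh start historyserver  # step 2
set_timeout 180000     # step 3: restore the lower timeout for HDFS
# ${HADOOP_HOME}/sbin/start-dfs.sh                                 # step 4

grep '<value>' "$CONF"   # prints the restored value, 180000
```

Using a dedicated --config directory, as shown above, avoids this back-and-forth entirely.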






© 2014-2017 jason4zhu.blogspot.com All Rights Reserved 
If reposting, please credit the origin: Jason4Zhu

Comments:

  1. I configured Hadoop 2.6 according to your instructions and started the history server, but the history web UI does not show any job information, even though the jobs' history files were generated and stored in HDFS. What could be the reason? Thanks.

     Reply: If no job information at all shows up in the JobHistory Server, then this post, which describes a random loss of job information, is probably orthogonal to your scenario. If I were you, I would first check the JobHistory Server's own log to make sure there are no errors or exceptions.
  2. I am also facing the same issue: my JobHistory Server is not retrieving the jobs I executed earlier. If you have found a solution, please let me know.

     Reply: Have you tried setting 'dfs.client.socket-timeout' for the JobHistory Server to 3600000?
  4. As far as I can tell, your JobHistory Server itself works without malfunction. Chances are that the culprit in your case is the mapred-site.xml configuration: either it fails to load, or it has been overridden by some hidden configuration file due to Hadoop's obscure configuration-loading precedence.