Wednesday, November 5, 2014

Fair Scheduler In YARN, Hadoop-2.2.0 - Experiment On Preemption

Since preemption in Hadoop-2.2.0 is said to be experimental, let's run an experiment to see how it actually works.

Our Allocation File (fair-scheduler.xml) is configured as follows:
<?xml version="1.0" encoding="utf-8"?>

<allocations>
  <queue name="default">
    <minResources>6000 mb,12vcores</minResources> 
    <maxResources>300000 mb,48vcores</maxResources> 
    <maxRunningApps>60</maxRunningApps> 
    <weight>1.0</weight> 
    <schedulingPolicy>fair</schedulingPolicy> 
    <minSharePreemptionTimeout>1</minSharePreemptionTimeout>
  </queue> 
  <queue name="supertool">
    <minResources>18000 mb,36vcores</minResources> 
    <maxResources>30000 mb,48vcores</maxResources> 
    <maxRunningApps>60</maxRunningApps> 
    <weight>1.0</weight> 
    <schedulingPolicy>fair</schedulingPolicy> 
    <minSharePreemptionTimeout>1</minSharePreemptionTimeout>
  </queue> 
  <userMaxAppsDefault>5</userMaxAppsDefault> 
  <fairSharePreemptionTimeout>1</fairSharePreemptionTimeout> 
  <defaultQueueSchedulingPolicy>fifo</defaultQueueSchedulingPolicy>
</allocations>
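
For reference, the Fair Scheduler itself is enabled in yarn-site.xml and pointed at this allocation file. A minimal sketch follows; the allocation-file path is only an example and should be adjusted to your own deployment:

<!-- yarn-site.xml: switch the ResourceManager to the Fair Scheduler -->
<property>
  <name>yarn.resourcemanager.scheduler.class</name>
  <value>org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler</value>
</property>
<!-- Location of the allocation file shown above (example path) -->
<property>
  <name>yarn.scheduler.fair.allocation.file</name>
  <value>/path/to/fair-scheduler.xml</value>
</property>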

As we can see, there are two queues, namely, root.default and root.supertool.

The total memory of the cluster is 24GB, and MaxResources of both queues is set above the total cluster memory, so neither queue is capped below the full cluster.

MinResources of root.supertool is 18GB, whereas that of root.default is 6GB; the former is three times the latter.

Both fairSharePreemptionTimeout and minSharePreemptionTimeout are set to 1 second, which means that as soon as preemption is needed, the scheduler is allowed to preempt resources almost immediately.

Our test case script is as below:
#!/bin/bash

source ~/.profile #load environment variable: HADOOP_HOME

#start pi-calculation MapReduce application on root.default for 4 times
nohup hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi -Dmapred.job.queue.name=root.default 12 1000000000 > nonono1.out 2>&1 &
nohup hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi -Dmapred.job.queue.name=root.default 12 1000000000 > nonono2.out 2>&1 &
nohup hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi -Dmapred.job.queue.name=root.default 12 1000000000 > nonono3.out 2>&1 &
nohup hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi -Dmapred.job.queue.name=root.default 12 1000000000 > nonono4.out 2>&1 &

#sleep for a while so that the above applications have time to start completely
echo "sleep 60 seconds."
sleep 60

#start one pi-calculation MapReduce application on root.supertool
nohup hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.2.0.jar pi -Dmapred.job.queue.name=root.supertool 12 1000000000 > nononom.out 2>&1 &

echo "done!"

First, 4 MapReduce applications are started in root.default. After 60 seconds, which gives those 4 applications enough time to fully start, another MapReduce application is launched in root.supertool.
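
While the script is running, queue usage can be watched either on the scheduler page of the ResourceManager web UI or from the command line. A hedged sketch, assuming a default installation with the web UI on port 8088:

# List the YARN applications currently known to the ResourceManager,
# together with the queue each one runs in
yarn application -list

# Per-queue usage bars (the snapshots below) are on the scheduler page:
#   http://<resourcemanager-host>:8088/cluster/scheduler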

Case 1: Preemption Off


<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>false</value>
</property>
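
This property lives in yarn-site.xml on the ResourceManager, and as far as I know it is only read at startup (the allocation file, by contrast, is reloaded periodically), so the ResourceManager is restarted between the two cases. A minimal sketch using the scripts shipped in the Hadoop 2.2.0 sbin directory:

# Restart the ResourceManager so that the changed yarn-site.xml takes effect
$HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager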

root.default occupies 100% of the cluster's resources once the first 4 applications are running.



After 60 seconds, when the application submitted to root.supertool starts, resources that are released normally by finishing mappers or reducers of the root.default applications are preferentially allocated to root.supertool, because root.supertool is far below its FairShare. The application in root.supertool nevertheless will not preempt resources from the applications in root.default, because preemption is turned off.

Here are several snapshots of the YARN monitor webpage, in chronological order:





As we can see, the ratio of resource occupation between root.supertool and root.default is 66.7/33.3 ≈ 2, which does not reach the 18GB/6GB = 3 ratio configured in fair-scheduler.xml.

FYI, the dashed box to the right of the green bar for root.supertool represents the amount of FairShare for that queue.



Case 2: Preemption On


<property>
  <name>yarn.scheduler.fair.preemption</name>
  <value>true</value>
</property>
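
If finer control over the preemption timing is wanted, FairSchedulerConfiguration exposes a couple of extra knobs. To the best of my knowledge the names and defaults below are correct for 2.2.0, but they are an assumption on my part and should be verified against your version before relying on them:

<!-- Assumed knobs: how often the scheduler checks whether preemption is needed,
     and how long a warned container is given before it is killed. -->
<property>
  <name>yarn.scheduler.fair.preemptionInterval</name>
  <value>5000</value> <!-- milliseconds, assumed default -->
</property>
<property>
  <name>yarn.scheduler.fair.waitTimeBeforeKill</name>
  <value>15000</value> <!-- milliseconds, assumed default -->
</property>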

This time, root.default again occupies 100% of the cluster's resources once the first 4 applications are running.

When the subsequent application is submitted to root.supertool after 60 seconds, preemption is on and root.supertool is below its FairShare, so the application will preempt resources from the applications in root.default. The related ApplicationMaster log for one of the applications in root.default is as follows:


From the log, we can see that the preemption procedure is transparent to the user. When preemption happens, records like 'org.apache.hadoop.mapreduce.v2.app.job.impl.TaskAttemptImpl: Diagnostics report from attempt_1414999804378_0022_m_000000_0: Container preempted by scheduler' are printed to the log.
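
A quick way to spot such preemption records after the fact is to grep the aggregated application logs; a hedged sketch, assuming log aggregation is enabled so that the yarn logs command can fetch them (the application id is taken from the log line quoted above):

# Fetch the aggregated logs of the preempted root.default application and
# filter for the preemption diagnostics
yarn logs -applicationId application_1414999804378_0022 | grep "Container preempted by scheduler"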

Again, snapshots of the YARN monitor webpage, in chronological order:



This time, when the application in root.supertool is launched, it preempts resources from the applications in root.default until its own FairShare is met. The ratio of resource occupation between root.supertool and root.default is 75%/25% = 3, which exactly matches the 18GB/6GB = 3 ratio set in fair-scheduler.xml. (Note that 18GB + 6GB equals the cluster's total of 24GB, so when each queue holds exactly its MinResources the cluster is fully allocated, which is why the usage settles at 75%/25%.)

In conclusion, with preemption off, the application in root.supertool will starve if the mappers and reducers of the applications in root.default all take a very long time, say a week. With preemption on, every queue is guaranteed at least its MinResources to run its applications, so no application starves and resources are shared fairly to that extent.


© 2014-2017 jason4zhu.blogspot.com All Rights Reserved 
If reposting, please annotate the origin: Jason4Zhu
