Saturday, November 22, 2014

A Record Of The Process Of Adding Memory Banks To DataNodes In Hadoop

There was an emergency requirement to increase the memory of the DataNodes in our Hadoop cluster. Here's the detailed process, which involves two phases: operations on the DataNodes and operations on YARN.

At the very beginning, we should back up the configuration files, which are in $HADOOP_HOME/etc/hadoop/*, in git.

At the same time, we should back up the runtime configuration of the JobHistory server, provided HistoryServer is enabled. This is well explained in another post of mine: Dig Into JobHistory Server Of MapReduce In Hadoop2. The runtime configuration can be found on the monitoring webpage of the JobHistory server:


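If you prefer the command line to a screenshot, the same runtime configuration can also be dumped from the /conf servlet that every Hadoop daemon exposes; jhs-host:19888 below is only a placeholder for your own mapreduce.jobhistory.webapp.address:

curl -s http://jhs-host:19888/conf > jobhistory-runtime-conf.xml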
Then, we should revise the 'yarn.nodemanager.resource.memory-mb' property in $HADOOP_HOME/etc/hadoop/yarn-site.xml to match our new memory capacity and synchronize it to all nodes in the Hadoop cluster. Since the DataNodes are not restarted yet, the configuration change takes no effect for now.
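For reference, the property block in yarn-site.xml might look like the following. The 106496 MB value is purely an assumption for an imagined 128 GB node, leaving headroom for the OS and the DataNode/NodeManager daemons; set it according to your own capacity.

<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>106496</value>  <!-- assumed value: ~104 GB out of an assumed 128 GB node -->
</property>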
su hadoop

#all DataNodes
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
 echo '';
 echo $i;
 rsync -r --delete $HADOOP_HOME/etc/hadoop/ hadoop@$i:/home/workspace/hadoop/etc/hadoop/;
done

#NameNode
rsync -r --delete $HADOOP_HOME/etc/hadoop/ hadoop@k1202.hide.cn:/home/workspace/hadoop/etc/hadoop/;

For every DataNode:

1.1. Backup information about current block devices

We need to do so in case the disk mount info is lost after the DataNode machine restarts. Execute the following command and paste the output into '/etc/rc.local'.
--mount.sh--
n=1 ; for i in a b c d e f g h i j k l ; do a=`/sbin/blkid -s UUID | grep ^/dev/sd$i | awk '{print $2}'` ; echo mount $a /home/data$n ; n=`echo $n+1|bc` ; done

> bash mount.sh
mount UUID="09c42017-9308-45c3-9509-e77a2e99c732" /home/data1
mount UUID="72461da2-b0c0-432a-9b65-0ac5bc5bc69a" /home/data2
mount UUID="6d447f43-b2db-4f69-a3b2-a4f69f2544ea" /home/data3
mount UUID="37ca4fb8-377c-493d-9a4c-825f1500ae52" /home/data4
mount UUID="53334c93-13ff-41f5-8688-07023bd6f11a" /home/data5
mount UUID="10fa31f7-9c29-4190-8ecd-ec893d59634c" /home/data6
mount UUID="fe28b8dd-ff3b-49d9-87c6-6eee9f389966" /home/data7
mount UUID="5201d24b-9310-4cff-b3ad-5b09e47780a5" /home/data8
mount UUID="d3b85455-8b94-4817-b43e-69481f9c13c4" /home/data9
mount UUID="6f2630f1-7cfe-4cac-b52d-557f46779539" /home/data10
mount UUID="bafc742d-1477-439a-ade4-29711c5db840" /home/data11
mount UUID="bf6e36d8-1410-4547-853c-f541c9a07e52" /home/data12
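One quick way to do the paste, assuming you have reviewed the generated lines first (run as root):

bash mount.sh >> /etc/rc.local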

1.2. Stop DataNode service

$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode

1.3. Stop NodeManager service

$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager

1.4. Double-check on Java process

Check whether the DataNode and NodeManager processes have been stopped. If not, invoke `kill -9 PID` to stop them forcibly.
ps aux | grep java
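If either of them is still alive, a small sketch to locate and force-kill it, assuming jps is on the PATH:

# force-kill DataNode/NodeManager only if they refused to stop gracefully
for p in DataNode NodeManager; do
  pid=$(jps | awk -v proc="$p" '$2 == proc {print $1}')
  [ -n "$pid" ] && kill -9 "$pid"
done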

1.5. Shutdown DataNode

After issuing the following command to shut down the DataNode machine, we wait for the signal from our colleagues that they have finished installing the memory modules.
su root
/sbin/init 0

1.6. Checks and operations on Linux after the DataNode machine restarts

After the machine is back up, check the most significant part first: memory. Make sure it has been increased as expected.
free -g
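To make the check mechanical rather than eyeballed, a minimal sketch; the expected size below is only an assumption, substitute the real post-upgrade capacity:

EXPECTED_GB=128   # assumption: the capacity installed during the upgrade
ACTUAL_GB=$(free -g | awk '/^Mem:/ {print $2}')
[ "$ACTUAL_GB" -ge "$EXPECTED_GB" ] || echo "WARNING: only ${ACTUAL_GB} GB detected, expected ${EXPECTED_GB} GB"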

Next, check the disk mount info. If it is not consistent with the backup taken in 1.1., execute the backed-up mount commands to restore it.
df
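A quick sketch to verify that all twelve data directories from 1.1. are actually mounted, assuming util-linux's mountpoint command is available:

for n in $(seq 1 12); do
  mountpoint -q /home/data$n || echo "/home/data$n is NOT mounted"
done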

Open the firewall (iptables); a guide is in another of my posts.

1.7. Start DataNode service

$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode

1.8. Start NodeManager service

$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

1.9. Check health condition

Check whether the DataNode and NodeManager processes exist:
ps aux | grep java

If they do, look through $HADOOP_HOME/logs/*.log to make sure there is no critical ERROR in them.
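A quick way to scan for them; the files under $HADOOP_HOME/logs/ follow the default hadoop-*-datanode-*.log / yarn-*-nodemanager-*.log naming, adjust if yours differ:

grep -E "ERROR|FATAL" $HADOOP_HOME/logs/*.log | tail -n 20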

That's all for the DataNode part. We need to repeat steps 1.1. through 1.9. for every DataNode. Since our HDFS replication factor is set to 3, at most 2 failed DataNodes can be tolerated at any time. This should be kept in mind.
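Before taking the next DataNode down, it is therefore worth confirming that HDFS is healthy again; a minimal sketch using fsck:

# the fsck summary contains the block-health counters we care about
$HADOOP_HOME/bin/hdfs fsck / | grep -E "Under-replicated blocks|Corrupt blocks|Missing replicas"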

The next section covers the operations on YARN.

2.1. Double-check on all DataNodes

Look through all the DataNodes listed on the YARN monitoring webpage, whose address is configured in yarn-site.xml as yarn.resourcemanager.webapp.address, to make sure that they all work normally.
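If clicking through the web UI is tedious, the same node list can be pulled from the ResourceManager REST API; rm-host:8088 below is just a placeholder for your own yarn.resourcemanager.webapp.address, and field names such as availMemoryMB are as in Hadoop 2.x:

# python -m json.tool just pretty-prints the JSON so grep works line by line
curl -s http://rm-host:8088/ws/v1/cluster/nodes | python -m json.tool | grep -E '"state"|"availMemoryMB"'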


2.2. Start/Stop service

SSH to the node where the HistoryServer, if any, is running, and shut down the service. Double-check with `ps aux | grep java`; if the process still exists, execute `kill -9 PID` on it.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver

Shut down the YARN service.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/stop-yarn.sh

Check whether the NodeManager process has been stopped on every DataNode; if not, we have to `kill -9 PID` it.
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
 echo '';
 echo $i;
 ssh supertool@$i "/usr/java/jdk1.7.0_11/bin/jps ";
done

Restart the YARN service and check again with the shell script above to make sure all NodeManager processes have started.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/start-yarn.sh

Restart HistoryServer, if any.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver


© 2014-2017 jason4zhu.blogspot.com All Rights Reserved 
If reposting, please cite the origin: Jason4Zhu
