There was an emergency requirement to increase the memory of the DataNodes in our Hadoop cluster. Here's the detailed process, which involves two phases: operations on the DataNodes and operations on YARN.
At the very beginning, we should back up the configuration files under $HADOOP_HOME/etc/hadoop/* in git.
At the same time, we should back up the runtime configuration of the JobHistory server, provided HistoryServer is enabled. This is explained in another of my posts: Dig Into JobHistory Server Of MapReduce In Hadoop2. The runtime configuration can be found on the monitoring webpage of the JobHistory server.
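If you prefer the command line, a minimal sketch for snapshotting that runtime configuration is shown below. The host and port are assumptions (19888 is the default for mapreduce.jobhistory.webapp.address); adjust them to your own cluster.

# Hedged sketch: dump the JobHistory server's runtime configuration via its /conf servlet
# jobhistory.hide.cn:19888 is a hypothetical address, replace it with your mapreduce.jobhistory.webapp.address
curl -s http://jobhistory.hide.cn:19888/conf > /tmp/jobhistory-runtime-conf.xml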
Then, we should revise the 'yarn.nodemanager.resource.memory-mb' property in $HADOOP_HOME/etc/hadoop/yarn-site.xml to reflect the new memory capacity and synchronize it to all nodes in the Hadoop cluster. Since the DataNodes have not been restarted yet, the configuration change takes no effect at this point.
su hadoop

# all DataNodes
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
  echo ''
  echo $i
  rsync -r --delete $HADOOP_HOME/etc/hadoop/ hadoop@$i:/home/workspace/hadoop/etc/hadoop/
done

# NameNode
rsync -r --delete $HADOOP_HOME/etc/hadoop/ hadoop@k1202.hide.cn:/home/workspace/hadoop/etc/hadoop/
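As an extra sanity check (a minimal sketch, not part of the original procedure), we can grep the synchronized file on every slave to confirm the new value has landed; the -A 2 context length is only an assumption about how the property is laid out in yarn-site.xml.

# Hedged sketch: confirm yarn.nodemanager.resource.memory-mb has been synchronized to every slave
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
  echo $i
  ssh hadoop@$i "grep -A 2 'yarn.nodemanager.resource.memory-mb' /home/workspace/hadoop/etc/hadoop/yarn-site.xml"
done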
For every DataNode:
1.1. Back up information about current block devices
We need to do this in case the disk mount info is lost after we restart the DataNode. Execute the following command and paste the output into '/etc/rc.local'.
--mount.sh--
n=1
for i in a b c d e f g h i j k l
do
  a=`/sbin/blkid -s UUID | grep ^/dev/sd$i | awk '{print $2}'`
  echo mount $a /home/data$n
  n=`echo $n+1|bc`
done

> bash mount.sh
mount UUID="09c42017-9308-45c3-9509-e77a2e99c732" /home/data1
mount UUID="72461da2-b0c0-432a-9b65-0ac5bc5bc69a" /home/data2
mount UUID="6d447f43-b2db-4f69-a3b2-a4f69f2544ea" /home/data3
mount UUID="37ca4fb8-377c-493d-9a4c-825f1500ae52" /home/data4
mount UUID="53334c93-13ff-41f5-8688-07023bd6f11a" /home/data5
mount UUID="10fa31f7-9c29-4190-8ecd-ec893d59634c" /home/data6
mount UUID="fe28b8dd-ff3b-49d9-87c6-6eee9f389966" /home/data7
mount UUID="5201d24b-9310-4cff-b3ad-5b09e47780a5" /home/data8
mount UUID="d3b85455-8b94-4817-b43e-69481f9c13c4" /home/data9
mount UUID="6f2630f1-7cfe-4cac-b52d-557f46779539" /home/data10
mount UUID="bafc742d-1477-439a-ade4-29711c5db840" /home/data11
mount UUID="bf6e36d8-1410-4547-853c-f541c9a07e52" /home/data12
1.2. Stop DataNode service
$HADOOP_HOME/sbin/hadoop-daemon.sh stop datanode
1.3. Stop NodeManager service
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
1.4. Double-check on Java process
Check whether the DataNode and NodeManager processes have been stopped. If not, invoke `kill -9 PID` to stop them forcibly.
ps aux | grep java
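If anything is still alive, a minimal sketch for killing it forcibly is below; the class-name match is an assumption about how the JVMs appear in your process list, so double-check the PIDs before killing.

# Hedged sketch: forcibly kill any DataNode/NodeManager JVM that survived the stop scripts
for p in DataNode NodeManager
do
  pid=$(ps aux | grep java | grep -w $p | grep -v grep | awk '{print $2}')
  [ -n "$pid" ] && echo "killing $p ($pid)" && kill -9 $pid
done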
1.5. Shut down the DataNode machine
After issuing the following command to shut down the machine, we wait for the signal from our colleagues that they have finished installing the memory modules.
su root
/sbin/init 0
1.6. Checks and operations on Linux after the DataNode machine restarts
After the machine restarts, check the most significant part, memory, to make sure it has been increased as expected.
free -g
Next, check the disk mount info. If it is not consistent with the backup taken in 1.1., execute the backed-up mount commands to remount the disks.
df
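A minimal sketch of that consistency check, assuming the twelve /home/dataN mount points from 1.1. (adjust the expected count to your own layout):

# Hedged sketch: verify all data disks are mounted again after the reboot
expected=12                              # number of data disks backed up in 1.1. (assumption)
mounted=$(mount | grep -c '/home/data')
echo "$mounted of $expected data disks mounted"
# If some are missing, re-run the 'mount UUID=... /home/dataN' lines saved in /etc/rc.local.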
Bring up the firewall (iptables); the guide is in another of my posts.
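A minimal sketch, assuming CentOS 6-style service scripts (consistent with the /sbin/init usage above); your distribution may differ.

# Hedged sketch (run as root): bring up iptables and eyeball the loaded rules
service iptables start
chkconfig iptables on      # keep the firewall enabled across future reboots
iptables -L -n | head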
1.7. Start DataNode service
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
1.8. Start NodeManager service
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager
1.9. Check health condition
Check whether the DataNode and NodeManager processes exist:
ps aux | grep java
If they do, look through $HADOOP_HOME/logs/*.log to make sure there is no vital ERROR in them.
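A quick way to scan for such entries (a minimal sketch; the patterns and tail length are arbitrary):

# Hedged sketch: scan the DataNode/NodeManager logs for ERROR or FATAL entries
grep -E 'ERROR|FATAL' $HADOOP_HOME/logs/*.log | tail -n 50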
That's all for the DataNode part. We need to repeat steps 1.1. through 1.9. for every DataNode. Since our HDFS replication factor is set to 3, at most 2 DataNodes can be down at the same time. This should be kept in mind.
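Before moving on to the next DataNode, it may also be worth confirming that HDFS is healthy again; a minimal sketch using standard HDFS commands (not part of the original procedure):

# Hedged sketch: confirm the restarted DataNode has rejoined and no blocks are missing
hdfs dfsadmin -report | head -n 20     # check live/dead DataNode counts and under-replicated blocks
hdfs fsck / | tail -n 25               # the summary should say the filesystem under '/' is HEALTHY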
The next section covers the operations on YARN.
2.1. Double-check on all DataNodes
Look through all the DataNodes listed on the YARN monitoring webpage, whose address is configured in yarn-site.xml via yarn.resourcemanager.webapp.address, to make sure they all work normally.
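The same check can also be done from the command line; a minimal sketch (`-all` lists NodeManagers in every state, not only RUNNING ones):

# Hedged sketch: list every NodeManager and its state as seen by the ResourceManager
yarn node -list -all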
2.2. Stop/start services
SSH to the node where the HistoryServer, if any, is running, and shut down the service. Double-check with `ps aux | grep java`; if the process still exists, execute `kill -9 PID` on it.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
Shut down the YARN service.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/stop-yarn.sh
Check whether the NodeManager process has been stopped on every DataNode; if not, we have to `kill -9 PID` it.
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
  echo ''
  echo $i
  ssh supertool@$i "/usr/java/jdk1.7.0_11/bin/jps"
done
Restart the YARN service and check again with the above shell script to make sure all NodeManager processes have started.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/start-yarn.sh
Restart HistoryServer, if any.
cd $HADOOP_HOME
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver
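To close the loop, we can fetch the runtime configuration again and compare it with the backup taken at the very beginning; a minimal sketch, reusing the same assumed host and port as before:

# Hedged sketch: diff the restarted JobHistory server's runtime config against the earlier backup
curl -s http://jobhistory.hide.cn:19888/conf > /tmp/jobhistory-runtime-conf.new.xml
diff /tmp/jobhistory-runtime-conf.xml /tmp/jobhistory-runtime-conf.new.xml | head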
© 2014-2017 jason4zhu.blogspot.com All Rights Reserved
If reposting, please credit the origin: Jason4Zhu