Wednesday, January 21, 2015

Note On NameNode HA

The overall procedure is well explained in the references listed at the bottom. Here are just a few essential points that I want to emphasize.

Here are the configurations for our Hadoop cluster that are relevant to NameNode HA:
core-site.xml:
<property>
    <name>fs.defaultFS</name>
    <value>hdfs://ns1</value>
</property>

<property>
    <name>ha.zookeeper.quorum</name>
    <value>644v4.mzhen.cn:2181,644v5.mzhen.cn:2181,644v6.mzhen.cn:2181</value>
</property>

hdfs-site.xml:
<property>
    <name>dfs.nameservices</name>
    <value>ns1</value>
</property>

<property>
     <name>dfs.ha.namenodes.ns1</name>
     <value>nn1,nn2</value>
</property>

<property>
     <name>dfs.namenode.rpc-address.ns1.nn1</name>
     <value>644v1.mzhen.cn:9000</value>
</property>

<property>
     <name>dfs.namenode.rpc-address.ns1.nn2</name>
     <value>644v2.mzhen.cn:9000</value>
</property>

<property>
     <name>dfs.namenode.http-address.ns1.nn1</name>
     <value>644v1.mzhen.cn:10001</value>
</property>

<property>
     <name>dfs.namenode.http-address.ns1.nn2</name>
     <value>644v2.mzhen.cn:10001</value>
</property>

<property>
     <name>dfs.namenode.shared.edits.dir</name>
     <value>qjournal://644v4.mzhen.cn:8485;644v5.mzhen.cn:8485;644v6.mzhen.cn:8485/ns1</value>
</property>

<property>
     <name>dfs.journalnode.edits.dir</name>
     <value>/home/data/hdfsdir/journal</value>
</property>

<property>
     <name>dfs.client.failover.proxy.provider.ns1</name>
     <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>

<property>
     <name>dfs.ha.fencing.methods</name>
     <value>sshfence</value>
</property>

<property>
     <name>dfs.ha.fencing.ssh.private-key-files</name>
     <value>/home/supertool/.ssh/id_rsa</value>
</property>

<property>
     <name>dfs.ha.automatic-failover.enabled</name>
     <value>true</value>
</property>
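
With these settings in place, clients address HDFS through the logical nameservice "ns1" rather than a specific NameNode host. A minimal sanity check, once the cluster is up, is to list the root directory through the logical URI:

        ./bin/hdfs dfs -ls hdfs://ns1/

Because dfs.client.failover.proxy.provider.ns1 is set to ConfiguredFailoverProxyProvider, the client tries nn1 and nn2 in turn, so the command succeeds no matter which NameNode is currently active.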


After the Hadoop cluster is fully started, there are some checkpoints that should be verified to make sure NameNode HA is fully in effect (see the jps sketch after this list):

1. Nodes listed in the "dfs.ha.namenodes.ns1" property in hdfs-site.xml should be running the "DFSZKFailoverController" and "NameNode" processes.
2. Nodes listed in the "dfs.namenode.shared.edits.dir" property in hdfs-site.xml should be running the "JournalNode" process.
3. Nodes listed in the "ha.zookeeper.quorum" property in core-site.xml should be running the "QuorumPeerMain" process.
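
A minimal sketch for verifying these checkpoints is to run jps (which ships with the JDK) on each node and look for the expected process names; the hostnames below are just our cluster's, and the exact output will differ:

        ssh 644v1.mzhen.cn jps | grep -E 'NameNode|DFSZKFailoverController'
        ssh 644v4.mzhen.cn jps | grep -E 'JournalNode|QuorumPeerMain'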

If any of these processes needs to be launched (or restarted) individually, the relevant commands are listed below:

QuorumPeerMain: The service for ZooKeeper.
        bin/zkServer.sh start
        bin/zkServer.sh status
        bin/zkServer.sh stop
        bin/zkServer.sh restart
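
        If "zkServer.sh status" is inconclusive, ZooKeeper's built-in four-letter-word commands provide another liveness check (assuming nc is available on the node):
        echo ruok | nc 644v4.mzhen.cn 2181    ## a healthy server replies "imok"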

JournalNode: In order for the Standby node to keep its state synchronized with the Active node, both nodes communicate with a group of separate daemons called "JournalNodes" (JNs). The Active node writes each edit to a majority of the JNs, which is why an odd number (three in our case) is deployed: the cluster can then tolerate the loss of one JournalNode.
        ./sbin/hadoop-daemon.sh stop journalnode
        ./sbin/hadoop-daemon.sh start journalnode
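
        A quick way to confirm that a JournalNode is actually receiving edits is to inspect its storage directory (the path configured in dfs.journalnode.edits.dir, under the journal ID "ns1" from the qjournal URI); edit log segments should accumulate there:
        ls /home/data/hdfsdir/journal/ns1/current/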

NameNode:
        ./sbin/hadoop-daemon.sh stop namenode
        ./sbin/hadoop-daemon.sh start namenode
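
        Note that when the second NameNode is brought up for the very first time, its metadata directory must be seeded from the already-formatted NameNode before it is started; this standard one-time step is run on the standby:
        ./bin/hdfs namenode -bootstrapStandby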

DFSZKFailoverController:
        ./sbin/hadoop-daemon.sh stop zkfc
        ./sbin/hadoop-daemon.sh start zkfc
        Note that if the above command fails with no explicit error, you can run `./bin/hdfs zkfc` in the foreground to retrieve detailed diagnostic output.
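
        Also, automatic failover requires a one-time initialization of the HA state in ZooKeeper before zkfc is started for the first time; run this once from one of the NameNode machines:
        ./bin/hdfs zkfc -formatZK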

Lastly, some common commands relevant to NameNode HA are listed here:
## Get the status of NameNode, active or standby.
hdfs haadmin -getServiceState nn1

## Manually transition a NameNode to active, which requires 'dfs.ha.automatic-failover.enabled' to be set to 'false'.
hdfs haadmin -transitionToActive nn1
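
## Check the health of a given NameNode.
hdfs haadmin -checkHealth nn1

## Initiate a failover from nn1 to nn2. With automatic failover enabled, Hadoop 2.x coordinates this through the ZKFCs; behavior varies by version, so consult the haadmin docs for yours.
hdfs haadmin -failover nn1 nn2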



References:
1. High Availability for Hadoop - Hortonworks
2. HDFS High Availability Using the Quorum Journal Manager - Apache Hadoop


