Friday, November 21, 2014

Some Mapper Tasks In Hadoop Hang For A Long Time With epollWait Shown By jstack


Phenomenon

When our scheduled jobs run at night, some mapper task attempts hang without making any progress for a long time. Just as a bucket holds only as much water as its shortest stave allows, a job finishes only as fast as its slowest task. Thus, the elapsed time of our job is dragged out far longer than it is supposed to be:




Path To The Solution

At first, we intend to ssh to the specific machine where the stuck mapper task attempt is running to check what is going on, but ssh is stuck too. Checking Ganglia, we see that the total consumption of CPU, network IO and memory is relatively low, which means none of them is the bottleneck of this problem.

After this machine recovers, we log in and check the kernel message buffer; a mountain of "TCP: too many of orphaned sockets" messages is printed on the screen:
> su root
> dmesg
Out of socket memory
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
...
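A quick way to see how often each message has been logged (assuming the messages appear verbatim as above) is to count them in the kernel ring buffer:
// Count occurrences of each message.
dmesg | grep -c "Out of socket memory"
dmesg | grep -c "too many of orphaned sockets"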

In general, there are two possibilities which can cause "Out of socket memory", namely:
  1. The server is running out of TCP memory
  2. There are too many orphaned sockets on the system
For the first case, we can check the setting of TCP memory by:
cat /proc/sys/net/ipv4/tcp_mem
196608 262144 393216

tcp_mem is a vector of 3 integers, all counted in memory pages: min, pressure and max (a quick way to translate them into megabytes is sketched after this list).
  • min : below this number of pages TCP is not bothered about its memory consumption. 
  • pressure: when the amount of memory allocated to TCP by the kernel exceeds this threshold, the kernel starts to moderate the memory consumption. This mode is exited when memory consumption falls under min.
  • max : the max number of pages allowed for queuing by all TCP sockets. When the system goes above this threshold, the kernel will start throwing the "Out of socket memory" error in the logs.
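Since these values are in pages rather than bytes, it helps to translate them before comparing with real memory sizes. A minimal sketch, assuming the page size reported by getconf (typically 4096 bytes); for the values above it works out to roughly 768 MB, 1024 MB and 1536 MB:
//--tcp_mem unit conversion--
#!/bin/bash
# tcp_mem is counted in pages; convert min/pressure/max into megabytes.
page_size=$(getconf PAGESIZE)
read min pressure max < /proc/sys/net/ipv4/tcp_mem
echo "min=$((min * page_size / 1024 / 1024))MB pressure=$((pressure * page_size / 1024 / 1024))MB max=$((max * page_size / 1024 / 1024))MB"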

For the second case, the setting of tcp_max_orphans is shown by:
cat /proc/sys/net/ipv4/tcp_max_orphans
300000

The current orphan count and TCP memory usage can be checked as follows (the fields that matter here are picked out right after the output):
cat /proc/net/sockstat
sockets: used 755
TCP: inuse 103 orphan 0 tw 0 alloc 308 mem 221
UDP: inuse 10 mem 0
RAW: inuse 0
FRAG: inuse 0 memory 0
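For this investigation, the interesting fields are on the TCP line: 'orphan' is compared against tcp_max_orphans, and the trailing 'mem' value is the current TCP memory allocation in pages, compared against the tcp_mem max. A small awk sketch that picks them out:
// Extract the orphan count and TCP memory allocation (in pages).
awk '/^TCP:/ {print "orphan="$5, "mem_pages="$NF}' /proc/net/sockstat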

In order to monitor the current orphan count as well as TCP memory usage over time, I write a script:
//--orphan.sh--
#!/bin/bash
# Append a timestamped snapshot of /proc/net/sockstat every 5 seconds.
while true
do
date >> /home/hadoop/orphan.out
cat /proc/net/sockstat >> /home/hadoop/orphan.out
echo "" >> /home/hadoop/orphan.out
sleep 5
done


//--orphan distribution shell--
# Copy the monitor script to every slave and start it there in the background.
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
echo $i
scp orphan.sh hadoop@$i:/home/hadoop/
ssh hadoop@$i "nohup bash /home/hadoop/orphan.sh > /dev/null 2>&1 &"
done
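The monitor runs until it is killed; when the investigation is over it can be stopped on every node with a loop of the same shape (a sketch, assuming no other process on the slaves has 'orphan.sh' in its command line):
//--orphan stop shell--
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
echo $i
ssh hadoop@$i "pkill -f orphan.sh"
done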

After some time, I examine the collected logs on every node to see how many times each condition is triggered. For the orphan-socket case, no node's orphan count ever exceeds tcp_max_orphans. For the TCP memory case, however, there are some nodes in our Hadoop cluster that frequently exceed the max of tcp_mem:
for i in $(cat etc/hadoop/slaves | grep -v '#'); do out_mem=$(ssh hadoop@$i "grep TCP orphan.out  | awk '{if(\$NF >  393216)print}'  |wc -l"); echo $i, $out_mem; done
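The analogous command for the orphan side compares field 5 of the TCP line against the tcp_max_orphans value of 300000 (a sketch following the same pattern as above):
for i in $(cat etc/hadoop/slaves | grep -v '#'); do out_orphan=$(ssh hadoop@$i "grep TCP orphan.out | awk '{if(\$5 > 300000)print}' | wc -l"); echo $i, $out_orphan; done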

Surprisingly, all the problematic nodes are relatively new ones, on which we forgot to tune the TCP configuration. The reference we use for tuning is Tuning Hadoop on Dell PowerEdge Servers.

Related commands for `sysctl`:
// The file to configure custom parameters.
vim /etc/sysctl.conf

// Command to list all configurations in sysctl.
/sbin/sysctl -a
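A few other standard sysctl invocations that come in handy here (see sysctl(8)):
// Read a single parameter.
/sbin/sysctl net.ipv4.tcp_mem

// Set a parameter at runtime (not persistent across reboots).
/sbin/sysctl -w net.ipv4.tcp_max_orphans=300000

// Reload the user-defined parameters from /etc/sysctl.conf.
/sbin/sysctl -p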

The sysctl.conf of a tuned node:
fs.file-max = 800000
net.core.rmem_default = 12697600
net.core.wmem_default = 12697600
net.core.rmem_max = 873800000
net.core.wmem_max = 655360000
net.ipv4.tcp_rmem = 8192 262144 4096000
net.ipv4.tcp_wmem = 4096 262144 4096000
net.ipv4.tcp_mem = 196608 262144 3145728
net.ipv4.tcp_max_orphans = 300000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1025 65535
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 1500
net.core.somaxconn=32768
vm.swappiness=0

The differences between tuned and non-tuned nodes are shown below (a way to generate such a diff automatically is sketched after the two listings):
//—- tuned --
fs.file-max = 800000
net.core.rmem_default = 12697600
net.core.wmem_default = 12697600
net.core.rmem_max = 873800000
net.core.wmem_max = 655360000
net.ipv4.tcp_rmem = 8192 262144 4096000
net.ipv4.tcp_wmem = 4096 262144 4096000
net.ipv4.tcp_max_orphans = 300000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.ip_local_port_range = 1025 65535
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_fin_timeout = 30
net.core.somaxconn=32768
vm.swappiness=0

//—- non-tuned --
fs.file-max = 400000
net.core.rmem_default = 12697600
net.core.wmem_default = 12697600
net.core.rmem_max = 873800000
net.core.wmem_max = 655360000
net.ipv4.tcp_rmem =  8192 8738000 873800000
net.ipv4.tcp_wmem = 4096 6553600 655360000
net.ipv4.tcp_max_orphans = 300000
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.ip_local_port_range = 1025 65535
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_fin_timeout = 2
net.core.somaxconn=128
vm.swappiness=60
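Such a diff can also be generated mechanically instead of eyeballing two listings. A minimal sketch, assuming password-less ssh as the hadoop user and the hypothetical host names tuned-node and nontuned-node:
// Compare the kernel parameters of two nodes.
diff <(ssh hadoop@tuned-node "/sbin/sysctl -a 2>/dev/null | sort") \
     <(ssh hadoop@nontuned-node "/sbin/sysctl -a 2>/dev/null | sort")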

After configuring all the non-tuned nodes exactly the same as the tuned ones and restarting the network via `/etc/init.d/network restart` (`/sbin/sysctl -p` can also reload the user-defined parameters set in /etc/sysctl.conf), the 'Out of socket memory' error is gone.
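Rolling the tuned settings out to the remaining nodes can reuse the same distribution pattern as orphan.sh. A sketch, assuming root access on the slaves and that /etc/sysctl.conf on the current node is already the tuned version:
//--sysctl distribution shell--
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
echo $i
scp /etc/sysctl.conf root@$i:/etc/sysctl.conf
ssh root@$i "/sbin/sysctl -p"
done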

However, mapper task attempts still hang. With more in-depth observation, we find that mapper task attempts progress smoothly when only one or two jobs are running simultaneously.

This time, we can ssh to the specific node where the stuck mapper task attempt is running. Assuming the mapper task attempt is 'attempt_1413206225298_34853_m_000035_0', the following commands print out the stack trace of the Java process in its stuck state:
// Get the corresponding PID of this stuck mapper task attempt
/usr/java/jdk1.7.0_11/bin/jps -ml | grep attempt_1413206225298_34853_m_000035_0

// Print out current stack info (1570 is the PID returned by jps above)
/usr/java/jdk1.7.0_11/bin/jstack 1570

"main" prio=10 tid=0x000000000a4e0000 nid=0x3abb runnable [0x000000004108c000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x00000007a674bb38> (a sun.nio.ch.Util$2)
        - locked <0x00000007a674bb28> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000007a66487d8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:170)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:135)
        - locked <0x00000007b2d99ee0> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
        at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:642)
        at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:698)
        - eliminated <0x00000007b2d91d68> (a org.apache.hadoop.hdfs.DFSInputStream)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:752)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
        - locked <0x00000007b2d91d68> (a org.apache.hadoop.hdfs.DFSInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at com.miaozhen.app.MzSequenceFile$PartInputStream.read(MzSequenceFile.java:453)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:127)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.io.Text.readFields(Text.java:292)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
        at com.miaozhen.app.MzSequenceFile$Reader.deserializeValue(MzSequenceFile.java:677)
        at com.miaozhen.app.MzSequenceFile$Reader.next(MzSequenceFile.java:694)
        at com.miaozhen.app.MzSequenceFile$Reader.next(MzSequenceFile.java:704)
        at com.miaozhen.yo.dicmanager.panel.NewPanel.getNextRow(NewPanel.java:379)
        at com.miaozhen.yo.dicmanager.panel.NewPanel.locate(NewPanel.java:414)
        at com.miaozhen.yo.tcpreporter.Stat.getPanel(Stat.java:66)
        at com.miaozhen.yo.tcpreporter.Stat.update(Stat.java:99)
        at com.miaozhen.yo.phase1.Phase1Mapper.updateHistory(Phase1Mapper.java:75)
        at com.miaozhen.yo.phase1.Phase1Mapper.map(Phase1Mapper.java:56)
        at com.miaozhen.yo.phase1.Phase1Mapper.map(Phase1Mapper.java:31)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)

From the stack trace, we can see the process is stuck at epollWait in Java NIO, waiting on a socket read while fetching an HDFS block (SocketInputStream.read underneath RemoteBlockReader2).
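A single jstack snapshot only proves the thread is in epollWait at that instant; to confirm it is genuinely stuck rather than making slow progress, a few samples can be taken and compared. A minimal sketch (the attempt id and JDK path are the ones used above):
//--sample the stuck task a few times--
#!/bin/bash
# jps -ml prints "PID main-class args"; take the PID of the stuck attempt.
pid=$(/usr/java/jdk1.7.0_11/bin/jps -ml | grep attempt_1413206225298_34853_m_000035_0 | awk '{print $1}')
# Take three thread dumps, ten seconds apart.
for n in 1 2 3
do
/usr/java/jdk1.7.0_11/bin/jstack $pid > /tmp/jstack.$n.out
sleep 10
done
If the "main" thread shows the same epollWait stack in every dump, it is truly blocked on the socket read rather than just polling briefly.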

It is likely that there is some kind of limit on the number of concurrent connections that can serve a file in HDFS: some global files are read by all mapper task attempts of all jobs, and with a replication factor of 3 those reads are served by only three DataNodes, which can easily be overwhelmed when many jobs run at once. From the above stack trace, we can see that these global files are the 'Panel' files. Thus we raise the replication factor of the 'Panel' files from 3 to 31:
hadoop dfs -setrep -R -w 31 /tong/data/resource
hadoop dfs -ls -R /tong/data/resource
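As a follow-up check, the second column of the -ls -R output is the per-file replication factor and should now read 31 for these files; fsck gives a cluster-wide view of whether the extra replicas have actually been created (a sketch using the standard fsck options):
// Report files, blocks and replication status under the path.
hadoop fsck /tong/data/resource -files -blocks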


This time, the mappers don't hang anymore.

In conclusion, when some mappers hang (or, more generally, when a Java process is stuck), we should check CPU, network IO and memory usage first. If nothing looks awry there, we can apply `jstack` to the process to locate where it is stuck and track down the problem.


References:
  1. troubleshooting-out-of-socket-memory
  2. linux-tcp-tuning
  3. sysctl(8)
  4. configtuning-sysctl
  5. sysctl-conf


© 2014-2017 jason4zhu.blogspot.com All Rights Reserved 
If reposting, please credit the origin: Jason4Zhu
