Phenomenon
When our scheduled jobs run at night, some mapper task attempts hang without making any progress for a long time. As with the bucket effect, where a bucket's capacity is determined by its shortest stave, a job's total elapsed time is determined by its slowest task, so our jobs take much longer than they should.

Footprint Of Solution
At first, we intended to ssh into the machine where the stuck mapper task attempt was running to see what was going on, but ssh hung as well. Checking Ganglia, the overall CPU, network I/O, and memory consumption were all relatively low, which means none of them was the bottleneck. After the machine recovered, we logged in and inspected the kernel message buffer, which was flooded with "TCP: too many of orphaned sockets" messages:
```
> su root
> dmesg
Out of socket memory
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
TCP: too many of orphaned sockets
...
```
In general, there are two possibilities that can cause "Out of socket memory":
- The server is running out of TCP memory
- There are too many orphaned sockets on the system
For the first case, we can check the TCP memory settings. tcp_mem is a vector of three integers: min, pressure, and max:

```
cat /proc/sys/net/ipv4/tcp_mem
196608  262144  393216
```
- min: below this number of pages, TCP does not regulate its memory consumption.
- pressure: when the amount of memory allocated to TCP by the kernel exceeds this threshold, the kernel starts to moderate its memory consumption. It exits this mode once consumption falls back under min.
- max: the maximum number of pages allowed for queuing by all TCP sockets. When the system goes above this threshold, the kernel starts throwing the "Out of socket memory" error in the logs. Note that all three values are counted in pages, not bytes; see the conversion sketch below.
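Since tcp_mem is expressed in pages, it is easy to misread; here is a minimal sketch for translating the max threshold into bytes (the page size is read from getconf rather than assuming 4 KiB):

```
# Translate the tcp_mem "max" threshold from pages into bytes
MAX_PAGES=$(awk '{print $3}' /proc/sys/net/ipv4/tcp_mem)
PAGE_SIZE=$(getconf PAGESIZE)
echo "$((MAX_PAGES * PAGE_SIZE)) bytes ($MAX_PAGES pages of $PAGE_SIZE bytes each)"
```

With the values above and a typical 4 KiB page, 393216 pages correspond to 1.5 GiB of memory for all TCP sockets combined.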
For the second case, the tcp_max_orphans setting can be checked with:
```
cat /proc/sys/net/ipv4/tcp_max_orphans
300000
```
The current orphan count and TCP memory usage can be checked as follows:
```
cat /proc/net/sockstat
sockets: used 755
TCP: inuse 103 orphan 0 tw 0 alloc 308 mem 221
UDP: inuse 10 mem 0
RAW: inuse 0
FRAG: inuse 0 memory 0
```
In order to monitor the current orphan count and TCP memory usage over time, I wrote a script and distributed it to every slave node:
```
#!/bin/bash
# --orphan.sh--
while true
do
    date >> /home/hadoop/orphan.out
    cat /proc/net/sockstat >> /home/hadoop/orphan.out
    echo "" >> /home/hadoop/orphan.out
    sleep 5
done
```

```
# --orphan distribution shell--
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
    echo $i
    scp orphan.sh hadoop@$i:/home/hadoop/
    ssh hadoop@$i "nohup bash orphan.sh 2>&1 &" &
done
```
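When the sampling has run long enough, a small companion loop can stop the monitors and pull the logs back for analysis. This is only a sketch under the same assumptions as above (the slaves file, passwordless ssh as the hadoop user, and orphan.out in that user's home directory):

```
# --orphan collection shell (hypothetical companion to orphan.sh)--
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v "#")
do
    echo $i
    # stop the remote sampling loop
    ssh hadoop@$i "pkill -f orphan.sh"
    # fetch the collected samples, one file per node
    scp hadoop@$i:/home/hadoop/orphan.out ./orphan.$i.out
done
```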
After some time, I went through the collected logs on every node to count how many times each condition was triggered. For the orphaned-socket case, the orphan count never exceeded tcp_max_orphans on any node. For the TCP-memory case, however, some nodes in our Hadoop cluster frequently exceeded the max threshold of tcp_mem:
```
for i in $(cat etc/hadoop/slaves | grep -v '#'); do
    out_mem=$(ssh hadoop@$i "grep TCP orphan.out | awk '{if(\$NF > 393216) print}' | wc -l")
    echo $i, $out_mem
done
```
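For completeness, the orphan side of the check can be done the same way. In the sockstat layout shown earlier, the orphan count is the fifth field of the "TCP:" line, and 300000 is the tcp_max_orphans value read above; this is just a sketch mirroring the loop above:

```
for i in $(cat etc/hadoop/slaves | grep -v '#'); do
    # field 5 of the "TCP:" line is the current orphan count
    over_orphan=$(ssh hadoop@$i "grep TCP orphan.out | awk '{if(\$5 > 300000) print}' | wc -l")
    echo $i, $over_orphan
done
```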
Surprisingly, all the problematic nodes were relatively new ones whose TCP configuration we had forgotten to tune. The reference we use for tuning is Tuning Hadoop on Dell PowerEdge Servers.
Related commands for `sysctl`:
```
# The file for custom kernel parameters
vim /etc/sysctl.conf
# List all current sysctl settings
/sbin/sysctl -a
```
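As a quick example of the workflow (a sketch, using the tcp_mem value from the tuned configuration below), a parameter can be changed at runtime with `-w` and then persisted in /etc/sysctl.conf:

```
# Apply a parameter immediately (runtime only, lost on reboot)
/sbin/sysctl -w net.ipv4.tcp_mem="196608 262144 3145728"
# Persist it across reboots, then reload everything from the file
echo "net.ipv4.tcp_mem = 196608 262144 3145728" >> /etc/sysctl.conf
/sbin/sysctl -p
```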
The sysctl.conf of a tuned node:
```
fs.file-max = 800000
net.core.rmem_default = 12697600
net.core.wmem_default = 12697600
net.core.rmem_max = 873800000
net.core.wmem_max = 655360000
net.ipv4.tcp_rmem = 8192 262144 4096000
net.ipv4.tcp_wmem = 4096 262144 4096000
net.ipv4.tcp_mem = 196608 262144 3145728
net.ipv4.tcp_max_orphans = 300000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.ip_local_port_range = 1025 65535
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_fin_timeout = 30
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.netfilter.ip_conntrack_tcp_timeout_established = 1500
net.core.somaxconn = 32768
vm.swappiness = 0
```
The differences between tuned and non-tuned nodes are shown below:
```
# -- tuned --
fs.file-max = 800000
net.core.rmem_default = 12697600
net.core.wmem_default = 12697600
net.core.rmem_max = 873800000
net.core.wmem_max = 655360000
net.ipv4.tcp_rmem = 8192 262144 4096000
net.ipv4.tcp_wmem = 4096 262144 4096000
net.ipv4.tcp_max_orphans = 300000
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_max_tw_buckets = 5000
net.ipv4.ip_local_port_range = 1025 65535
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_fin_timeout = 30
net.core.somaxconn = 32768
vm.swappiness = 0

# -- non-tuned --
fs.file-max = 400000
net.core.rmem_default = 12697600
net.core.wmem_default = 12697600
net.core.rmem_max = 873800000
net.core.wmem_max = 655360000
net.ipv4.tcp_rmem = 8192 8738000 873800000
net.ipv4.tcp_wmem = 4096 6553600 655360000
net.ipv4.tcp_max_orphans = 300000
net.ipv4.tcp_tw_reuse = 0
net.ipv4.tcp_tw_recycle = 0
net.ipv4.tcp_max_tw_buckets = 180000
net.ipv4.ip_local_port_range = 1025 65535
net.ipv4.tcp_max_syn_backlog = 100000
net.ipv4.tcp_fin_timeout = 2
net.core.somaxconn = 128
vm.swappiness = 60
```
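To find any remaining non-tuned nodes without logging into each one, a rough sweep like the following can dump the key parameters that differ above across the cluster for comparison (a sketch assuming the same slaves file and passwordless ssh as earlier; adjust the key list as needed):

```
# Keys that differ between the tuned and non-tuned profiles above
KEYS="fs.file-max net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.ipv4.tcp_tw_reuse \
net.ipv4.tcp_max_tw_buckets net.ipv4.tcp_fin_timeout net.core.somaxconn vm.swappiness"
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v '#'); do
    echo "== $i =="
    ssh hadoop@$i "/sbin/sysctl $KEYS"
done
```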
After configuring all the non-tuned nodes exactly the same as the tuned ones and restarting the network via `/etc/init.d/network restart` (`/sbin/sysctl -p` can also reload the user-defined parameters set in /etc/sysctl.conf), the 'Out of socket memory' error is gone.
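A rough way to confirm this across the whole cluster is to count occurrences of the message on every node and re-run the check later; the count should stop growing once the tuning is in place (only a sketch, since dmesg keeps old entries in its ring buffer):

```
for i in $(cat $HADOOP_HOME/etc/hadoop/slaves | grep -v '#'); do
    n=$(ssh hadoop@$i "dmesg | grep -c 'Out of socket memory'")
    echo "$i: $n"
done
```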
However, the mapper task attempts still hang. With more in-depth observation, we found that mapper task attempts progress smoothly as long as only one or two jobs are running concurrently.
This time, we could ssh into the node where the stuck mapper task attempt was running. Assuming the mapper task attempt is 'attempt_1413206225298_34853_m_000035_0', the following commands print the stack trace of the corresponding Java process in its stuck state:
```
# Get the corresponding PID of this stuck mapper task attempt
/usr/java/jdk1.7.0_11/bin/jps -ml | grep attempt_1413206225298_34853_m_000035_0

# Print out the current stack info
/usr/java/jdk1.7.0_11/bin/jstack 1570
```

```
"main" prio=10 tid=0x000000000a4e0000 nid=0x3abb runnable [0x000000004108c000]
   java.lang.Thread.State: RUNNABLE
        at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
        at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
        at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
        at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
        - locked <0x00000007a674bb38> (a sun.nio.ch.Util$2)
        - locked <0x00000007a674bb28> (a java.util.Collections$UnmodifiableSet)
        - locked <0x00000007a66487d8> (a sun.nio.ch.EPollSelectorImpl)
        at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
        at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
        at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
        at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.readChannelFully(PacketReceiver.java:258)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doReadFully(PacketReceiver.java:209)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.doRead(PacketReceiver.java:171)
        at org.apache.hadoop.hdfs.protocol.datatransfer.PacketReceiver.receiveNextPacket(PacketReceiver.java:102)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.readNextPacket(RemoteBlockReader2.java:170)
        at org.apache.hadoop.hdfs.RemoteBlockReader2.read(RemoteBlockReader2.java:135)
        - locked <0x00000007b2d99ee0> (a org.apache.hadoop.hdfs.RemoteBlockReader2)
        at org.apache.hadoop.hdfs.DFSInputStream$ByteArrayStrategy.doRead(DFSInputStream.java:642)
        at org.apache.hadoop.hdfs.DFSInputStream.readBuffer(DFSInputStream.java:698)
        - eliminated <0x00000007b2d91d68> (a org.apache.hadoop.hdfs.DFSInputStream)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:752)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:793)
        - locked <0x00000007b2d91d68> (a org.apache.hadoop.hdfs.DFSInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at com.miaozhen.app.MzSequenceFile$PartInputStream.read(MzSequenceFile.java:453)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:127)
        at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:98)
        at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:85)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.io.Text.readFields(Text.java:292)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:71)
        at org.apache.hadoop.io.serializer.WritableSerialization$WritableDeserializer.deserialize(WritableSerialization.java:42)
        at com.miaozhen.app.MzSequenceFile$Reader.deserializeValue(MzSequenceFile.java:677)
        at com.miaozhen.app.MzSequenceFile$Reader.next(MzSequenceFile.java:694)
        at com.miaozhen.app.MzSequenceFile$Reader.next(MzSequenceFile.java:704)
        at com.miaozhen.yo.dicmanager.panel.NewPanel.getNextRow(NewPanel.java:379)
        at com.miaozhen.yo.dicmanager.panel.NewPanel.locate(NewPanel.java:414)
        at com.miaozhen.yo.tcpreporter.Stat.getPanel(Stat.java:66)
        at com.miaozhen.yo.tcpreporter.Stat.update(Stat.java:99)
        at com.miaozhen.yo.phase1.Phase1Mapper.updateHistory(Phase1Mapper.java:75)
        at com.miaozhen.yo.phase1.Phase1Mapper.map(Phase1Mapper.java:56)
        at com.miaozhen.yo.phase1.Phase1Mapper.map(Phase1Mapper.java:31)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:54)
        at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:429)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
        at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491)
        at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157)
```
From the stack trace, we can see that the process is stuck in epollWait inside NIO, blocked waiting for data on a socket.
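A single snapshot can be misleading, since a healthy thread may simply happen to be in epollWait at that instant. Taking a few dumps some seconds apart and comparing them is more convincing; the following is only a sketch using the same JDK path and attempt id as above:

```
# Hypothetical helper: take several jstack snapshots of the stuck attempt
JBIN=/usr/java/jdk1.7.0_11/bin
ATTEMPT=attempt_1413206225298_34853_m_000035_0
PID=$($JBIN/jps -ml | grep $ATTEMPT | awk '{print $1}')
for n in 1 2 3; do
    $JBIN/jstack $PID > /tmp/$ATTEMPT.jstack.$n
    sleep 10
done
# If the "main" thread shows the same HDFS read frames in every dump,
# it is genuinely stuck waiting on the socket rather than making progress.
```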
It is likely that there is some kind of limit on how many connections can concurrently read a file in HDFS, presumably because each block is served only by the DataNodes holding its replicas. There are some global files in HDFS that are read by every mapper task attempt of every job, and from the stack trace above we can see that these global files are the 'Panel' files. Thus we changed the replication of the 'Panel' files from 3 to 31:
```
hadoop dfs -setrep -R -w 31 /tong/data/resource
hadoop dfs -ls -R /tong/data/resource
```
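The `-w` flag already waits for re-replication to finish, and the second column of the `-ls` output is each file's replication factor; an optional double-check (just a sketch) could look like:

```
# Print replication factor and path for everything under the directory
hadoop dfs -ls -R /tong/data/resource | awk '{print $2, $NF}'
# fsck summarizes block health and average block replication for the path
hadoop fsck /tong/data/resource -files -blocks | tail -n 20
```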
This time, the mappers no longer hang.
In conclusion, when some mappers hang (or, more generally, when a Java process is stuck), we should first check CPU, network I/O, and memory usage. If nothing looks wrong there, we can then run `jstack` on the process to locate where it is stuck and track down the problem.
References:
- Tuning Hadoop on Dell PowerEdge Servers
© 2014-2017 jason4zhu.blogspot.com All Rights Reserved
If republishing, please credit the origin: Jason4Zhu