jps
-- Lists all Java processes and related information, such as the fully qualified main class name and the arguments passed to the main method.

[hadoop@m15018 ~]$ jps -ml
30134 org.apache.hadoop.hdfs.server.datanode.DataNode
15141 org.apache.hadoop.mapred.YarnChild 192.168.7.27 28170 attempt_1417419796124_63329_m_000018_1 97
1820 org.apache.hadoop.mapred.YarnChild 192.168.7.14 36058 attempt_1417419796124_63299_m_000037_0 33
28191 org.apache.hadoop.mapred.YarnChild 192.168.7.21 55163 attempt_1417419796124_63322_m_000018_0 21
15275 org.apache.hadoop.mapreduce.v2.app.MRAppMaster
18325 org.apache.hadoop.mapred.YarnChild 192.168.7.55 6926 attempt_1417419796124_63255_m_000014_1 166
10693 org.apache.hadoop.mapred.YarnChild 192.168.7.85 20883 attempt_1417419796124_63282_m_000055_0 34
6001 org.apache.hadoop.mapred.YarnChild 192.168.7.17 9187 attempt_1417419796124_62366_m_000050_0 64
30311 org.apache.hadoop.yarn.server.nodemanager.NodeManager
30111 org.apache.hadoop.mapred.YarnChild 192.168.7.75 23820 attempt_1417419796124_63324_m_000047_0 29
28712 org.apache.hadoop.mapreduce.v2.app.MRAppMaster
4474 org.apache.hadoop.mapred.YarnChild 192.168.7.29 49996 attempt_1417419796124_62449_r_000019_0 101
6041 org.apache.hadoop.mapred.YarnChild 192.168.7.17 9187 attempt_1417419796124_62366_r_000008_0 57
6792 org.apache.hadoop.mapred.YarnChild 192.168.7.20 56878 attempt_1417419796124_63313_m_000016_0 18
25847 org.apache.hadoop.mapred.YarnChild 192.168.7.46 8277 attempt_1417419796124_63290_m_000005_0 7
6089 org.apache.hadoop.mapred.YarnChild 192.168.7.17 9187 attempt_1417419796124_62366_r_000005_0 50
20277 org.apache.hadoop.mapred.YarnChild 192.168.7.26 45093 attempt_1417419796124_63268_m_000014_0 20
5578 org.apache.hadoop.mapred.YarnChild 192.168.7.72 15929 attempt_1417419796124_63271_m_000011_0 14
26194 org.apache.hadoop.mapred.YarnChild 192.168.7.46 8277 attempt_1417419796124_63290_m_000011_0 13
18747 sun.tools.jps.Jps -ml
Where:
-m  Output the arguments passed to the main method. The output may be null for embedded JVMs.
-l  Output the full package name for the application's main class or the full path name to the application's JAR file.
-v  Output the arguments passed to the JVM.
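Because the output is one process per line, it combines well with ordinary text tools. As a minimal sketch (the class name is simply the DataNode entry from the listing above), the PID of a specific daemon can be extracted like this:

# Sketch: extract the PID of the DataNode daemon from the jps listing.
jps -ml | grep org.apache.hadoop.hdfs.server.datanode.DataNode | awk '{print $1}'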
jstack
-- Shows the current stack traces of a running Java process.

[hadoop@K1213 ~]$ jstack 19552
2014-12-31 10:21:05
Full thread dump Java HotSpot(TM) 64-Bit Server VM (23.6-b04 mixed mode):

"Attach Listener" daemon prio=10 tid=0x000000000a3ae000 nid=0x81a waiting on condition [0x0000000000000000]
   java.lang.Thread.State: RUNNABLE

"ResponseProcessor for block BP-714842383-192.168.7.11-1393991369860:blk_1111587326_1099569551548" daemon prio=10 tid=0x00002aaab8293000 nid=0x7098 runnable [0x00000000417e5000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x000000077742ab40> (a sun.nio.ch.Util$2)
    - locked <0x000000077742ab50> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000077742aaf8> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:161)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:131)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:118)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at java.io.FilterInputStream.read(FilterInputStream.java:83)
    at org.apache.hadoop.hdfs.protocolPB.PBHelper.vintPrefixed(PBHelper.java:1490)
    at org.apache.hadoop.hdfs.protocol.datatransfer.PipelineAck.readFields(PipelineAck.java:116)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer$ResponseProcessor.run(DFSOutputStream.java:721)

"DataStreamer for file /user/monitor/test_with_reduce_zhudi/_temporary/1/_temporary/attempt_1417419796124_63366_m_000033_0/0-m-00033 block BP-714842383-192.168.7.11-1393991369860:blk_1111587326_1099569551548" daemon prio=10 tid=0x00002aaab4cbb000 nid=0x4ca3 runnable [0x0000000040316000]
   java.lang.Thread.State: RUNNABLE
    at sun.nio.ch.EPollArrayWrapper.epollWait(Native Method)
    at sun.nio.ch.EPollArrayWrapper.poll(EPollArrayWrapper.java:228)
    at sun.nio.ch.EPollSelectorImpl.doSelect(EPollSelectorImpl.java:81)
    at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:87)
    - locked <0x000000077742e4e8> (a sun.nio.ch.Util$2)
    - locked <0x000000077742e4f8> (a java.util.Collections$UnmodifiableSet)
    - locked <0x000000077742e4a0> (a sun.nio.ch.EPollSelectorImpl)
    at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:98)
    at org.apache.hadoop.net.SocketIOWithTimeout$SelectorPool.select(SocketIOWithTimeout.java:335)
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:157)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:159)
    at org.apache.hadoop.net.SocketOutputStream.write(SocketOutputStream.java:117)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:122)
    - locked <0x0000000747910230> (a java.io.BufferedOutputStream)
    at java.io.DataOutputStream.write(DataOutputStream.java:107)
    - locked <0x0000000747910248> (a java.io.DataOutputStream)
    at org.apache.hadoop.hdfs.DFSOutputStream$Packet.writeTo(DFSOutputStream.java:278)
    at org.apache.hadoop.hdfs.DFSOutputStream$DataStreamer.run(DFSOutputStream.java:568)
...
"main" prio=10 tid=0x0000000009fad000 nid=0x4c61 in Object.wait() [0x000000004089e000] java.lang.Thread.State: WAITING (on object monitor) at java.lang.Object.wait(Native Method) at java.lang.Object.wait(Object.java:503) at org.apache.hadoop.hdfs.DFSOutputStream.waitAndQueueCurrentPacket(DFSOutputStream.java:1475) - locked <0x00000007770010a0> (a java.util.LinkedList) at org.apache.hadoop.hdfs.DFSOutputStream.writeChunk(DFSOutputStream.java:1543) - locked <0x00000007770abdc8> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.fs.FSOutputSummer.writeChecksumChunk(FSOutputSummer.java:175) at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:150) - locked <0x00000007770abdc8> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.fs.FSOutputSummer.flushBuffer(FSOutputSummer.java:139) - eliminated <0x00000007770abdc8> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.fs.FSOutputSummer.write1(FSOutputSummer.java:130) at org.apache.hadoop.fs.FSOutputSummer.write(FSOutputSummer.java:104) - locked <0x00000007770abdc8> (a org.apache.hadoop.hdfs.DFSOutputStream) at org.apache.hadoop.fs.FSDataOutputStream$PositionCache.write(FSDataOutputStream.java:59) at java.io.DataOutputStream.write(DataOutputStream.java:107) - locked <0x0000000777294690> (a org.apache.hadoop.hdfs.client.HdfsDataOutputStream) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.writeObject(TextOutputFormat.java:83) at org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter.write(TextOutputFormat.java:104) - locked <0x000000077722f338> (a org.apache.hadoop.mapreduce.lib.output.TextOutputFormat$LineRecordWriter) at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:433) at org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.write(MultipleOutputs.java:411) at AdMonitorDMReaderJob$Mapper.map(AdMonitorDMReaderJob.java:45) at AdMonitorDMReaderJob$Mapper.map(AdMonitorDMReaderJob.java:30) at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145) at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:772) at org.apache.hadoop.mapred.MapTask.run(MapTask.java:339) at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:162) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:415) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1491) at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:157) "VM Thread" prio=10 tid=0x000000000a02c000 nid=0x4c62 runnable "VM Periodic Task Thread" prio=10 tid=0x00002aaab401a800 nid=0x4c69 waiting on condition JNI global references: 276
jinfo
-- Lists all configuration info of a running Java process, including Java system properties and VM flags.

[hadoop@K1213 ~]$ jinfo 19552
Attaching to process ID 19552, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 23.6-b04
Java System Properties:

java.runtime.name = Java(TM) SE Runtime Environment
java.vm.version = 23.6-b04
sun.boot.library.path = /usr/java/jdk1.7.0_11/jre/lib/amd64
hadoop.root.logger = INFO,CLA
java.vendor.url = http://java.oracle.com/
java.vm.vendor = Oracle Corporation
path.separator = :
file.encoding.pkg = sun.io
java.vm.name = Java HotSpot(TM) 64-Bit Server VM
sun.os.patch.level = unknown
sun.java.launcher = SUN_STANDARD
user.country = CN
user.dir = /home/data4/hdfsdir/nm-local-dir/usercache/monitor/appcache/application_1417419796124_63366/container_1417419796124_63366_01_000063
java.vm.specification.name = Java Virtual Machine Specification
java.runtime.version = 1.7.0_11-b21
java.awt.graphicsenv = sun.awt.X11GraphicsEnvironment
os.arch = amd64
java.endorsed.dirs = /usr/java/jdk1.7.0_11/jre/lib/endorsed
line.separator =
java.io.tmpdir = /home/data4/hdfsdir/nm-local-dir/usercache/monitor/appcache/application_1417419796124_63366/container_1417419796124_63366_01_000063/tmp
yarn.app.container.log.dir = /home/workspace/hadoop/logs/userlogs/application_1417419796124_63366/container_1417419796124_63366_01_000063
java.vm.specification.vendor = Oracle Corporation
os.name = Linux
log4j.configuration = container-log4j.properties
sun.jnu.encoding = UTF-8
java.library.path = /home/data4/hdfsdir/nm-local-dir/usercache/monitor/appcache/application_1417419796124_63366/container_1417419796124_63366_01_000063:/home/workspace/hadoop/lib/native:/usr/java/packages/lib/amd64:/usr/lib64:/lib64:/lib:/usr/lib
java.specification.name = Java Platform API Specification
java.class.version = 51.0
sun.management.compiler = HotSpot 64-Bit Tiered Compilers
os.version = 2.6.18-308.el5
yarn.app.container.log.filesize = 209715200
user.home = /home/hadoop
user.timezone = Asia/Shanghai
java.awt.printerjob = sun.print.PSPrinterJob
file.encoding = UTF-8
java.specification.version = 1.7
user.name = hadoop
java.class.path = ...
java.vm.specification.version = 1.7
sun.arch.data.model = 64
sun.java.command = org.apache.hadoop.mapred.YarnChild 192.168.7.86 14667 attempt_1417419796124_63366_m_000033_0 63
java.home = /usr/java/jdk1.7.0_11/jre
user.language = zh
java.specification.vendor = Oracle Corporation
awt.toolkit = sun.awt.X11.XToolkit
java.vm.info = mixed mode
java.version = 1.7.0_11
java.ext.dirs = /usr/java/jdk1.7.0_11/jre/lib/ext:/usr/java/packages/lib/ext
sun.boot.class.path = /usr/java/jdk1.7.0_11/jre/lib/resources.jar:/usr/java/jdk1.7.0_11/jre/lib/rt.jar:/usr/java/jdk1.7.0_11/jre/lib/sunrsasign.jar:/usr/java/jdk1.7.0_11/jre/lib/jsse.jar:/usr/java/jdk1.7.0_11/jre/lib/jce.jar:/usr/java/jdk1.7.0_11/jre/lib/charsets.jar:/usr/java/jdk1.7.0_11/jre/lib/jfr.jar:/usr/java/jdk1.7.0_11/jre/classes
java.vendor = Oracle Corporation
file.separator = /
java.vendor.url.bug = http://bugreport.sun.com/bugreport/
sun.io.unicode.encoding = UnicodeLittle
sun.cpu.endian = little
sun.cpu.isalist =

VM Flags:
-XX:+UseSerialGC -Xms1024M -Xmx3096m -XX:PermSize=64m -XX:MaxPermSize=128M -Djava.io.tmpdir=/home/data4/hdfsdir/nm-local-dir/usercache/monitor/appcache/application_1417419796124_63366/container_1417419796124_63366_01_000063/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/home/workspace/hadoop/logs/userlogs/application_1417419796124_63366/container_1417419796124_63366_01_000063 -Dyarn.app.container.log.filesize=209715200 -Dhadoop.root.logger=INFO,CLA
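When only a single flag is of interest, jinfo can query it directly instead of printing everything, and manageable flags can even be toggled on a live process. A sketch against the same PID:

# Query one VM flag instead of dumping the full configuration.
jinfo -flag MaxHeapSize 19552

# HeapDumpOnOutOfMemoryError is a manageable flag, so it can be enabled at runtime.
jinfo -flag +HeapDumpOnOutOfMemoryError 19552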
strace
-- Traces the system calls a process makes (Linux). Unlike the j* tools, it works for any Linux process, not just JVMs.

[hadoop@K1213 ~]$ strace -p 19552
Process 19552 attached - interrupt to quit
futex(0x408a09d0, FUTEX_WAIT, 19553, NULL <unfinished ...>
Process 19552 detached

[hadoop@K1213 ~]$ strace -p 19553
Process 19553 attached - interrupt to quit
futex(0x9fada54, FUTEX_WAIT_PRIVATE, 1747, NULL) = 0
futex(0x9fada28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x9fada54, FUTEX_WAIT_PRIVATE, 1749, NULL) = 0
futex(0x9fada28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x9fada54, FUTEX_WAIT_PRIVATE, 1751, NULL) = 0
futex(0x9fada28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x9fada54, FUTEX_WAIT_PRIVATE, 1753, NULL) = 0
futex(0x9fada28, FUTEX_WAKE_PRIVATE, 1) = 0
futex(0x9fada54, FUTEX_WAIT_PRIVATE, 1755, NULL) = 0
futex(0x9fada28, FUTEX_WAKE_PRIVATE, 1) = 0
read(160, "\0\2\4\4\0\31", 6) = 6
read(160, "\t\0\0 \5\0\0\0\0\21\220\2\0\0\0\0\0\0\30\0%\0\0\2\0\243\255X\321\20\314g"..., 132121) = 68529
read(160, 0x2aaab4ca2457, 63592) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(175, EPOLL_CTL_ADD, 160, {EPOLLIN, {u32=160, u64=12336768231720091808}}) = 0
epoll_wait(175, {{EPOLLIN, {u32=160, u64=12336768231720091808}}}, 8192, 3600000) = 1
epoll_ctl(175, EPOLL_CTL_DEL, 160, {0, {u32=160, u64=719115788937592992}}) = 0
epoll_wait(175, {}, 8192, 0) = 0
read(160, "\332a\375\f\237\331\235\31YD2\304\5\362;#\232A\225\37?\203<w0%\371%S/\275\232"..., 63592) = 10136
read(160, 0x2aaab4ca4bef, 53456) = -1 EAGAIN (Resource temporarily unavailable)
epoll_ctl(175, EPOLL_CTL_ADD, 160, {EPOLLIN, {u32=160, u64=4371809895922532512}}) = 0
...
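The raw stream gets noisy quickly, so strace's filtering options are worth knowing. A sketch using standard strace flags, with the PID from the session above as a placeholder:

# -f follows all threads, -tt adds microsecond timestamps,
# -e trace=network keeps only network-related syscalls,
# and -o writes the trace to a file instead of the terminal.
strace -f -tt -e trace=network -o strace.out -p 19552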
jstat
-- Java Virtual Machine Statistics Monitoring Tool, used to monitor compilation, GC, and other runtime statistics. The way I use it most often is to get GC statistics from a running Java process:
[hadoop@K1213 ~]$ jstat -gc 11036
 S0C      S1C      S0U    S1U   EC        EU        OC        OU        PC       PU       YGC   YGCT    FGC   FGCT    GCT
34944.0  34944.0  763.4  0.0   279616.0  218484.1  699072.0  572871.8  65536.0  22317.0   12    0.423   0     0.000   0.423

[hadoop@K1213 ~]$ jstat -gcutil 11036
 S0     S1     E      O      P      YGC   YGCT    FGC   FGCT    GCT
 0.00   7.17   93.65  81.95  34.06  13    0.432   0     0.000   0.432
The description of all the columns is listed here, and more detailed usage of 'jstat' can be found on that webpage. For the commands above, I always pay the most attention to FGC (the number of full GCs). If FGC is a relatively large number, chances are that something is wrong in our code, such as a memory leak.
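jstat can also sample on its own schedule, which makes FGC growth easy to watch over time: append an interval in milliseconds and a sample count, as sketched below.

# Print -gcutil statistics every 1000 ms, 10 samples in total;
# a steadily climbing FGC column hints at memory pressure or a leak.
jstat -gcutil 11036 1000 10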
jmap
-- Retrieves current heap info, including loaded classes, object counts, occupied memory, etc. It is much like getting the information from 'jhat', but at runtime.

[hadoop@k1291 ~]$ jmap -histo 11711 | head -n 50

 num     #instances         #bytes  class name
----------------------------------------------
   1:         22834      144531152  [B
   2:        699375      123144384  [D
   3:       1153782       94102184  [Ljava.util.HashMap$Entry;
   4:       1827000       89984520  [C
   5:       1148737       64329272  java.util.HashMap
   6:        331311       58298464  [J
   7:       1423855       34172520  java.lang.String
   8:        331138       29140144  com.miaozhen.yo.tcpreporter.report.Counter
   9:        659720       21111040  java.util.HashMap$Entry
  10:        271052       13010496  java.util.StringTokenizer
  11:        151911       10937592  com.miaozhen.yo.tcpreporter.report.RFPanel
  12:        243543        9741720  java.util.TreeMap$Entry
  13:         47619        6636000  <constMethodKlass>
  14:         47619        6486248  <methodKlass>
  15:        236361        5672664  java.lang.Long
  16:        168142        5380544  com.miaozhen.yo.tcpreporter.history.HistoryCount
  17:          3844        4571976  <constantPoolKlass>
  18:        165569        3973656  com.miaozhen.yo.tcpreporter.report.RFCounter
  19:         92516        3700640  java.util.HashMap$EntryIterator
  20:        146659        3519816  java.lang.StringBuffer
  21:         75587        3023480  com.miaozhen.app.MzSequenceFile$SMeta
  22:          3844        2944464  <instanceKlassKlass>
  23:        105924        2542176  java.lang.StringBuilder
  24:          3185        2520192  <constantPoolCacheKlass>
  25:          3288        2124824  [I
  26:         42750        1710000  sun.misc.FloatingDecimal
  27:         17819        1298352  [Ljava.lang.Object;
  28:          9867        1105104  com.miaozhen.yo.tcpreporter.Purelog
  29:         41796        1003104  com.miaozhen.tools.MyDouble
  30:         51896         830336  java.util.HashMap$EntrySet
  31:         33984         815616  java.lang.Double
  32:         15043         633656  [Ljava.lang.String;
  33:          1291         593104  <methodDataKlass>
  34:         11493         551664  java.nio.HeapByteBuffer
  35:         11487         551376  java.nio.HeapCharBuffer
  36:         33945         543120  com.miaozhen.yo.tcpreporter.history.HistoryLevelOne
  37:          4158         503104  java.lang.Class
  38:         19564         469536  java.util.Date
  39:          9777         469296  java.util.TreeMap$AscendingSubMap
  40:         11487         459480  java.util.ArrayList$SubList
  41:         11487         459480  java.util.ArrayList$SubList$1
  42:          6415         407520  [S
  43:          6775         369960  [[I
  44:         11816         283584  java.util.ArrayList
  45:         15982         255712  com.miaozhen.yo.tcpreporter.history.HistoryLevelTwo
  46:         10619         254856  org.apache.hadoop.io.Text
  47:          1571         175952  org.apache.hadoop.hdfs.protocol.DatanodeInfo
[hadoop@k1291 ~]$ jmap -heap 11711
Attaching to process ID 11711, please wait...
Debugger attached successfully.
Server compiler detected.
JVM version is 23.6-b04

using thread-local object allocation.
Mark Sweep Compact GC

Heap Configuration:
   MinHeapFreeRatio = 40
   MaxHeapFreeRatio = 70
   MaxHeapSize      = 3221225472 (3072.0MB)
   NewSize          = 1310720 (1.25MB)
   MaxNewSize       = 17592186044415 MB
   OldSize          = 5439488 (5.1875MB)
   NewRatio         = 2
   SurvivorRatio    = 8
   PermSize         = 21757952 (20.75MB)
   MaxPermSize      = 85983232 (82.0MB)
   G1HeapRegionSize = 0 (0.0MB)

Heap Usage:
New Generation (Eden + 1 Survivor Space):
   capacity = 316669952 (302.0MB)
   used     = 276925888 (264.09710693359375MB)
   free     = 39744064 (37.90289306640625MB)
   87.44937315681912% used
Eden Space:
   capacity = 281542656 (268.5MB)
   used     = 266371136 (254.03131103515625MB)
   free     = 15171520 (14.46868896484375MB)
   94.61128902612896% used
From Space:
   capacity = 35127296 (33.5MB)
   used     = 10554752 (10.0657958984375MB)
   free     = 24572544 (23.4342041015625MB)
   30.04715193563433% used
To Space:
   capacity = 35127296 (33.5MB)
   used     = 0 (0.0MB)
   free     = 35127296 (33.5MB)
   0.0% used
tenured generation:
   capacity = 703594496 (671.0MB)
   used     = 625110416 (596.1517486572266MB)
   free     = 78484080 (74.84825134277344MB)
   88.84526805621856% used
Perm Generation:
   capacity = 25690112 (24.5MB)
   used     = 25664256 (24.475341796875MB)
   free     = 25856 (0.024658203125MB)
   99.89935427295919% used

8870 interned Strings occupying 738840 bytes.
The following command dumps the heap to a file, which can then be analysed with 'jhat':
jmap -dump:live,format=b,file=/home/data8/heapdump.log [PID]
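Note that the live option makes jmap trigger a full GC first and count only reachable objects; comparing histograms with and without it is a quick way to gauge how much of the heap is collectable garbage. A sketch, reusing the PID from above:

# Histogram of all objects vs. only live (reachable) objects.
jmap -histo 11711 | head -n 10
jmap -histo:live 11711 | head -n 10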
HeapDumpOnOutOfMemoryError
When facing an OutOfMemoryError in heap space, we can simply add the following arguments to our Java program at startup. This way, a snapshot of the heap is dumped out when the OOM occurs, and the `jhat` command can then be applied to the HeapDump file for analysis.

-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/data8/heapspaceerror.out
FYI: if we intend to specify the above arguments in a MapReduce program, we should do it in the following way:
hadoop jar mr.jar MainClass -Dmapreduce.map.java.opts="-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/home/data8/heapspaceerror.out" argu1 argu2
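For a standalone (non-MapReduce) program, the same flags go directly on the java command line. A minimal sketch, where the small heap size and the main class are placeholders chosen only to make an OOM easy to reproduce:

# With a deliberately small heap, the OutOfMemoryError triggers quickly and the
# JVM writes the snapshot to the given path, ready for jhat.
java -Xmx64m -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp/heapspaceerror.out SomeMemoryHungryMain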
jhat
-- HAT, short for Heap Analysis Tool, is used to analyse HeapDump files.

$ jhat heapdump.out
Reading from log.log...
Dump file created Wed Mar 14 12:01:45 CST 2012
Snapshot read, resolving...
Resolving 6762 objects...
Chasing references, expect 1 dots.
Eliminating duplicate references.
Snapshot resolved.
Started HTTP server on port 7000
Server is ready.
When the server is ready, we can check the result in a browser at http://hostname:7000/. The page lists every class that exists in the heap, which by itself is of limited use. Luckily, some portals that provide more intuitive analytic results can be found at the end of the page.
We can step into "Show instance counts for all classes" to see the instance count of every class individually, sorted on the page.
In the current scenario, we can see there are too many LoggingEvent instances, which is what caused the OOM in heap space; this gives us a hint as to where the problem might be.
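One practical note: jhat loads the entire dump into its own heap, so a large dump may need a bigger JVM for jhat itself, and the HTTP port can be changed if 7000 is occupied. A sketch:

# Give jhat a 2 GB heap of its own and serve the report on port 7777.
jhat -J-Xmx2g -port 7777 heapdump.out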
If reposting, please credit the origin: Jason4Zhu