Friday, June 12, 2015

Troubleshooting SSH Hangs and a Stuck Machine When Massive IO-Intensive Processes Are Running

It is inevitable that some users will inadvertently write and run destructive code on shared, public-facing machines. Hence, we should make our machines as robust as possible.

Today, one of our users forked as many processes as possible, all executing `hadoop get`. Together they consumed all of our IO capacity, even though I had already constrained this user's CPU and memory via `cgroups`. At that point we could neither interact with the shell prompt nor ssh into the machine. I managed to run `top` before the machine became completely stuck. This is what it showed:
top - 18:26:10 up 238 days,  5:43,  3 users,  load average: 1782.01, 1824.47, 1680.36
Tasks: 1938 total,   1 running, 1937 sleeping,   0 stopped,   0 zombie
Cpu(s):  2.4%us,  3.0%sy,  0.0%ni,  0.0%id, 94.5%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  65923016k total, 65698400k used,   224616k free,    13828k buffers
Swap: 33030136k total, 17799704k used, 15230432k free,   157316k cached

As you can see, the load average is extraordinarily high and %wa is above 90 (%wa is iowait: the percentage of time the CPU sat idle waiting for I/O to complete). On Linux, processes blocked in uninterruptible I/O count toward the load average, which is why it can reach four digits while the CPU is doing almost no work. Meanwhile, almost all physical memory is in use, along with about 17 GB of swap. Evidently, this is caused by the sheer number of processes running `hadoop get`.
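
To make the %wa figure concrete, here is a minimal sketch of how an iowait percentage can be derived from two samples of the aggregate `cpu` line in `/proc/stat`. The function names (`cpu_times`, `iowait_percent`, `sample_iowait`) are made up for this illustration, and `top` itself may compute the value slightly differently.

```python
import time


def cpu_times(stat_line):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    fields = stat_line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(v) for v in fields[1:]]


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = [b - a for a, b in zip(cpu_times(before), cpu_times(after))]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice.
    # Guest time is already included in user/nice, so only sum the first eight.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def sample_iowait(interval=1.0, path="/proc/stat"):
    """Read /proc/stat twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return iowait_percent(first, second)
```

Calling `sample_iowait()` on a Linux box returns a number comparable to the %wa column; a value that stays near 90 means the CPUs are mostly idle while waiting on disk or network I/O.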

Eventually, I capped the number of processes this user can fork and the number of files the user can open. With these limits in place, a similar runaway job should be contained to an acceptable level.

The configuration lives in '/etc/security/limits.conf', which must be edited as root. Append the following lines to the file:
username      soft    nofile  5000
username      hard    nofile  5000
username      soft    nproc   100
username      hard    nproc   100
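
The limits in that file are applied by `pam_limits` when a session starts, so they only affect new logins. As a quick check, the following sketch (run as the restricted user after logging in again) prints the limits the current process actually inherited. The helper names `describe_limits` and `fmt` are invented for this example.

```python
import resource


def describe_limits():
    """Return {name: (soft, hard)} for the two limits set in limits.conf."""
    return {
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
    }


def fmt(value):
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


for name, (soft, hard) in describe_limits().items():
    print("%-6s soft=%s hard=%s" % (name, fmt(soft), fmt(hard)))
```

If the configuration took effect, the `nofile` line should report 5000 and the `nproc` line 100 for both soft and hard values.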

Also, there's a nice post on how to catch the culprit causing high IO wait in Linux.
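
One common way to narrow down such a culprit is to list the processes currently in uninterruptible sleep (state `D`), since those are the ones blocked on I/O. The sketch below does this by reading `/proc/<pid>/stat`. It is not necessarily the approach taken in that post, and the function names are invented for this illustration.

```python
import os


def parse_stat(stat_text):
    """Return (pid, comm, state) from the contents of /proc/<pid>/stat."""
    # comm is wrapped in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the last ')'.
    open_idx = stat_text.index("(")
    close_idx = stat_text.rindex(")")
    pid = int(stat_text[:open_idx])
    comm = stat_text[open_idx + 1:close_idx]
    state = stat_text[close_idx + 1:].split()[0]
    return pid, comm, state


def uninterruptible_processes(proc_root="/proc"):
    """List (pid, comm) for every process currently in state D (Linux only)."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as f:
                pid, comm, state = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were scanning
        if state == "D":
            found.append((pid, comm))
    return found
```

Running `uninterruptible_processes()` a few times in a row and seeing the same command names repeatedly is a reasonable hint about which program is saturating I/O.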

Reference:
1. Limiting Maximum Number of Processes Available for the Oracle User
2. how to change the maximum number of fork process by user in linux
3. ulimit: difference between hard and soft limits
4. High on %wa from top command, is there any way to constrain it
5. How to ensure ssh via cgroups on centos
6. fork and exec in bash


