Today one of our users forked as many processes as possible, all running `hadoop get`. This ate up all of our IO resources, even though I had already constrained this user's CPU and memory via `cgroups`. The symptom was that we could no longer interact with the shell prompt or ssh into the machine. I managed to run `top` before the machine became fully stuck. This is what it showed:
    top - 18:26:10 up 238 days, 5:43, 3 users, load average: 1782.01, 1824.47, 1680.36
    Tasks: 1938 total, 1 running, 1937 sleeping, 0 stopped, 0 zombie
    Cpu(s): 2.4%us, 3.0%sy, 0.0%ni, 0.0%id, 94.5%wa, 0.0%hi, 0.0%si, 0.0%st
    Mem: 65923016k total, 65698400k used, 224616k free, 13828k buffers
    Swap: 33030136k total, 17799704k used, 15230432k free, 157316k cached
As you can see, the load average is at an unacceptable peak, and %wa is above 90 (%wa, iowait: the share of time the CPU has been waiting for I/O to complete). Meanwhile, almost all physical memory is in use, along with about 17GB of swap space. Evidently, this is caused by the excessive number of processes running the `hadoop get` command.
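To make the reading of that summary concrete, here is a minimal Python sketch that pulls the `%wa` figure out of a `Cpu(s)` line in the procps format quoted above. The function name `parse_cpu_line` and the 50% threshold are assumptions made for illustration; they are not part of `top` or of any monitoring tool.

```python
import re


def parse_cpu_line(line):
    """Return a dict such as {'us': 2.4, 'wa': 94.5} from a top 'Cpu(s)' line."""
    # Each field looks like '94.5%wa' in the format quoted in this post.
    return {name: float(value) for value, name in re.findall(r"([\d.]+)%(\w+)", line)}


sample = "Cpu(s): 2.4%us, 3.0%sy, 0.0%ni, 0.0%id, 94.5%wa, 0.0%hi, 0.0%si, 0.0%st"
fields = parse_cpu_line(sample)

# Hypothetical threshold: treat the host as IO-bound when iowait exceeds 50%.
io_bound = fields["wa"] > 50.0
print(fields["wa"], io_bound)
```

Newer versions of `top` print the fields in a different layout (for example `94.5 wa`), so the regular expression would need adjusting there.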
Eventually, I capped both the number of processes the user can fork and the number of files the user can open. With these limits in place, the situation should be reduced to an acceptable level.
The configuration lives in `/etc/security/limits.conf`, which has to be edited as root. Append the following lines to the file:
    username soft nofile 5000
    username hard nofile 5000
    username soft nproc 100
    username hard nproc 100
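Entries in `limits.conf` are applied by `pam_limits` when a session starts, so they only affect new logins. One way to check that they took effect is to log in again as the restricted user and read the limits of the current process. The sketch below uses Python's standard `resource` module; the helper names `current_limits` and `describe` are my own and purely illustrative.

```python
import resource


def current_limits():
    """Return {'nproc': (soft, hard), 'nofile': (soft, hard)} for this process."""
    return {
        "nproc": resource.getrlimit(resource.RLIMIT_NPROC),
        "nofile": resource.getrlimit(resource.RLIMIT_NOFILE),
    }


def describe(value):
    # RLIM_INFINITY means no limit is configured.
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


for name, (soft, hard) in current_limits().items():
    print(f"{name}: soft={describe(soft)} hard={describe(hard)}")
```

Run from a fresh session of the limited user, this should report the values configured above, assuming nothing else in the login path overrides them.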
Also, there's a nice post on how to catch the culprit causing high IO wait in Linux.
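As a rough illustration of one such approach (not necessarily the one that post describes), processes stuck waiting on IO usually sit in uninterruptible sleep, shown as state `D`, and Linux counts those tasks in the load average. The Python sketch below lists them by reading `/proc/<pid>/stat`; the function names are hypothetical and the script assumes a Linux `/proc` filesystem.

```python
import os


def parse_stat(text):
    """Extract (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command is wrapped in parentheses and may itself contain spaces or
    # parentheses, so split on the last ')' rather than on whitespace.
    head, _, tail = text.rpartition(")")
    pid, _, comm = head.partition(" (")
    return int(pid), comm, tail.split()[0]


def blocked_processes(proc_root="/proc"):
    """Return (pid, command) pairs currently in uninterruptible sleep ('D')."""
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "stat")) as handle:
                pid, comm, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited mid-read or its stat was unreadable
        if state == "D":
            found.append((pid, comm))
    return found


if __name__ == "__main__":
    if os.path.isdir("/proc"):
        for pid, comm in blocked_processes():
            print(pid, comm)
```

A single snapshot can miss short waits, so sampling this a few times and counting which commands recur gives a more reliable picture.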