Wednesday, July 11, 2018

MR/TEZ作业数据倾斜导致OOM问题排查思路

对于如下sql,可能会出现最后几个reducer task失败,日志显示OutOfMemory Exception.
select *
from a
join b
on a.uid = b.uid


这种情况一般是由于join的key存在严重倾斜导致的,所以需要分别看下在a表和b表里,uid的分布情况:
select uid, count(1) as cnt
from a
group by uid
order by cnt desc
limit 1000
实际情况可能是如下图所示。


如果在sql中通过where filter去掉这些uid,则任务成功。对于这些倾斜的value,可以分开单独处理(通过增加reducer内存等方式)。

Friday, July 6, 2018

udf包污染导致tez任务在am端log4j stackoverlow问题排查过程

发现有log4j-over-slf4j,和slf4j-log4j12, 两者会循环引用,导致log4j相关的stackoverflow.

报错如下:
Exception in thread "main" java.lang.StackOverflowError
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:39)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)

exclude掉log4j-over-slf4j和log-to-slf4j即可。

反思:在udf中引用任何包后都要通过`mvn dependency:tree`和`jar -tf ...`命令查看引用的包包含的package和namespace,以防止后期的包污染问题扩散。

REF