Wednesday, July 11, 2018

Troubleshooting OOM caused by data skew in MR/Tez jobs

For a query like the one below, the last few reducer tasks may fail, with OutOfMemory errors in the logs.
select *
from a
join b
on a.uid = b.uid


This is usually caused by severe skew in the join key, so the first step is to look at how uid is distributed in table a and in table b:
select uid, count(1) as cnt
from a
group by uid
order by cnt desc
limit 1000
In practice the distribution may look like the figure below.


If those skewed uid values are removed with a where filter, the job succeeds. The skewed values can then be handled separately (for example with more reducer memory).
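One way to do that, as a minimal sketch: run the non-skewed keys as a normal join and rerun only the skewed keys with a larger Tez container (hive.tez.container.size is the Hive-on-Tez container size in MB). The value 0 below is a hypothetical skewed uid found by the diagnosis query; adjust the filter and memory size to the actual case.

# Non-skewed keys: plain join.
hive -e "select * from a join b on a.uid = b.uid where a.uid != 0;"

# Skewed keys only: same join, but with a bigger Tez container for each task.
hive --hiveconf hive.tez.container.size=8192 \
     -e "select * from a join b on a.uid = b.uid where a.uid = 0;"

Hive's built-in runtime skew-join handling (hive.optimize.skewjoin, with hive.skewjoin.key as the row-count threshold) automates a similar split and is worth trying as well.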

Friday, July 6, 2018

How a polluted UDF jar caused a log4j StackOverflowError in the Tez AM

The classpath ended up containing both log4j-over-slf4j and slf4j-log4j12. The two bridges delegate to each other in a cycle, which blows the stack inside the log4j/slf4j calls.

The error looks like this:
Exception in thread "main" java.lang.StackOverflowError
at java.util.concurrent.ConcurrentHashMap.get(ConcurrentHashMap.java:936)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:39)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)
at org.apache.log4j.Category.<init>(Category.java:57)
at org.apache.log4j.Logger.<init>(Logger.java:37)
at org.apache.log4j.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:43)
at org.apache.log4j.LogManager.getLogger(LogManager.java:45)
at org.slf4j.impl.Log4jLoggerFactory.getLogger(Log4jLoggerFactory.java:64)
at org.slf4j.LoggerFactory.getLogger(LoggerFactory.java:358)

Excluding log4j-over-slf4j and log4j-to-slf4j fixes it.

Takeaway: whenever a dependency is added to a UDF, check which packages and namespaces it pulls in with `mvn dependency:tree` and `jar -tf ...`, so that jar pollution is caught before it spreads.
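For example, checks roughly like the following would catch the two bridges before the UDF jar ships (the Maven coordinates are the standard ones for the bridges; the jar name is only illustrative):

# Which dependencies drag in the two bridges?
mvn dependency:tree -Dincludes=org.slf4j:log4j-over-slf4j
mvn dependency:tree -Dincludes=org.slf4j:slf4j-log4j12

# Does the packaged jar bundle both org.apache.log4j classes and the slf4j->log4j binding?
jar -tf platform_udf-1.0-SNAPSHOT.jar | grep -E 'org/apache/log4j|org/slf4j/impl' | head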


Thursday, June 28, 2018

Optimization strategies for data skew caused by multiple count(distinct) expressions

Hive executes count(distinct) roughly like this: each mapper deduplicates keys in a HashSet, then everything is sent to a single reducer for the final deduplication, so that one reducer becomes a single point of contention. With only one count(distinct) in the query, setting SET hive.groupby.skewindata = true; lets Hive rewrite the plan automatically and avoid the skew; with two or more, the parameter has no effect.
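For the single-distinct case the setting can simply be passed along with the query, e.g. this minimal sketch reusing the table and field names from the case below:

hive --hiveconf hive.groupby.skewindata=true \
     -e "select count(distinct field_b) from table_name where label_spam_user = 0;"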
A concrete case:
The SQL below used to run for about 11 hours every day; after rewriting it into the select count(1) from (select * from ... group by ...) tmp form, it finishes in 28 minutes.
--before optimization
select
COUNT(DISTINCT (CASE WHEN field_a = 1 THEN field_b ELSE NULL END)) / COUNT(DISTINCT field_b)
from
table_name
where
label_spam_user = 0
--after optimization
with tmp1 as(
select
field_b
from table_name
where field_a = 1
group by field_b
)
, tmp2 as(
select
count(1) as cnt_1
from tmp1
)
, tmp3 as(
select
field_b
from table_name
group by field_b
)
, tmp4 as(
select
count(1) as cnt_2
from tmp3
)
select
cnt_1/cnt_2
from tmp2
join tmp4 on 1=1;

Tuesday, June 12, 2018

Hive on Tez: troubleshooting jar conflicts


A Tez job failed. In the YARN logs the Driver-side error was: File does not exist: hdfs://lt-nameservice3.sy/tmp/hive/app/_tez_session_dir/09ff9062-cc3e-4cb3-bc8d-77c275266d94/.tez/application_1528552108294_273009/tez.session.local-resources.pb java.io.FileNotFoundException
Going into the ApplicationMaster log, the root error below is clearly a Guava conflict (Guava 21+ removed some methods and is no longer backward compatible).
Caused by: java.lang.NoSuchMethodError: com.google.common.base.Objects.toStringHelper(Ljava/lang/Object;)Lcom/google/common/base/Objects$ToStringHelper;
at org.apache.hadoop.metrics2.lib.MetricsRegistry.toString(MetricsRegistry.java:406)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at org.apache.hadoop.ipc.metrics.RpcMetrics.<init>(RpcMetrics.java:74)
at org.apache.hadoop.ipc.metrics.RpcMetrics.create(RpcMetrics.java:80)
at org.apache.hadoop.ipc.Server.<init>(Server.java:2213)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1029)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:537)
at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:512)
at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:874)
at org.apache.tez.dag.api.client.DAGClientServer.createServer(DAGClientServer.java:127)
at org.apache.tez.dag.api.client.DAGClientServer.serviceStart(DAGClientServer.java:79)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1838)
at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1859)

A quick note on where to find the Driver log and the ApplicationMaster log:
As shown in the figure, clicking the application id on the YARN page opens the Driver page, and the red box marks the Driver-side error; the ApplicationMaster log is only reached by clicking the blue box.


From there, searching the ApplicationMaster log for java.class.path gives the full classpath of the AM environment, and the node the AM ran on can be read from the log URL. Pull all of those jars down and find the one that carries the conflicting version. If the conflict comes from a dependency inside your own jar, shading or excluding it solves the problem.

Useful commands:

jar -tf platform_udf-1.0-SNAPSHOT.jar | grep -i com.google.common
javap -classpath platform_udf-1.0-SNAPSHOT.jar com.google.common.base.Objects
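
To scan a pile of jars pulled back from the AM node, a loop along these lines works (the local directory is hypothetical); any jar whose com.google.common.base.Objects still has toStringHelper is pre-Guava-21:

# Sketch: find which jars bundle Guava's Objects class and whether toStringHelper survives.
# ./am_classpath_jars is a hypothetical local copy of the jars from the AM's java.class.path.
for j in ./am_classpath_jars/*.jar; do
  if jar -tf "$j" | grep -q 'com/google/common/base/Objects.class'; then
    echo "== $j"
    javap -classpath "$j" com.google.common.base.Objects | grep toStringHelper
  fi
done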


Monday, May 28, 2018

Troubleshooting HTTP requests that occasionally hang for a long time and time out

Some users reported that refreshing a Hue page sometimes hangs for 2-5 minutes with no response. I hit it a few times myself; while a page was stuck, curl url:8890 -I -L hung in the same way without returning any content.
Cross-checking the incident timestamps against the Linux TCP metrics showed that at the moment of the hang the number of TCP connections spiked to 2-3k (ss -s), and TcpExt.ListenDrops and TcpExt.ListenOverflows were noticeably elevated:


At first this looked like untuned TCP-related sysctl parameters, so the following settings were added and the Hue service restarted, but the problem remained.
net.ipv4.tcp_tw_reuse = 1
net.ipv4.tcp_tw_recycle = 1
net.ipv4.tcp_fin_timeout = 10
net.ipv4.tcp_keepalive_time = 1200
net.ipv4.tcp_max_syn_backlog = 65535
net.ipv4.tcp_max_tw_buckets = 400000
net.ipv4.tcp_orphan_retries = 1
net.ipv4.tcp_retries2 = 5
net.core.netdev_max_backlog = 512000
net.ipv4.tcp_syn_retries = 3
net.ipv4.tcp_synack_retries = 3
net.ipv4.tcp_slow_start_after_idle = 0
vm.min_free_kbytes = 360448
vm.zone_reclaim_mode = 1
net.core.somaxconn = 65535
#net.netfilter.nf_conntrack_max = 1655350
#net.netfilter.nf_conntrack_tcp_timeout_established = 7200
#net.netfilter.nf_conntrack_tcp_timeout_time_wait = 10
kernel.shmall = 4294967296
kernel.shmmax = 68719476736
kernel.msgmax = 65536
kernel.msgmnb = 65536
The next question was which process was driving the connection spike. ss -n showed that most connections involved port 50010, and netstat -nlp | grep :50010 showed that the listener is the DataNode service.
dmesg then showed the following:
[12522087.589828] TCP: request_sock_TCP: Possible SYN flooding on port 50010. Sending cookies. Check SNMP counters.
[12547933.126955] TCP: request_sock_TCP: Possible SYN flooding on port 50010. Sending cookies. Check SNMP counters.
In other words, port 50010 had eaten up the connection capacity, so new Hue requests could not establish connections, which is exactly the hang and timeout seen by users (see: https://help.marklogic.com/Knowledgebase/Article/View/182/0/possible-syn-flooding-messages-in-system-logs).
On top of that, sar -n DEV 3 showed inbound traffic already at 1.2 GB/s, the machine's maximum, so saturated bandwidth was probably another contributing factor.
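For reference, the checks above boil down to roughly the following (port 50010 is from this case; exact counter names vary slightly across kernel versions):

ss -s                                   # overall socket/connection counts
netstat -s | grep -iE 'listen|SYN'      # listen-queue drop/overflow counters
ss -ant | grep -c ':50010'              # how many connections involve port 50010
netstat -nlp | grep ':50010'            # which process listens on 50010 (DataNode)
ss -lnt | grep ':50010'                 # Recv-Q vs Send-Q of the listener (backlog usage)
dmesg | grep -i 'SYN flooding'          # kernel SYN-flood / syncookie messages
sar -n DEV 3                            # per-interface throughput, sampled every 3s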
In the end, the co-located services were split out onto dedicated nodes and the problem went away.

Wednesday, May 16, 2018

Hue's Django backend returns 400 BAD REQUEST when the request Host is a domain but works with ip:port

After putting an nginx reverse proxy in front of two Hue web servers, wget dp-hue_nginx.company.com/hue returned 400 BAD REQUEST, while hitting a backend directly by ip:port with wget 192.168.1.121:8005/hue worked fine.

To narrow down the source, I ran the commands below directly against one Hue web server. The problem reproduced, which rules nginx out.
$ curl -I -H 'Host:dp-hue_nginx.corp.kuaishou.com' http://hue-webserver/accounts/login/?next=/
HTTP/1.1 400 BAD REQUEST
$ curl -I -H 'Host:127.0.0.1:8888' http://hue-webserver/accounts/login/?next=/
HTTP/1.1 200 OK
Googling turned up posts saying that setting allowed_hosts fixes this, but $HUE_HOME/desktop/conf/hue.ini already has that parameter at its default of '*'.
More googling led to a post showing this is a known Django issue: the domain in the request's Host header cannot contain an underscore.
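
A quick way to confirm, assuming a hyphenated alias can be pointed at the same nginx (the domain below is a hypothetical replacement):

curl -I -H 'Host: dp-hue-nginx.corp.kuaishou.com' 'http://hue-webserver/accounts/login/?next=/'
# expected: HTTP/1.1 200 OK once the Host contains no underscore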

Monday, May 7, 2018

Troubleshooting stuck or slow Spark jobs from the YARN monitoring pages


  • On the Stages tab, check how many executors the stuck/slow stage is using. If there are very few executors while the stage's shuffle read size or record count is large (Figure 1), the job most likely does not have spark.dynamicAllocation.enabled turned on. Enable it with the settings below (a spark-submit sketch that ties these settings together follows this list):

spark.dynamicAllocation.enabled true
spark.dynamicAllocation.initialExecutors 1
spark.dynamicAllocation.minExecutors 1
spark.dynamicAllocation.maxExecutors 300
spark.shuffle.service.enabled true
  • If a slow or stuck stage has exactly 200 tasks (Figure 2), suspect spark.sql.shuffle.partitions: it defaults to 200 and can be raised to a larger value such as 2011. Likewise, a stage with 12 tasks usually points to the spark.default.parallelism setting.

  • On the Executors page, a red background on Task Time (GC Time) means GC is taking too long. Setting spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+PrintGC at launch prints GC logs, which can then be read from each executor's stdout link in the executor list. For tuning, Spark recommends G1GC. If the problem persists even with G1GC, there may simply be too many tasks running concurrently inside one executor (each task is an operator, i.e. a lambda, so its input-to-output transformation can inflate the data in memory). For example, with executor.memory = 12G and executor.cores = 4, four tasks run in parallel and each gets roughly 3 GB on average; lowering the core count effectively gives each task more memory. In this case the GC log showed the heap had already expanded to the full 12 GB, meaning the tasks genuinely needed that much memory, so the only option was to reduce the core count and bring GC time down.

  • On the Jobs tab you can see the list of all stages, each with Shuffle Read and Shuffle Write values: the former is the amount of shuffle data read from the previous stage, the latter the amount written out for the next stage. These numbers give a rough estimate of how many tasks the current stage needs.
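
As referenced above, here is a spark-submit sketch that applies the settings discussed in these bullets at submit time; the application class, jar name, and concrete numbers are illustrative placeholders.

spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --class com.example.MyApp \
  --executor-memory 12g \
  --executor-cores 2 \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.dynamicAllocation.initialExecutors=1 \
  --conf spark.dynamicAllocation.minExecutors=1 \
  --conf spark.dynamicAllocation.maxExecutors=300 \
  --conf spark.shuffle.service.enabled=true \
  --conf spark.sql.shuffle.partitions=2011 \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UseG1GC" \
  my-app.jar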


