Thursday, June 28, 2018

多个count(distinct)导致data skew的优化策略

hive中对于count(distinct)的执行逻辑大体是,在mapper端用HashSet将key去重后,全部发送给1个reducer再做去重,这样的问题在于会有单点问题。如果只有一个count(distinct),则通过设置SET hive.groupby.skewindata = true;可以使执行逻辑自动优化避免data skew. 但如果有两个及以上则上述参数不会起作用。
具体case:
如下sql每天会执行11h,切换成优化后的select count(1) from (select * from ... group by ...) tmp之后,只需要28min即可。
--before optimization
select
COUNT(DISTINCT (CASE WHEN field_a = 1 THEN field_b ELSE NULL END)) / COUNT(DISTINCT field_b)
from
table_name
where
label_spam_user = 0
--after optimization
with tmp1 as(
select
field_b
from table_name
where field_a = 1
group by field_b
)
, tmp2 as(
select
count(1) as cnt_1
from tmp1
)
, tmp3 as(
select
field_b
from table_name
group by field_b
)
, tmp4 as(
select
count(1) as cnt_2
from tmp3
)
select
cnt_1/cnt_2
from tmp2
join tmp4 on 1=1;

Tuesday, June 12, 2018

HiveOnTez: 包冲突问题排查思路


执行tez报错,观察YARN日志,Driver端报错为:File does not exist: hdfs://lt-nameservice3.sy/tmp/hive/app/_tez_session_dir/09ff9062-cc3e-4cb3-bc8d-77c275266d94/.tez/application_1528552108294_273009/tez.session.local-resources.pb java.io.FileNotFoundException
此时进入ApplicationMaster的log观察,根本报错内容如下,显然为guava包冲突导致(guava21以上会移除一些method接口不再向前兼容)。
Caused by: java.lang.NoSuchMethodError: com.google.common.base.Objects.toStringHelper(Ljava/lang/Object;)Lcom/google/common/base/Objects$ToStringHelper;
at org.apache.hadoop.metrics2.lib.MetricsRegistry.toString(MetricsRegistry.java:406)
at java.lang.String.valueOf(String.java:2994)
at java.lang.StringBuilder.append(StringBuilder.java:131)
at org.apache.hadoop.ipc.metrics.RpcMetrics.<init>(RpcMetrics.java:74)
at org.apache.hadoop.ipc.metrics.RpcMetrics.create(RpcMetrics.java:80)
at org.apache.hadoop.ipc.Server.<init>(Server.java:2213)
at org.apache.hadoop.ipc.RPC$Server.<init>(RPC.java:1029)
at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.<init>(ProtobufRpcEngine.java:537)
at org.apache.hadoop.ipc.ProtobufRpcEngine.getServer(ProtobufRpcEngine.java:512)
at org.apache.hadoop.ipc.RPC$Builder.build(RPC.java:874)
at org.apache.tez.dag.api.client.DAGClientServer.createServer(DAGClientServer.java:127)
at org.apache.tez.dag.api.client.DAGClientServer.serviceStart(DAGClientServer.java:79)
at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
at org.apache.tez.dag.app.DAGAppMaster$ServiceWithDependency.start(DAGAppMaster.java:1838)
at org.apache.tez.dag.app.DAGAppMaster$ServiceThread.run(DAGAppMaster.java:1859)

科普下Driver日志和ApplicationMaster日志查看位置:
如图,从YARN页面点击application id进入的页面为Driver页面,红框内容为Driver端报错信息;点击蓝框才会进入ApplicationMaster日志。


此时,在ApplicationMaster中搜索java.class.path可以拿到当前am环境下的所有classpath,从url中获取当前am所在节点,把所有jar包拉出来,找下包版本冲突所在jar包即可。如果为自己jar包里的dependency冲突,或shade,或exclude即可解决。

相关命令如下:

jar -tf platform_udf-1.0-SNAPSHOT.jar | grep -i com.google.common
javap -classpath platform_udf-1.0-SNAPSHOT.jar com.google.common.base.Objects