Thursday, April 26, 2018

HiveOnSpark series: hive.auto.convert.join issue

The documentation for the hive.auto.convert.join parameter describes it as follows:
Whether Hive enables the optimization about converting common
join into mapjoin based on the input file size
With Hive on Spark, the execution plan is generated at semantic analysis time. As a result, a task whose input files are relatively small but whose intermediate results grow very large can still be forcibly planned as a MapJoin, which produces errors like the following in the YARN task logs:
java.lang.OutOfMemoryError: Java heap space
at org.apache.hadoop.hive.serde2.WriteBuffers.nextBufferToWrite(WriteBuffers.java:246)
at org.apache.hadoop.hive.serde2.WriteBuffers.write(WriteBuffers.java:222)
at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$LazyBinaryKvWriter.writeKey(MapJoinBytesTableContainer.java:250)
at org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap.put(BytesBytesMultiHashMap.java:428)
at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.putRow(MapJoinBytesTableContainer.java:395)
at org.apache.hadoop.hive.ql.exec.persistence.MapJoinTableContainerSerDe.loadOptimized(MapJoinTableContainerSerDe.java:200)
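Whether a given query has been planned this way can be checked up front: prefixing it with EXPLAIN (a rough sketch below, reusing the tbl subqueries from the query that follows) shows whether the plan contains a Map Join Operator or a plain common Join Operator:
explain
select a.dt, b.dt
from (select dt, uid from tbl where dt >= '20180320' group by dt, uid) a
join (select dt, uid from tbl where dt >= '20180320' group by dt, uid) b
  on a.uid = b.uid;
-- A "Map Join Operator" in the printed plan means the build side will be
-- loaded into an in-memory hash table; a "Join Operator" means a shuffle join.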
A query that triggers this looks like the one below. Each subquery returns about 3.6 billion rows (the selected columns themselves are small), but joining the two on uid effectively takes a Cartesian product within each key, so the intermediate result balloons; if the join is still executed as a MapJoin, the task runs out of memory.
select
  a.dt as first_day, b.dt as rent_day
from (
  select dt, uid
  from tbl
  where dt >= '20180320'
  group by dt, uid
) a
join (
  select dt, uid
  from tbl
  where dt >= '20180320'
  group by dt, uid
) b on a.uid = b.uid
limit 100;
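One workaround, assuming the memory blow-up really comes from the forced MapJoin rather than from the data itself, is to disable the automatic conversion for the session, or to lower the size threshold below which Hive treats a table as small enough to broadcast (the threshold value below is only an illustration):
set hive.auto.convert.join=false;
-- or, keep the optimization but tighten the threshold (in bytes):
set hive.auto.convert.join.noconditionaltask.size=10000000;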
