Monday, April 16, 2018

Fixing Hive MapReduce jobs that produce many small or empty files

According to Hive's official documentation on merging files (keyword: hive.merge.*), setting the following four parameters is enough:
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.smallfiles.avgsize=16000000;
set hive.merge.size.per.task=67108864;
Here is a quoted description of a concrete scenario:
By default hive.merge.smallfiles.avgsize=16000000 and hive.merge.size.per.task=256000000, so if the average file size is about 17MB, the merge job will not be triggered. Sometimes, if we really want only 1 file to be generated in the end, we need to increase hive.merge.smallfiles.avgsize to a value large enough to trigger the merge, and we also need to increase hive.merge.size.per.task to get the needed number of files in the end.
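As a rough sketch of the "only 1 file" case described above: the values below are hypothetical and assume the job's total output is comfortably under 1 GB. Raising the average-size threshold above the actual average file size should trigger the merge step, and raising the per-task size above the total output should let the merge produce a single file. Adjust both numbers to your own data volume.

```sql
-- Assumption: total job output is well below 1,000,000,000 bytes.
-- Enable merging for map-only jobs and for map-reduce jobs.
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
-- Trigger the merge whenever the average output file is smaller than ~1 GB
-- (hypothetical value, chosen only to exceed the real average file size).
set hive.merge.smallfiles.avgsize=1000000000;
-- Target size of each merged file; larger than the total output,
-- so the merge is expected to end with one file (hypothetical value).
set hive.merge.size.per.task=1000000000;
```

With the default threshold of 16000000, the same job with 17MB average files would skip the merge entirely, which is the situation the quote describes.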
