Thursday, June 28, 2018

多个count(distinct)导致data skew的优化策略

hive中对于count(distinct)的执行逻辑大体是,在mapper端用HashSet将key去重后,全部发送给1个reducer再做去重,这样的问题在于会有单点问题。如果只有一个count(distinct),则通过设置SET hive.groupby.skewindata = true;可以使执行逻辑自动优化避免data skew. 但如果有两个及以上则上述参数不会起作用。
具体case:
如下sql每天会执行11h,切换成优化后的select count(1) from (select * from ... group by ...) tmp之后,只需要28min即可。
--before optimization
select
COUNT(DISTINCT (CASE WHEN field_a = 1 THEN field_b ELSE NULL END)) / COUNT(DISTINCT field_b)
from
table_name
where
label_spam_user = 0
--after optimization
with tmp1 as(
select
field_b
from table_name
where field_a = 1
group by field_b
)
, tmp2 as(
select
count(1) as cnt_1
from tmp1
)
, tmp3 as(
select
field_b
from table_name
group by field_b
)
, tmp4 as(
select
count(1) as cnt_2
from tmp3
)
select
cnt_1/cnt_2
from tmp2
join tmp4 on 1=1;

16 comments:

  1. I would really like to read some personal experiences like the way, you've explained through the above article. I'm glad for your achievements and would probably like to see much more in the near future. Thanks for share.
    Python training in marathahalli | Python training institute in pune

    ReplyDelete
  2. It would have been the happiest moment for you,I mean if we have been waiting for something to happen and when it happens we forgot all hardwork and wait for getting that happened.
    Java training in Marathahalli | Java training in Btm layout

    Java training in Marathahalli | Java training in Btm layout

    ReplyDelete
  3. Thank you so much for ding the impressive job here, everyone will surely like your post. Order Form

    ReplyDelete
  4. You’ve written a really great article here. Your writing style makes this material easy to understand.. I agree with some of the many points you have made. Thank you for this is real thought-provoking content
    Microsoft Azure online training
    Selenium online training
    Java online training
    Python online training
    uipath online training

    ReplyDelete
  5. This comment has been removed by the author.

    ReplyDelete
  6. This comment has been removed by the author.

    ReplyDelete
  7. This comment has been removed by the author.

    ReplyDelete
  8. I perceived a lots of Knowledge from this Resource. This is a Good way to Circulate the Educated Things...Hope people will also Like the Below Information's also...
    Looking For BEST JAVA TRAINING IN CHENNAI WITH PLACEMENT Visit Below...
    Java training in chennai | Java training in annanagar | Java training in omr | Java training in porur | Java training in tambaram | Java training in velachery

    ReplyDelete
  9. I feel very grateful that I read this. It is very helpful and very informative and I really learned a lot from it.

    MLSU BA 1st Year Exam Result
    MLSU BA 2nd Year Exam Result
    MLSU BA 3rd Year Exam Result

    ReplyDelete

  10. Great post. keep sharing such a worthy information.
    Python Institute In Chennai

    ReplyDelete
  11. Superb Information, I really appreciated with it, This is fine to read and valuable pro potential, I really bookmark it, pro broaden read. Appreciation pro sharing. I like it.
    Best React-js Training Institute in Hyderabad

    ReplyDelete
  12. Great insights on optimizing the COUNT DISTINCT function in Hive SQL when dealing with data skew. This is a common issue in big data processing, and your solution can greatly improve query performance. It's fascinating how optimizing data operations is critical, much like optimizing digital marketing strategies for better ROI. If anyone is looking to enhance their skills, consider exploring Digiperform's link text for practical insights into data-driven marketing techniques. Thanks for sharing these valuable tips!

    ReplyDelete