Jason4Zhu: Optimizing data skew caused by multiple count(distinct)

Thursday, June 28, 2018

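In Hive, count(distinct) is executed roughly as follows: each mapper deduplicates its keys with a HashSet, then all of the results are sent to a single reducer, which deduplicates them again. The problem is that this one reducer becomes a single-point bottleneck. If the query contains only one count(distinct), setting SET hive.groupby.skewindata = true; lets Hive rewrite the execution plan automatically and avoid the data skew. With two or more count(distinct) expressions, however, that setting has no effect.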
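To make the single-reducer bottleneck concrete, here is a toy Python model of the behavior described above. It is only a sketch, not Hive's actual implementation: the function names, the number of mappers and reducers, and the hash partitioning scheme are all assumptions made for illustration.

```python
from collections import Counter


def map_side_dedup(rows, num_mappers):
    """Split rows across mappers; each mapper deduplicates its share with a set."""
    return [set(rows[i::num_mappers]) for i in range(num_mappers)]


def single_reducer_load(rows, num_mappers, num_reducers):
    """Global count(distinct): there is no group-by key, so every
    map-side-deduplicated value is routed to one and the same reducer."""
    loads = [0] * num_reducers
    loads[0] = sum(len(partial) for partial in map_side_dedup(rows, num_mappers))
    return loads


def group_by_load(rows, num_mappers, num_reducers):
    """group by field: each value is routed by a hash of the key, so the
    deduplication work is spread across all reducers."""
    loads = Counter()
    for partial in map_side_dedup(rows, num_mappers):
        for value in partial:
            loads[hash(value) % num_reducers] += 1
    return [loads.get(i, 0) for i in range(num_reducers)]


if __name__ == "__main__":
    rows = [i % 1000 for i in range(10000)]  # 1000 distinct keys, hypothetical data
    print("count(distinct):", single_reducer_load(rows, 2, 4))
    print("group by:       ", group_by_load(rows, 2, 4))
```

In this model the count(distinct) plan puts every record on reducer 0, while the group by plan splits the same records evenly. That is the effect the rewrite in this post relies on.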

A concrete case:

The SQL below took 11 h on each daily run. After rewriting it in the form select count(1) from (select * from ... group by ...) tmp, it takes only 28 min.


--before optimization
select
  COUNT(DISTINCT (CASE WHEN field_a = 1 THEN field_b ELSE NULL END)) / COUNT(DISTINCT field_b)
from
  table_name
where
  label_spam_user = 0

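--after optimization
with tmp1 as(
    -- distinct field_b values where field_a = 1 (numerator)
    select
        field_b
    from table_name
    where label_spam_user = 0      -- same filter as the original query
      and field_a = 1
      and field_b is not null      -- count(distinct) ignores NULL; group by would keep a NULL group
    group by field_b
)
, tmp2 as(
    select
        count(1) as cnt_1
    from tmp1
)
, tmp3 as(
    -- all distinct field_b values (denominator)
    select
        field_b
    from table_name
    where label_spam_user = 0      -- same filter as the original query
      and field_b is not null
    group by field_b
)
, tmp4 as(
    select
        count(1) as cnt_2
    from tmp3
)
-- tmp2 and tmp4 each hold exactly one row, so the join yields a single row
select
    cnt_1/cnt_2
from tmp2
join tmp4 on 1=1;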
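Before swapping a query like this in production, it is worth checking on a small sample that both forms return the same ratio. The sketch below does so with Python's built-in sqlite3 module rather than Hive, which is an assumption made purely so the check is runnable anywhere. The sample rows are hypothetical. The rewritten query carries over the label_spam_user = 0 filter and excludes NULL field_b values, because count(distinct) ignores NULL while group by keeps a NULL group. The 1.0 * factor is there only because SQLite performs integer division on integers, whereas Hive's / returns a double.

```python
import sqlite3

# Original form: two count(distinct) expressions in one select.
BEFORE = """
select 1.0 * count(distinct case when field_a = 1 then field_b else null end)
       / count(distinct field_b)
from table_name
where label_spam_user = 0
"""

# Rewritten form: group by to deduplicate, then count(1) over the groups.
AFTER = """
with tmp1 as (
    select field_b from table_name
    where label_spam_user = 0 and field_a = 1 and field_b is not null
    group by field_b
), tmp2 as (
    select count(1) as cnt_1 from tmp1
), tmp3 as (
    select field_b from table_name
    where label_spam_user = 0 and field_b is not null
    group by field_b
), tmp4 as (
    select count(1) as cnt_2 from tmp3
)
select 1.0 * cnt_1 / cnt_2
from tmp2
join tmp4 on 1 = 1
"""


def build_sample_db():
    """Create an in-memory table with a few hypothetical rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "create table table_name "
        "(field_a integer, field_b integer, label_spam_user integer)"
    )
    rows = [
        (1, 10, 0), (1, 10, 0), (0, 10, 0),  # duplicates of the same field_b
        (1, 20, 0),
        (0, 30, 0),
        (0, 40, 0),
        (1, None, 0),                        # NULL field_b must not be counted
        (1, 50, 1),                          # spam user, filtered out
    ]
    conn.executemany("insert into table_name values (?, ?, ?)", rows)
    return conn


def ratio(conn, sql):
    """Run a query that returns a single value and return that value."""
    return conn.execute(sql).fetchone()[0]


if __name__ == "__main__":
    conn = build_sample_db()
    assert ratio(conn, BEFORE) == ratio(conn, AFTER)
    print(ratio(conn, BEFORE), ratio(conn, AFTER))
```

If the filter or the NULL handling differed between the two forms, the spam row and the NULL row in the sample would make the two ratios diverge, so they are useful rows to keep in any sanity check of this kind.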
