> spark.executor.memory 8g
> spark.executor.cores 4
> spark.dynamicAllocation.enabled true
> spark.dynamicAllocation.initialExecutors 1
> spark.dynamicAllocation.maxExecutors 200
> spark.dynamicAllocation.minExecutors 1
when running SQL as below, it shows that GC time occupies virtually as much as the running time, which, apparently, is due to a general memory issue. Overall, it took 26min to finish.
SELECT a, b, c, d, e,
avg(f/g) AS x1,
percentile(cast(f/g AS bigint), array(0.5, 0.95)) AS x2
FROM (
SELECT a, b, c, d, e,
h, sum(i) AS f, sum(1) AS g
FROM tbl_name
WHERE p_date = '20180303'
GROUP BY a, b, c, d, e,
h
) AS TZ
GROUP BY a, b, c, d, e WITH CUBE
SORT BY a, b DESC, c, d, e;
after updating the following parameters, execution time reduces to 16min and GC time becomes much less dominant.
> set spark.executor.extraJavaOptions=-XX:+UseG1GC;
> set spark.yarn.executor.memoryOverhead=8g;
> set spark.executor.memory=24g;
Yet the following issue is data skew. there's stragglers running for a long time whereas the others have already finished quickly.
This is because of the default value for both `spark.sql.shuffle.partitions` and `spark.default.parallelism`, a small setting value and large amount of data is more likely to lead to data skew and stragglers issue. After updating as follows, execution time goes down to 705s in total.
> set spark.sql.shuffle.partitions=2467;
> set spark.default.parallelism=127;
REFERENCE:
1. https://stackoverflow.com/questions/45704156/what-is-the-difference-between-spark-sql-shuffle-partitions-and-spark-default-pa
2. https://www.jianshu.com/p/06b67a3c61a9
3. https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
4. https://www.ibm.com/support/knowledgecenter/en/SS3H8V_1.1.0/com.ibm.izoda.v1r1.azka100/topics/azkic_t_configmemcpu.htm
5. http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
6. http://www.iteye.com/news/31303
The actual time and effort took to create this wonderful article were really great and thanks for sharing here please do keep updating us...
ReplyDeleteApache Spark Training
Thank you for sharing wonderful information with us to get some idea about that content. check it once through
ReplyDeleteMachine Learning With TensorFlow Training and Course in Tel Aviv
| CPHQ Online Training in Beirut. Get Certified Online
nice article
ReplyDeleteUnicsol offers Best full stack developer course in hyderabadget trained by 10+years of experienced faculty and get placed as full stack developer.
I like your post very much. It is very much useful for my research. I hope you to share more info about this. Keep posting Spark Online Training India
ReplyDeleteNice Blog...
ReplyDeletebitwise aptitude questions
how to hack flipkart legally
zenq interview questions
count ways to n'th stair(order does not matter)
zeus learning subjective test
ajax success redirect to another page with data
l&t type 2 coordination chart
html rollover image
hack android phone using cmd
how to hack internet speed upto 100mbps
useful information..nice..
ReplyDeletedevops-engineer-resume-samples
digital-marketing-resume-samples
digital-marketing-resume-samples
electronics-engineer-resume-sample
engineering-lab-technician-resume-samples
english-teacher-cv-sample
english-teacher-resume-example
english-teacher-resume-sample
excel-expert-resume-sample
executive-secretary-resume-samples
Thanks Mr jasonZhu to share ur experienced knowledge in Spark.
ReplyDeleteI request please start your bigdata knowledge in the form of articles.
its very useful.
Venu spark training provider in Hyderabad
bigdata training in Hyderabad
https://bigdatasolutionss.blogspot.com/2019/01/limitations-of-apache-spark.html
JSTOR - Casino - Las Vegas - JTM Hub
ReplyDeleteJSTOR is the 동해 출장마사지 new casino in the Las Vegas Strip. With 충청북도 출장안마 just over 3,000 slots 천안 출장안마 and hundreds of table 정읍 출장샵 games, 거제 출장샵 JSTOR is a premier gambling venue. JSTOR is located at