The default Spark settings are as follows:
> spark.executor.memory=8g
> spark.executor.cores=4
> spark.dynamicAllocation.enabled=true
> spark.dynamicAllocation.initialExecutors=1
> spark.dynamicAllocation.maxExecutors=200
> spark.dynamicAllocation.minExecutors=1
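To get a feel for what these defaults allow, the upper bound under dynamic allocation can be worked out directly. The following is a minimal Python sketch using only the values listed above. The helper name `resource_envelope` is made up for illustration, and the heap-per-task figure is only a rough average, since Spark does not hand the whole heap to tasks.

```python
def resource_envelope(executor_memory_gb, executor_cores, max_executors):
    """Upper bound on heap and task slots when dynamic allocation scales out fully."""
    return {
        # One task runs per core, so this is the maximum number of concurrent tasks.
        "max_task_slots": executor_cores * max_executors,
        # Total JVM heap across all executors at maximum scale-out.
        "max_heap_gb": executor_memory_gb * max_executors,
        # Rough average heap available to each concurrently running task.
        "heap_per_task_gb": executor_memory_gb / executor_cores,
    }


defaults = resource_envelope(executor_memory_gb=8, executor_cores=4, max_executors=200)
print(defaults)
```

With the defaults, each of the four concurrent tasks in an executor shares an 8g heap, which is roughly 2g per task before any Spark-internal reservations.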
When running the SQL below, GC time is almost as large as the total running time, which apparently points to a general memory issue: the executors are short on memory. Overall, the query took 26 min to finish.
SELECT a, b, c, d, e,
  avg(f/g) AS x1,
  percentile(cast(f/g AS bigint), array(0.5, 0.95)) AS x2
FROM (
  SELECT a, b, c, d, e, h,
    sum(i) AS f, sum(1) AS g
  FROM tbl_name
  WHERE p_date = '20180303'
  GROUP BY a, b, c, d, e, h
) AS TZ
GROUP BY a, b, c, d, e WITH CUBE
SORT BY a, b DESC, c, d, e;
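One thing that may contribute to the memory pressure is the shape of the query itself: `WITH CUBE` over five columns aggregates over every subset of those columns. The sketch below simply enumerates those subsets in Python to show how many there are. The function name `cube_grouping_sets` is hypothetical, and the sketch says nothing about how Spark physically executes the cube.

```python
from itertools import combinations


def cube_grouping_sets(columns):
    """Return every grouping set implied by GROUP BY <columns> WITH CUBE."""
    sets = []
    # Walk from the full column list down to the empty (grand total) set.
    for size in range(len(columns), -1, -1):
        sets.extend(combinations(columns, size))
    return sets


grouping_sets = cube_grouping_sets(["a", "b", "c", "d", "e"])
print(len(grouping_sets))  # 2 ** 5 grouping sets
print(grouping_sets[0])    # the full set of columns
print(grouping_sets[-1])   # the empty set, i.e. the grand total
```

Five cube columns give 2^5 = 32 grouping sets, so the outer aggregation does noticeably more work than a plain `GROUP BY a, b, c, d, e` would.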
After updating the following parameters, execution time drops to 16 min and GC time becomes much less dominant.
> set spark.executor.extraJavaOptions=-XX:+UseG1GC;
> set spark.yarn.executor.memoryOverhead=8g;
> set spark.executor.memory=24g;
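To see how much memory each executor now requests from YARN, the heap and the overhead can be added up. The sketch below assumes the first run did not set the overhead explicitly (the original settings do not list it), so it falls back to Spark's documented default of 10% of executor memory with a 384 MB floor. It also ignores any rounding YARN applies to container sizes. The helper name `container_memory_mb` is made up for illustration.

```python
MIN_OVERHEAD_MB = 384   # Spark's documented minimum memory overhead
OVERHEAD_FACTOR = 0.10  # Spark's documented default overhead fraction


def container_memory_mb(executor_memory_mb, overhead_mb=None):
    """Approximate memory requested per executor: heap plus off-heap overhead."""
    if overhead_mb is None:
        # Assumption: overhead was left at its default in the first run.
        overhead_mb = max(MIN_OVERHEAD_MB, int(executor_memory_mb * OVERHEAD_FACTOR))
    return executor_memory_mb + overhead_mb


before = container_memory_mb(8 * 1024)             # 8g heap, default overhead
after = container_memory_mb(24 * 1024, 8 * 1024)   # 24g heap, 8g overhead
print(before, after, f"{after / before:.1f}x")
```

Under these assumptions each executor goes from roughly 9 GB to 32 GB, so the same `maxExecutors` value can claim far more cluster memory than before.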
The next issue is data skew: a few straggler tasks run for a long time while the others have already finished.
This is caused by the default values of both `spark.sql.shuffle.partitions` (200 by default) and `spark.default.parallelism`. With a small partition count and a large amount of data, each partition holds a lot of data, which makes data skew and stragglers more likely. After updating them as follows, total execution time goes down to 705 s.
> set spark.sql.shuffle.partitions=2467;
> set spark.default.parallelism=127;
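The effect of the partition count can be illustrated with simple arithmetic. In the sketch below, the total shuffle size of 500 GB is a purely hypothetical figure (the original job's shuffle volume is not known), the task-slot count comes from the 4 cores and 200 maximum executors listed earlier, and `shuffle_plan` is a made-up helper. It only computes averages, so it does not model a single hot key, which more partitions cannot split.

```python
import math


def shuffle_plan(total_shuffle_gb, partitions, task_slots):
    """Average shuffle data per partition and number of task waves needed."""
    return {
        "gb_per_partition": total_shuffle_gb / partitions,
        # Tasks run in waves when there are more partitions than slots.
        "waves": math.ceil(partitions / task_slots),
    }


TOTAL_SHUFFLE_GB = 500  # hypothetical value, not taken from the original job
TASK_SLOTS = 4 * 200    # executor cores x maxExecutors

default_plan = shuffle_plan(TOTAL_SHUFFLE_GB, 200, TASK_SLOTS)
tuned_plan = shuffle_plan(TOTAL_SHUFFLE_GB, 2467, TASK_SLOTS)
print(default_plan)
print(tuned_plan)
```

With 200 partitions every task handles a large chunk and most task slots sit idle. With 2467 partitions each task is much smaller, and the work is spread over several waves, so one slow task holds up the stage for less time.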
REFERENCES:
1. https://stackoverflow.com/questions/45704156/what-is-the-difference-between-spark-sql-shuffle-partitions-and-spark-default-pa
2. https://www.jianshu.com/p/06b67a3c61a9
3. https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
4. https://www.ibm.com/support/knowledgecenter/en/SS3H8V_1.1.0/com.ibm.izoda.v1r1.azka100/topics/azkic_t_configmemcpu.htm
5. http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
6. http://www.iteye.com/news/31303