Sunday, April 15, 2018

Spark Parameter Tuning Cases Due to FullGC and Data Skew

The default Spark setting is as follows:

> spark.executor.memory 8g
> spark.executor.cores 4
> spark.dynamicAllocation.enabled true
> spark.dynamicAllocation.initialExecutors 1
> spark.dynamicAllocation.maxExecutors 200
> spark.dynamicAllocation.minExecutors 1

when running SQL as below, it shows that GC time occupies virtually as much as the running time, which, apparently, is due to a general memory issue. Overall, it took 26min to finish.

SELECT a, b, c, d, e,
    avg(f/g) AS x1,
    percentile(cast(f/g AS bigint), array(0.5, 0.95)) AS x2
FROM (
    SELECT a, b, c, d, e,
        h, sum(i) AS f, sum(1) AS g
    FROM tbl_name
    WHERE p_date = '20180303'
    GROUP BY a, b, c, d, e,
        h
) AS TZ
GROUP BY a, b, c, d, e WITH CUBE
SORT BY a, b DESC, c, d, e;


after updating the following parameters, execution time reduces to 16min and GC time becomes much less dominant.

> set spark.executor.extraJavaOptions=-XX:+UseG1GC;
> set spark.yarn.executor.memoryOverhead=8g;
> set spark.executor.memory=24g;

Yet the following issue is data skew. there's stragglers running for a long time whereas the others have already finished quickly.

 This is because of the default value for both `spark.sql.shuffle.partitions` and `spark.default.parallelism`, a small setting value and large amount of data is more likely to lead to data skew and stragglers issue. After updating as follows, execution time goes down to 705s in total.

> set spark.sql.shuffle.partitions=2467;
> set spark.default.parallelism=127;


REFERENCE:
1. https://stackoverflow.com/questions/45704156/what-is-the-difference-between-spark-sql-shuffle-partitions-and-spark-default-pa
2. https://www.jianshu.com/p/06b67a3c61a9
3. https://databricks.com/blog/2015/05/28/tuning-java-garbage-collection-for-spark-applications.html
4. https://www.ibm.com/support/knowledgecenter/en/SS3H8V_1.1.0/com.ibm.izoda.v1r1.azka100/topics/azkic_t_configmemcpu.htm
5. http://spark.apache.org/docs/latest/tuning.html#garbage-collection-tuning
6. http://www.iteye.com/news/31303





8 comments:

  1. The actual time and effort took to create this wonderful article were really great and thanks for sharing here please do keep updating us...
    Apache Spark Training

    ReplyDelete
  2. Thank you for sharing wonderful information with us to get some idea about that content. check it once through
    Machine Learning With TensorFlow Training and Course in Tel Aviv
    | CPHQ Online Training in Beirut. Get Certified Online

    ReplyDelete
  3. nice article
    Unicsol offers Best full stack developer course in hyderabadget trained by 10+years of experienced faculty and get placed as full stack developer.

    ReplyDelete
  4. I like your post very much. It is very much useful for my research. I hope you to share more info about this. Keep posting Spark Online Training India

    ReplyDelete
  5. Thanks Mr jasonZhu to share ur experienced knowledge in Spark.
    I request please start your bigdata knowledge in the form of articles.
    its very useful.
    Venu spark training provider in Hyderabad


    bigdata training in Hyderabad
    https://bigdatasolutionss.blogspot.com/2019/01/limitations-of-apache-spark.html

    ReplyDelete
  6. JSTOR - Casino - Las Vegas - JTM Hub
    JSTOR is the 동해 출장마사지 new casino in the Las Vegas Strip. With 충청북도 출장안마 just over 3,000 slots 천안 출장안마 and hundreds of table 정읍 출장샵 games, 거제 출장샵 JSTOR is a premier gambling venue. JSTOR is located at

    ReplyDelete