Study on Spark performance tuning strategy based on Skewed Partitioning and Locality-aware Partitioning

Guikun Cao, Haiyuan Yu, Liujia Chang, Heng Zhao

Abstract


Apache Spark is a large-scale data processing engine widely used in a variety of big data analysis tasks. However, data
skew and poor data locality can degrade the performance of Spark applications. This paper investigates Spark performance tuning
strategies based on Skewed Partitioning and Locality-aware Partitioning. First, the influence of data skew and data locality problems
in Spark is analyzed; then a performance tuning method combining Skewed Partitioning and Locality-aware Partitioning is proposed.
Experimental results show that, compared with the traditional HashPartitioner, this method can significantly improve the efficiency
of Spark jobs when processing large data sets.
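The abstract does not specify the paper's exact partitioning algorithm. As a hedged illustration only, a minimal sketch of one common skew-mitigation idea behind Skewed Partitioning is key salting: records with known "hot" keys are fanned out over several partitions instead of landing in one. All names here (`skew_aware_partition`, `_stable_hash`, `fanout`) are hypothetical and not taken from the paper, and locality-aware placement (assigning partitions near the nodes that hold the input blocks) is not modeled.

```python
import hashlib
from collections import Counter

def _stable_hash(s: str) -> int:
    # Deterministic hash; Python's built-in hash() is salted per process.
    return int(hashlib.md5(s.encode("utf-8")).hexdigest(), 16)

def skew_aware_partition(key: str, num_partitions: int,
                         hot_keys: set, salt: int = 0, fanout: int = 4) -> int:
    """Assign `key` to a partition. Keys known to be skewed ("hot") are
    salted so their records spread over up to `fanout` partitions."""
    if key in hot_keys:
        return _stable_hash(f"{key}#{salt % fanout}") % num_partitions
    return _stable_hash(key) % num_partitions

# Simulate a skewed distribution: one key dominates the data set.
records = ["hot"] * 1000 + [f"k{i}" for i in range(100)]
hot = {"hot"}

# Plain hash partitioning sends every "hot" record to one partition.
plain = Counter(_stable_hash(k) % 8 for k in records)
# Salting by record index spreads "hot" over up to `fanout` partitions.
salted = Counter(skew_aware_partition(k, 8, hot, salt=i)
                 for i, k in enumerate(records))

print("plain max partition size:", max(plain.values()))
print("salted max partition size:", max(salted.values()))
```

In Spark itself this logic would live in a custom `org.apache.spark.Partitioner` subclass; salting a join key also requires replicating the matching rows on the other side of the join with each salt value, which is a cost the `fanout` parameter trades against balance.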

Keywords


Skewed Partitioning; Locality-aware Partitioning; Performance tuning; Spark; Data skew


References


[1] Wei Jin, Peng Liu, Ru Li. A review of data skew problems based on Spark [J]. Computer Science, 2021, 48(2): 89-97.

[2] Wanhang Xie, Hongwei Yuan, Fanyi Liu, et al. Overview of Spark SQL Optimization Algorithm [J]. Computer Engineering and Design, 2020, 41(1): 7-13.

[3] Weichao Guo, Yong Yang, Bo Pan, et al. Research review of Spark framework in Big Data analysis [J]. Computer Engineering and Design, 2019, 40(5): 1052-1060.

[4] Fei Hu. Research on Optimization of large-scale Data Processing Architecture Based on Spark [J]. Modern Computer, 2018(22): 72-74.

[5] Yichen Fang, Liping Zhang. Review of data analysis and processing methods based on Spark [J]. Well Logging Technology, 2018, 42(1): 69-75.

[6] Hongjie Chen, Yu Huang. A review of data analysis technology based on Apache Spark [J]. Computer and Digital Engineering, 2017, 45(7): 1321-1329.




DOI: https://doi.org/10.18686/esta.v10i4.572



Copyright (c) 2023 Guikun Cao, Haiyuan Yu, Liujia Chang, Heng Zhao