Abstract
Improving OLAP query performance in distributed systems such as Hadoop and Spark is a challenging task. An OLAP query is composed of several operations, such as projection, filtering, join, and grouping. The star join operation is the most expensive of these and usually involves considerable communication cost. The common method for decreasing the network traffic of the star join operation is to co-partition some tables of a data warehouse on their join key. However, under existing data warehouse partitioning schemes, this operation still requires many MapReduce cycles. In this paper, we propose a new physical design for distributed big data warehouses over a Hadoop cluster. We propose two methods, called “FKey” and “NewKey”, based on a data mining technique to guide our physical design. Our partitioning and distribution scheme helps the query optimizer build a good query processing plan, so that the star join operation can be performed in a single Spark stage without a shuffle phase. To evaluate our approach, we ran experiments on a cluster of data nodes using the TPC-DS benchmark. The results show that our proposal outperforms existing approaches in terms of query runtime.
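The co-partitioning idea mentioned in the abstract can be illustrated with a minimal, single-process sketch. It is not the "FKey" or "NewKey" method, whose details are not given here; it only shows why placing fact and dimension rows with the same join key in the same partition lets each partition be joined locally, with no data exchanged between partitions. The function names, the modulo placement rule, and the `item_id` column are assumptions made for illustration.

```python
def partition_by_key(rows, key, num_partitions):
    """Place each row in a partition chosen from its (integer) join key."""
    parts = [[] for _ in range(num_partitions)]
    for row in rows:
        parts[row[key] % num_partitions].append(row)
    return parts


def local_join(fact_part, dim_part, key):
    """Hash join of two partitions that live on the same (hypothetical) node."""
    index = {d[key]: d for d in dim_part}
    return [{**f, **index[f[key]]} for f in fact_part if f[key] in index]


def co_partitioned_join(fact, dim, key, num_partitions):
    """Join partition i of the fact table only with partition i of the dimension.

    Because both tables use the same placement rule, matching rows are
    guaranteed to share a partition, so no cross-partition lookup is needed.
    """
    fact_parts = partition_by_key(fact, key, num_partitions)
    dim_parts = partition_by_key(dim, key, num_partitions)
    result = []
    for fact_part, dim_part in zip(fact_parts, dim_parts):
        result.extend(local_join(fact_part, dim_part, key))
    return result


# Hypothetical toy data: a sales fact table and an item dimension table.
sales = [
    {"item_id": 1, "qty": 2},
    {"item_id": 2, "qty": 5},
    {"item_id": 4, "qty": 1},
    {"item_id": 1, "qty": 7},
]
items = [
    {"item_id": 1, "name": "pen"},
    {"item_id": 2, "name": "ink"},
    {"item_id": 4, "name": "pad"},
]
joined = co_partitioned_join(sales, items, "item_id", num_partitions=3)
```

In an engine such as Spark, the same effect is usually sought by storing both tables bucketed on the join key, so that the planner can skip the shuffle; the sketch above only mimics the placement logic.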
Notes
1. \(MSE=\sum _{j=1}^k\sum _{X_i\in C_j} \frac{\Vert X_i-C_j\Vert ^{2}}{n}\), where \(X_i\) denotes data point locations, i.e. tuples or vectors of the matrix MB, \(C_j\) denotes centroid locations, and \(n=|MB|\).
2. Available from https://github.com/databricks/spark-sql-perf.
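The MSE defined in note 1 can be computed directly from a set of points, their cluster assignments, and the centroids. The sketch below is a plain-Python illustration of that formula only; the function name `mse`, the list-based cluster assignment, and the toy data are assumptions and are not taken from the chapter.

```python
def mse(points, centroids, assignment):
    """Mean squared error of a clustering, following note 1.

    points:     list of equal-length numeric tuples (the rows of MB)
    centroids:  list of centroid tuples, one per cluster
    assignment: assignment[i] is the index j of the cluster holding points[i]
    """
    n = len(points)
    total = 0.0
    for point, j in zip(points, assignment):
        centroid = centroids[j]
        # Squared Euclidean distance between the point and its centroid.
        total += sum((a - b) ** 2 for a, b in zip(point, centroid))
    return total / n


# Hypothetical example: two clusters in the plane.
points = [(0.0, 0.0), (2.0, 0.0), (10.0, 10.0)]
centroids = [(1.0, 0.0), (10.0, 10.0)]
assignment = [0, 0, 1]
error = mse(points, centroids, assignment)
```

Dividing by the total number of points \(n\), rather than by each cluster's size, matches the formula in the note.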
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this chapter
Cite this chapter
Ramdane, Y., Kabachi, N., Boussaid, O., Bentayeb, F. (2022). A Data Mining Approach to Guide the Physical Design of Distributed Big Data Warehouses. In: Jaziri, R., Martin, A., Rousset, MC., Boudjeloud-Assala, L., Guillet, F. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 1004. Springer, Cham. https://doi.org/10.1007/978-3-030-90287-2_6
DOI: https://doi.org/10.1007/978-3-030-90287-2_6
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90286-5
Online ISBN: 978-3-030-90287-2
eBook Packages: Intelligent Technologies and Robotics (R0)