A Data Mining Approach to Guide the Physical Design of Distributed Big Data Warehouses

Yassine Ramdane, Nadia Kabachi, Omar Boussaid & Fadila Bentayeb

Part of the book series: Studies in Computational Intelligence (SCI, volume 1004)


Abstract

Improving OLAP query performance in distributed systems such as Hadoop and Spark is a challenging task. An OLAP query is composed of several operations, such as projection, filtering, join, and grouping. The star join operation is the most expensive one and usually involves considerable communication cost. The common method for decreasing the network traffic of the star join operation is to co-partition some tables of a data warehouse on their join key. However, in existing data warehouse partitioning schemes this operation still requires many MapReduce cycles. In this paper, we propose a new physical design for distributed big data warehouses over a Hadoop cluster. We propose two methods, called “FKey” and “NewKey”, based on a data mining technique to guide our physical design. Our partitioning and distribution scheme helps the query optimizer build a good query processing plan, such that it can perform the star join operation in a single Spark stage without a shuffle phase. To evaluate our approach, we ran experiments on a cluster of data nodes using the TPC-DS benchmark. The results show that our proposal outperforms existing approaches in terms of query runtime.
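
The general co-partitioning idea mentioned in the abstract can be illustrated with a minimal sketch in plain Python. It is not the “FKey” or “NewKey” method, whose details are not given here, and it does not use Spark or Hadoop. The function names, the `item_id` join key, and the sample rows are all hypothetical. The sketch only shows why placing rows with the same join key in the same partition lets each partition be joined locally, with no data exchanged between partitions.

```python
from collections import defaultdict


def hash_partition(rows, key, num_partitions):
    """Place each row in a partition chosen from its (integer) join key."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[key] % num_partitions].append(row)
    return partitions


def local_join(fact_part, dim_part, key):
    """Join two partitions that live on the same (hypothetical) node."""
    index = {d[key]: d for d in dim_part}
    return [{**f, **index[f[key]]} for f in fact_part if f[key] in index]


def co_partitioned_join(fact, dim, key, num_partitions):
    """Join partition p of the fact table only with partition p of the dimension."""
    fact_parts = hash_partition(fact, key, num_partitions)
    dim_parts = hash_partition(dim, key, num_partitions)
    result = []
    for p in range(num_partitions):
        # Matching keys are guaranteed to share a partition, so no row
        # ever needs to move to another partition (no shuffle).
        result.extend(local_join(fact_parts[p], dim_parts[p], key))
    return result


# Hypothetical sample data: a tiny fact table and one dimension table.
fact = [{"item_id": 1, "qty": 3}, {"item_id": 2, "qty": 5}, {"item_id": 4, "qty": 1}]
dim = [{"item_id": 1, "name": "a"}, {"item_id": 2, "name": "b"}, {"item_id": 3, "name": "c"}]
joined = co_partitioned_join(fact, dim, "item_id", 2)
```

A star join involves several dimension tables with different join keys, so a single partitioning key of this kind does not remove the shuffle for every join; that is the limitation the abstract attributes to existing schemes.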


Notes

1. $MSE=\sum_{j=1}^{k}\sum_{X_i\in C_j} \frac{\Vert X_i-C_j\Vert^{2}}{n}$, where $X_i$ denotes data point locations, i.e. tuples or vectors of the matrix MB, $C_j$ denotes centroid locations, and $n=|MB|$.
2. Available from https://github.com/databricks/spark-sql-perf.
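
As a small worked illustration of the MSE formula in note 1, the following sketch computes it for a toy clustering. The function name `mse`, the sample points, the centroids, and the explicit `assignment` list (the cluster index of each point) are assumptions made for the example; they do not come from the chapter.

```python
def mse(points, centroids, assignment):
    """Sum of squared distances from each point to its centroid, divided by n."""
    n = len(points)
    total = 0.0
    for x, j in zip(points, assignment):
        c = centroids[j]
        # Squared Euclidean norm ||X_i - C_j||^2
        total += sum((xi - ci) ** 2 for xi, ci in zip(x, c))
    return total / n


# Hypothetical data: four 2-D points in two clusters.
points = [(0, 0), (2, 0), (10, 0), (12, 0)]
centroids = [(1, 0), (11, 0)]
assignment = [0, 0, 1, 1]
value = mse(points, centroids, assignment)  # each point is at distance 1, so value is 1.0
```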


Author information

Authors and Affiliations

University of Lyon, Lyon 2, ERIC EA 3083, Lyon, France
Yassine Ramdane, Omar Boussaid & Fadila Bentayeb
University of Lyon, University Claude Bernard Lyon 1, ERIC EA 3083, Lyon, France
Nadia Kabachi


Corresponding author

Correspondence to Yassine Ramdane.

Editor information

Editors and Affiliations

LIASD, Paris 8 University, Saint-Denis, France
Rakia Jaziri
IRISA, University of Rennes, Lannion, France
Arnaud Martin
LIG (CNRS UMR 5217), Université Grenoble Alpes, Grenoble, France
Marie-Christine Rousset
LORIA, Université de Lorraine, Metz, France
Lydia Boudjeloud-Assala
LS2N (CNRS UMR 6004), Nantes University, Nantes, France
Fabrice Guillet


About this chapter

Cite this chapter

Ramdane, Y., Kabachi, N., Boussaid, O., Bentayeb, F. (2022). A Data Mining Approach to Guide the Physical Design of Distributed Big Data Warehouses. In: Jaziri, R., Martin, A., Rousset, MC., Boudjeloud-Assala, L., Guillet, F. (eds) Advances in Knowledge Discovery and Management. Studies in Computational Intelligence, vol 1004. Springer, Cham. https://doi.org/10.1007/978-3-030-90287-2_6


DOI: https://doi.org/10.1007/978-3-030-90287-2_6
Published: 15 March 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-90286-5
Online ISBN: 978-3-030-90287-2
eBook Packages: Intelligent Technologies and Robotics (R0)
